This is the most misunderstood graph in AI
TL;DR
METR (formerly ARC Evals) is the benchmark org that tests new frontier models from OpenAI, Google, and Anthropic for dangerous capabilities—before they ship.
Key Points
- Their most famous output: a bar chart showing how many autonomous replication and hacking tasks a model can solve. The AI community systematically misreads it.
- The chart doesn't show whether a model *is* dangerous, only whether it can complete certain sub-tasks—without context on success rate, cost, or real-world threat.
- METR itself warns: the graph is a research snapshot, not a safety certificate. Media and hype accounts ignore that.
Nauti's Take
The problem isn't the chart—it's that nobody reads the footnotes. METR does transparent research, but media and Twitter threads reduce complex evals to "Model X is safe" or "Model Y is dangerous".
That's bullshit. A bar at 60% says nothing about cost, repeatability, or whether an attacker can actually exploit it.
As long as we treat benchmarks like sports stats, the debate stays shallow. METR provides raw data—the rest is interpretive work almost no one does.