AI benchmarks are broken. Here’s what we need instead.
TL;DR
AI models have long been evaluated by whether they beat individual humans on isolated tasks – chess, math, coding, essay writing. That framing is wearing thin: benchmarks saturate quickly and say little about how AI holds up in real, complex work.
Key Points
- The 'AI vs. human' framing is catchy but misleading: it does not capture how AI performs in real, complex work environments.
- Current benchmarks saturate fast: once a model tops a leaderboard, a new test is needed – yet topping the chart rarely reflects a genuine capability gain.
- Researchers are calling for evaluation frameworks that measure system performance in real-world workflows, not point-in-time tasks against a human baseline.
Nauti's Take
It is an open secret in the AI industry: benchmarks get optimized, not capabilities. Models are trained or fine-tuned against test sets until the numbers shine; what follows in production is often disillusionment.
The 'beats humans' narrative is marketing, not measurement. What is genuinely missing are evaluations that assess whether AI systems perform reliably in specific professional contexts over weeks – not whether a model passes the SAT on a Tuesday.
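To make the contrast concrete, here is a minimal sketch of what a longitudinal evaluation could look like, as opposed to a single point-in-time score. Everything in it is illustrative: `run_workflow` is a coin-flip stand-in for a real multi-step task, and the thresholds and session counts are invented for the example, not taken from any existing framework.

```python
import random
import statistics

def run_workflow(seed: int) -> bool:
    """Stand-in for one real-world task attempt (hypothetical).

    A real harness would execute an actual workflow (e.g. triage a
    ticket, draft and file a report) and check the outcome; here we
    simulate an agent that succeeds roughly 80% of the time.
    """
    rng = random.Random(seed)
    return rng.random() < 0.8

def longitudinal_eval(days: int = 30, runs_per_day: int = 10) -> dict:
    """Aggregate success over many sessions instead of one benchmark pass."""
    daily_rates = []
    for day in range(days):
        outcomes = [run_workflow(day * runs_per_day + i)
                    for i in range(runs_per_day)]
        daily_rates.append(sum(outcomes) / runs_per_day)
    return {
        "mean_success": statistics.mean(daily_rates),
        "worst_day": min(daily_rates),            # the reliability floor
        "stdev": statistics.pstdev(daily_rates),  # consistency over time
    }

report = longitudinal_eval()
print(report)
```

The point of the sketch is the shape of the report: a mean score alone looks like a leaderboard number, while the worst-day floor and the day-to-day variance are what tell you whether the system can be trusted in a workflow for weeks on end.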