AI benchmarks are broken. Here’s what we need instead.
TL;DR
AI models have long been evaluated by whether they beat individual humans on isolated tasks – chess, math, coding, essay writing. Researchers now argue that this framing is broken, and that evaluation should instead measure how reliably AI systems perform in real-world workflows.
Key Points
- The 'AI vs. human' framing is catchy but misleading: it does not capture how AI performs in real, complex work environments.
- Current benchmarks saturate fast – once a model tops a leaderboard, a new test has to be devised, and the constant churn says little about genuine capability gains.
- Researchers are calling for evaluation frameworks that measure system performance in real-world workflows, not point-in-time tasks against a human baseline.
Nauti's Take
It is an open secret in the AI industry: benchmarks get optimized, not capabilities. Models are trained or fine-tuned against test sets until the numbers shine – and in practice, disillusionment often follows.
The 'beats humans' narrative is marketing, not measurement. What is genuinely missing are evaluations that assess whether an AI system performs reliably in a specific professional context over weeks – not whether a model can pass the SAT on a Tuesday.
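To make the contrast concrete, here is a minimal sketch of what such a longitudinal, workflow-level evaluation could look like. Everything in it – `WorkflowTask`, `evaluate_longitudinally`, the acceptance check, the flaky stand-in model – is a hypothetical illustration, not an existing framework. The idea is simply that a task only counts when every step succeeds, and that reliability is measured across repeated sessions rather than from a single run:

```python
from dataclasses import dataclass
from typing import Callable
import random


@dataclass
class WorkflowTask:
    """A multi-step job (e.g. 'summarize a support ticket') with a domain check."""
    name: str
    steps: list[str]
    check: Callable[[str], bool]


def evaluate_longitudinally(run_model, tasks, sessions=20):
    """Score reliability across repeated sessions instead of one benchmark run.

    A task passes a session only if *every* step's output is accepted,
    mirroring how a single failure voids a real workflow.
    """
    per_task = {}
    for task in tasks:
        passes = 0
        for _ in range(sessions):
            outputs = [run_model(step) for step in task.steps]
            if all(task.check(out) for out in outputs):
                passes += 1
        per_task[task.name] = passes / sessions
    # Report the worst task as well: one unreliable workflow breaks user trust.
    return per_task, min(per_task.values())


if __name__ == "__main__":
    # Stand-in for a model that looks great per call (90% success) but fails
    # multi-step workflows far more often than a single-number score suggests.
    flaky_model = lambda prompt: "ok" if random.random() < 0.9 else "error"
    tasks = [WorkflowTask("ticket-summary",
                          steps=["read ticket", "draft summary", "set priority"],
                          check=lambda out: out == "ok")]
    scores, worst = evaluate_longitudinally(flaky_model, tasks)
    print(scores, "worst:", worst)
```

The per-session, all-steps criterion is the point: a model that scores 90% on each isolated call passes this three-step workflow only about 73% of the time, which is the kind of gap a leaderboard number hides.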
Context
If benchmarks are broken, so are the investment decisions, regulatory approaches, and product promises built on top of them. Companies buy AI systems based on leaderboard rankings that say little about real-world utility. A new approach to evaluation would shift the focus from the competition narrative to actual effectiveness – and with it, the broader debate about what AI can genuinely do.