AI benchmarks are broken. Here’s what we need instead.
TL;DR
AI models have long been evaluated by whether they beat individual humans on isolated tasks – chess, math, coding, essay writing. Researchers now argue that this framing is broken, and that evaluation should instead measure how reliably AI systems perform in real-world workflows.
Key Points
- The 'AI vs. human' framing is catchy but misleading: it does not capture how AI performs in real, complex work environments.
- Current benchmarks saturate fast – once a model tops a leaderboard, a new test has to be devised, and the constant churn says little about genuine capability gains.
- Researchers are calling for evaluation frameworks that measure system performance in real-world workflows, not point-in-time tasks against a human baseline.
Nauti's Take
It is an open secret in the AI industry: benchmarks get optimized, not capabilities. Models are trained or fine-tuned against test sets until the numbers shine – and in practice, disillusionment often follows.
The 'beats humans' narrative is marketing, not measurement. What is genuinely missing are evaluations that assess whether an AI system performs reliably in a specific professional context over weeks – not whether a model can pass the SAT on a Tuesday.
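To make the contrast concrete, here is a minimal sketch of what such a longitudinal, workflow-level evaluation could look like. Everything in it – `WorkflowTask`, `evaluate_longitudinally`, the acceptance check, the flaky stand-in model – is a hypothetical illustration, not an existing framework. The idea is simply that a task only counts when every step succeeds, and that reliability is measured across repeated sessions rather than from a single run:

```python
from dataclasses import dataclass
from typing import Callable
import random


@dataclass
class WorkflowTask:
    """A multi-step job (e.g. 'summarize a support ticket') with a domain check."""
    name: str
    steps: list[str]
    check: Callable[[str], bool]


def evaluate_longitudinally(run_model, tasks, sessions=20):
    """Score reliability across repeated sessions instead of one benchmark run.

    A task passes a session only if *every* step's output is accepted,
    mirroring how a single failure voids a real workflow.
    """
    per_task = {}
    for task in tasks:
        passes = 0
        for _ in range(sessions):
            outputs = [run_model(step) for step in task.steps]
            if all(task.check(out) for out in outputs):
                passes += 1
        per_task[task.name] = passes / sessions
    # Report the worst task as well: one unreliable workflow breaks user trust.
    return per_task, min(per_task.values())


if __name__ == "__main__":
    # Stand-in for a model that looks great per call (90% success) but fails
    # multi-step workflows far more often than a single-number score suggests.
    flaky_model = lambda prompt: "ok" if random.random() < 0.9 else "error"
    tasks = [WorkflowTask("ticket-summary",
                          steps=["read ticket", "draft summary", "set priority"],
                          check=lambda out: out == "ok")]
    scores, worst = evaluate_longitudinally(flaky_model, tasks)
    print(scores, "worst:", worst)
```

The per-session, all-steps criterion is the point: a model that scores 90% on each isolated call passes this three-step workflow only about 73% of the time, which is the kind of gap a leaderboard number hides.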
Context
If benchmarks are broken, so are the investment decisions, regulatory approaches, and product promises built on top of them. Companies buy AI systems based on leaderboard rankings that say little about real-world utility. A new approach to evaluation would shift the focus from the competition narrative to actual effectiveness – and with it, the broader debate about what AI can genuinely do.