Why Are Large Language Models so Terrible at Video Games?
TL;DR
LLMs have barely improved at video games despite rapid progress elsewhere; a rare exception came in May 2025, when Gemini 2.5 Pro beat Pokémon Blue.
Key Points
- That win came with caveats: far slower than a human player, bizarre repetitive mistakes, and reliance on custom scaffolding software (see the sketch after these points).
- Julian Togelius, director of NYU's Game Innovation Lab and co-founder of AI testing firm Modl.ai, examined these limitations in a recent paper.
- Games demand real-time decisions, spatial reasoning, and long-horizon planning – precisely the areas where current language models fall short.
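The "custom scaffolding" mentioned above is harness code that turns a text-only model into something that can press buttons: it reads game state, builds a prompt, and sanitizes the model's reply into a legal input. A minimal sketch of that kind of loop follows; every name in it (read_game_state, query_llm, press_button) is a hypothetical placeholder, not the actual harness used in the Gemini run.

```python
# Minimal sketch of LLM game-playing scaffolding.
# All helpers below are hypothetical stubs for illustration only.

VALID_BUTTONS = {"up", "down", "left", "right", "a", "b", "start"}

def read_game_state() -> str:
    """Placeholder: summarize the current screen or RAM as text."""
    return "Player is in Pallet Town, facing north. Menu closed."

def query_llm(prompt: str) -> str:
    """Placeholder: send the prompt to a language model, return its reply."""
    return "up"

def press_button(button: str) -> None:
    """Placeholder: forward the chosen button press to an emulator."""
    print(f"pressing {button}")

def play(max_steps: int = 10) -> None:
    history: list[str] = []  # the scaffolding, not the model, keeps memory
    for _ in range(max_steps):
        state = read_game_state()
        prompt = (
            "You are playing Pokémon Blue.\n"
            f"Recent actions: {history[-20:]}\n"
            f"Current state: {state}\n"
            f"Reply with exactly one button from {sorted(VALID_BUTTONS)}."
        )
        reply = query_llm(prompt).strip().lower()
        # Sanitize the model's free-form text into a legal game input.
        button = reply if reply in VALID_BUTTONS else "a"
        press_button(button)
        history.append(button)

if __name__ == "__main__":
    play()
```

The point of the sketch is how much of the work happens outside the model: the harness tracks history, formats state, and enforces valid moves, which is exactly why the win comes with an asterisk.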
Nauti's Take
The Pokémon Blue win sounds impressive until you read that the model was slower than a first-grader with a Game Boy and kept repeating the same mistakes. That is not a breakthrough – it is a well-documented failure with an asterisk.
Togelius is right: LLMs are text machines optimized for token probabilities, not game objectives. Spatial memory, long-horizon state tracking, reactive decision-making – these are not problems you solve by adding more parameters.
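A standard way to state that mismatch (generic notation, not taken from Togelius's paper): pretraining maximizes the log-likelihood of the next token over a text corpus, while playing a game well means maximizing expected return over a long horizon under a policy.

```latex
% Pretraining objective: next-token log-likelihood
\max_{\theta} \; \sum_{t} \log p_{\theta}(x_t \mid x_{<t})
\qquad\text{vs.}\qquad
% Game-playing objective: expected discounted return of a policy \pi
\max_{\pi} \; \mathbb{E}_{\pi}\!\left[\sum_{t=0}^{T} \gamma^{t} r_t\right]
```

Scaling the first objective does not, by itself, optimize the second.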
Anyone expecting GPT-5 to simply get better at games is missing the point entirely.
Context
Video games are considered an ideal testbed for general AI capability because they combine clear rules, measurable progress, and complex decision spaces. The fact that LLMs barely improve here despite massive investment points to fundamental architectural gaps – not a benchmark issue, but a structural one. For the industry, this matters: anyone building on LLMs for autonomous game testing or NPC control needs to understand these limits.