Why Are Large Language Models so Terrible at Video Games?
TL;DR
LLMs have barely improved at video games despite rapid progress elsewhere; a rare exception came in May 2025, when Gemini 2.5 Pro beat Pokémon Blue.
Key Points
- That win came with caveats: far slower than a human player, bizarre repetitive mistakes, and reliance on custom scaffolding software (see the sketch after these points).
- Julian Togelius, director of NYU's Game Innovation Lab and co-founder of AI testing firm Modl.ai, examined these limitations in a recent paper.
- Games demand real-time decisions, spatial reasoning, and long-horizon planning – precisely the areas where current language models fall short.
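The "custom scaffolding" mentioned above is harness code that turns a text-only model into something that can press buttons: it reads game state, builds a prompt, and sanitizes the model's reply into a legal input. A minimal sketch of that kind of loop follows; every name in it (read_game_state, query_llm, press_button) is a hypothetical placeholder, not the actual harness used in the Gemini run.

```python
# Minimal sketch of LLM game-playing scaffolding.
# All helpers below are hypothetical stubs for illustration only.

VALID_BUTTONS = {"up", "down", "left", "right", "a", "b", "start"}

def read_game_state() -> str:
    """Placeholder: summarize the current screen or RAM as text."""
    return "Player is in Pallet Town, facing north. Menu closed."

def query_llm(prompt: str) -> str:
    """Placeholder: send the prompt to a language model, return its reply."""
    return "up"

def press_button(button: str) -> None:
    """Placeholder: forward the chosen button press to an emulator."""
    print(f"pressing {button}")

def play(max_steps: int = 10) -> None:
    history: list[str] = []  # the scaffolding, not the model, keeps memory
    for _ in range(max_steps):
        state = read_game_state()
        prompt = (
            "You are playing Pokémon Blue.\n"
            f"Recent actions: {history[-20:]}\n"
            f"Current state: {state}\n"
            f"Reply with exactly one button from {sorted(VALID_BUTTONS)}."
        )
        reply = query_llm(prompt).strip().lower()
        # Sanitize the model's free-form text into a legal game input.
        button = reply if reply in VALID_BUTTONS else "a"
        press_button(button)
        history.append(button)

if __name__ == "__main__":
    play()
```

The point of the sketch is how much of the work happens outside the model: the harness tracks history, formats state, and enforces valid moves, which is exactly why the win comes with an asterisk.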
Nauti's Take
The Pokémon Blue win sounds impressive until you read that the model was slower than a first-grader with a Game Boy and kept repeating the same mistakes. That is not a breakthrough – it is a well-documented failure with an asterisk.
Togelius is right: LLMs are text machines optimized for token probabilities, not game objectives. Spatial memory, long-horizon state tracking, reactive decision-making – these are not problems you solve by adding more parameters.
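A standard way to state that mismatch (generic notation, not taken from Togelius's paper): pretraining maximizes the log-likelihood of the next token over a text corpus, while playing a game well means maximizing expected return over a long horizon under a policy.

```latex
% Pretraining objective: next-token log-likelihood
\max_{\theta} \; \sum_{t} \log p_{\theta}(x_t \mid x_{<t})
\qquad\text{vs.}\qquad
% Game-playing objective: expected discounted return of a policy \pi
\max_{\pi} \; \mathbb{E}_{\pi}\!\left[\sum_{t=0}^{T} \gamma^{t} r_t\right]
```

Scaling the first objective does not, by itself, optimize the second.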
Anyone expecting GPT-5 to simply get better at games is missing the point entirely.
Context
Video games are considered an ideal testbed for general AI capability because they combine clear rules, measurable progress, and complex decision spaces. The fact that LLMs barely improve here despite massive investment points to fundamental architectural gaps – not a benchmark issue, but a structural one. For the industry, this matters: anyone building on LLMs for autonomous game testing or NPC control needs to understand these limits.