Why Are Large Language Models so Terrible at Video Games?
TL;DR
LLMs have made little progress at playing video games despite rapid gains elsewhere; a rare exception came in May 2025, when Gemini 2.5 Pro beat Pokémon Blue.
Key Points
- That win came with caveats: far slower than a human player, bizarre repetitive mistakes, and reliance on custom scaffolding software.
- Julian Togelius, director of NYU's Game Innovation Lab and co-founder of AI testing firm Modl.ai, examined these limitations in a recent paper.
- Games demand real-time decisions, spatial reasoning, and long-horizon planning – precisely the areas where current language models fall short.
Nauti's Take
The Pokémon Blue win sounds impressive until you read that the model was slower than a first-grader with a Game Boy and kept repeating the same mistakes. That is not a breakthrough – it is a heavily qualified win with a well-documented asterisk.
Togelius is right: LLMs are text machines optimized for token probabilities, not game objectives. Spatial memory, long-horizon state tracking, reactive decision-making – these are not problems you solve by adding more parameters.
Anyone expecting GPT-5 to simply get better at games is missing the point entirely.