
Why Are Large Language Models so Terrible at Video Games?

TL;DR

LLMs have failed to improve at video games despite rapid progress elsewhere, with one rare exception: Gemini 2.5 Pro beat Pokémon Blue in May 2025.

Key Points

  • That win came with caveats: far slower than a human player, bizarre repetitive mistakes, and reliance on custom scaffolding software.
  • Julian Togelius, director of NYU's Game Innovation Lab and co-founder of AI testing firm Modl.ai, examined these limitations in a recent paper.
  • Games demand real-time decisions, spatial reasoning, and long-horizon planning – precisely the areas where current language models fall short.

Nauti's Take

The Pokémon Blue win sounds impressive until you learn that the model played far slower than a first-grader with a Game Boy and kept repeating the same mistakes. That is not a breakthrough – it is a well-documented failure with an asterisk.

Togelius is right: LLMs are text machines optimized for next-token probabilities, not game objectives. Spatial memory, long-horizon state tracking, reactive decision-making – these are not problems you solve by adding more parameters.

Anyone expecting GPT-5 to simply get better at games is missing the point entirely.

Sources