Show HN: AI image models hallucinate history, we built a method to fix it
TL;DR
We created 24 image prompts across 3 characters living in Rome, 110 CE. Each prompt has a naive version and a culturally-grounded version enhanced by the Triad Engine (structured domain knowledge injection). Same model, same pipeline, only the prompt changes. A blinded Gemini Vision judge scores each pair without knowing which is which.

Results:
- RAW (naive prompt): 12.5% historically accurate
- TRIAD (grounded prompt): 83.3% historically accurate
- In 23 of 24 pairs, the grounded image was judged more accurate
- In 0 of 24 pairs was the naive image judged better

The key insight for prompt engineers: image models silently drop historical terms they don't recognize. "dextrarum iunctio handshake" produces nothing useful; "two men clasping right hands wrist-to-wrist, elbows raised" works. Visual translation, not historical terminology.

The full benchmark includes all 48 images, prompts, and evaluation data.
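The "visual translation" step described above can be sketched roughly as a glossary substitution pass run before the prompt reaches the image model. This is an illustrative simplification, not the actual Triad Engine: the `GLOSSARY` entries and `ground_prompt` helper are hypothetical names invented here, and the real system injects structured domain knowledge rather than doing string replacement.

```python
# Hypothetical sketch: replace historical terms that image models silently
# drop with plain visual descriptions ("visual primitives").
GLOSSARY = {
    "dextrarum iunctio": "two men clasping right hands wrist-to-wrist, elbows raised",
    "toga praetexta": "white wool toga with a single broad purple border stripe",
}

def ground_prompt(prompt: str) -> str:
    """Swap each known historical term for its visual description."""
    for term, visual in GLOSSARY.items():
        i = prompt.lower().find(term)  # case-insensitive lookup
        if i != -1:
            # Rebuild the prompt around the matched span.
            prompt = prompt[:i] + visual + prompt[i + len(term):]
    return prompt

print(ground_prompt("Roman wedding scene, dextrarum iunctio, 110 CE"))
```

The grounded prompt keeps the scene context but carries only terms the model can actually render.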
Nauti's Take
Image models don't know history — they know pixels. The Triad Engine exposes a brutal truth: feeding an AI 'dextrarum iunctio' is like speaking Latin to a golden retriever.
Translate concepts into visual primitives, and accuracy jumps from 12.5% to 83.3%.