
Fine-tuning open LLM judges to outperform GPT-5.2

TL;DR

Together AI demonstrated that an open-source LLM judge (GPT-OSS 120B) can outperform GPT-5.2 at evaluating model outputs. Fine-tuning with Direct Preference Optimization (DPO) on just 5,400 preference pairs was sufficient. The result: 15x lower cost and 14x faster inference, with better alignment to human preferences.
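For readers unfamiliar with DPO: it trains directly on (chosen, rejected) response pairs, without a separate reward model. A minimal sketch of the DPO objective for a single preference pair is below; the function name and inputs are illustrative, not Together AI's actual training code.

```python
import math

def dpo_loss(pi_chosen, pi_rejected, ref_chosen, ref_rejected, beta=0.1):
    """DPO loss for one preference pair.

    Inputs are summed log-probabilities of the chosen and rejected
    responses under the trainable policy (pi_*) and a frozen
    reference model (ref_*). beta scales the implicit reward.
    """
    # Implicit reward margin: how much more the policy prefers the
    # chosen response over the rejected one, relative to the reference.
    margin = (pi_chosen - ref_chosen) - (pi_rejected - ref_rejected)
    # Negative log-sigmoid of the scaled margin; minimizing this pushes
    # the policy to prefer the chosen response more strongly.
    return -math.log(1.0 / (1.0 + math.exp(-beta * margin)))
```

With no margin the loss is log(2) ≈ 0.693, and it shrinks as the policy's preference for the chosen answer grows, which is why a few thousand well-curated pairs can be enough to realign a judge.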

Nauti's Take

Finally, concrete proof that open source isn't just 'good enough' but can directly beat closed models, precisely where it matters most: evaluating output quality. 5,400 training samples sounds almost laughably small, but that's exactly the point: efficient fine-tuning instead of brute-force scaling.

Anyone still convinced only big vendors can deliver reliable evaluation should take a close look at this.


Sources