Fine-tuning open LLM judges to outperform GPT-5.2
TL;DR
Together AI demonstrated that an open-source LLM judge (GPT-OSS 120B) can outperform GPT-5.2 at evaluating model outputs. Fine-tuning with Direct Preference Optimization (DPO) on just 5,400 preference pairs was sufficient. The result: 15x lower cost and 14x faster inference, with better alignment to human preferences.
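For readers unfamiliar with DPO: each of those 5,400 preference pairs is a prompt with a "chosen" and a "rejected" response, and training pushes the policy model to prefer the chosen one relative to a frozen reference model. The following is a minimal sketch of the DPO loss for a single pair; the function name and the example log-probabilities are illustrative, not taken from Together AI's setup.

```python
import math

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """DPO loss for one preference pair.

    Each argument is the summed log-probability of the chosen or
    rejected response under the policy being tuned or under the
    frozen reference model. beta controls how far the policy may
    drift from the reference.
    """
    chosen_reward = beta * (policy_chosen_logp - ref_chosen_logp)
    rejected_reward = beta * (policy_rejected_logp - ref_rejected_logp)
    margin = chosen_reward - rejected_reward
    # -log(sigmoid(margin)): small when the policy already prefers the
    # chosen response more strongly than the reference model does.
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

# Illustrative pair (numbers are made up): the policy rates the chosen
# judgment higher, and the rejected one lower, than the reference does,
# so the loss drops below log(2), the value at zero margin.
loss = dpo_loss(-12.0, -20.0, -14.0, -18.0, beta=0.1)
print(round(loss, 4))  # → 0.513
```

With only a scalar loss per pair, a few thousand examples can carry a strong training signal, which is consistent with the small dataset size reported here.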
Nauti's Take
Finally, concrete proof that open source isn't just 'good enough' but can directly beat closed models, and precisely where it matters most: evaluating output quality. 5,400 training samples sounds almost laughably small, but that's exactly the point: efficient fine-tuning instead of brute-force scaling.
Anyone still convinced only big vendors can deliver reliable evaluation should take a close look at this.
Context
Evaluating LLMs has been costly in time, money, or both. The fact that a 120B open-source model beats GPT-5.2 with minimal training shows evaluation doesn't need to be expensive or proprietary. This finally makes LLM judges practical for smaller teams and research labs.