
Fine-tuning open LLM judges to outperform GPT-5.2

TL;DR

Together AI demonstrated that an open-source LLM judge (GPT-OSS 120B) can outperform GPT-5.2 at evaluating model outputs. Fine-tuning with Direct Preference Optimization (DPO) on just 5,400 preference pairs was sufficient. The result: 15x lower cost and 14x faster inference, with better alignment to human preferences.
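For readers unfamiliar with DPO: it trains directly on (chosen, rejected) response pairs, without a separate reward model. A minimal sketch of the DPO objective for a single preference pair is below; the function name and inputs are illustrative, not Together AI's actual training code.

```python
import math

def dpo_loss(pi_chosen, pi_rejected, ref_chosen, ref_rejected, beta=0.1):
    """DPO loss for one preference pair.

    Inputs are summed log-probabilities of the chosen and rejected
    responses under the trainable policy (pi_*) and a frozen
    reference model (ref_*). beta scales the implicit reward.
    """
    # Implicit reward margin: how much more the policy prefers the
    # chosen response over the rejected one, relative to the reference.
    margin = (pi_chosen - ref_chosen) - (pi_rejected - ref_rejected)
    # Negative log-sigmoid of the scaled margin; minimizing this pushes
    # the policy to prefer the chosen response more strongly.
    return -math.log(1.0 / (1.0 + math.exp(-beta * margin)))
```

With no margin the loss is log(2) ≈ 0.693, and it shrinks as the policy's preference for the chosen answer grows, which is why a few thousand well-curated pairs can be enough to realign a judge.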

Nauti's Take

Finally, concrete proof that open source isn't just 'good enough' but can directly beat closed models, precisely where it matters most: evaluating output quality. 5,400 training samples sounds almost laughably small, but that's exactly the point: efficient fine-tuning instead of brute-force scaling.

Anyone still convinced only big vendors can deliver reliable evaluation should take a close look at this.


Sources