---
title: "Fine-tuning open LLM judges to outperform GPT-5.2"
slug: "fine-tuning-open-llm-judges-to-outperform-gpt-52"
date: 2026-02-03
category: ai-provider
tags: [open-source, ai-safety]
language: en
sources_count: 1
featured: false
publisher: AInauten News
url: https://news.ainauten.com/en/story/fine-tuning-open-llm-judges-to-outperform-gpt-52
---

# Fine-tuning open LLM judges to outperform GPT-5.2

**Published**: 2026-02-03 | **Category**: ai-provider | **Sources**: 1

---

## TL;DR

Together AI demonstrated that an open-source LLM judge (GPT-OSS 120B) can outperform GPT-5.2 at evaluating model outputs. Fine-tuning with Direct Preference Optimization on just 5,400 preference pairs was sufficient. The result: 15x lower cost and 14x faster inference, with better human preference alignment.

---

## Summary

Together AI demonstrated that an open-source LLM judge (GPT-OSS 120B) can outperform GPT-5.2 at evaluating model outputs. Fine-tuning with Direct Preference Optimization on just 5,400 preference pairs was sufficient. The result: 15x lower cost and 14x faster inference, with better human preference alignment.
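The Direct Preference Optimization objective mentioned above can be sketched in a few lines. This is a generic illustration of the standard DPO loss for a single preference pair, not Together AI's actual training code, and the `beta` value is an assumed placeholder:

```python
import math

def dpo_loss(policy_chosen_logp: float, policy_rejected_logp: float,
             ref_chosen_logp: float, ref_rejected_logp: float,
             beta: float = 0.1) -> float:
    """DPO loss for one preference pair.

    Each argument is the summed log-probability a model assigns to a
    response (policy = model being fine-tuned, ref = frozen reference).
    `beta` controls how strongly the policy is pushed away from the
    reference; 0.1 is an illustrative default, not Together AI's setting.
    """
    chosen_ratio = policy_chosen_logp - ref_chosen_logp
    rejected_ratio = policy_rejected_logp - ref_rejected_logp
    logits = beta * (chosen_ratio - rejected_ratio)
    # -log(sigmoid(logits)), written stably as log(1 + exp(-logits))
    return math.log1p(math.exp(-logits))
```

When the policy and reference agree exactly, the loss sits at log 2; it drops below that only once the policy prefers the chosen response more strongly than the reference does, which is what fine-tuning on preference pairs optimizes for.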

---

## Why it matters

A fine-tuned open-source judge that aligns better with human preferences than GPT-5.2, at 15x lower cost and 14x faster inference, lets teams run large-scale output evaluation without depending on a closed model. That only 5,400 preference pairs were needed shows reliable evaluation does not require massive proprietary training runs.

---

## Key Points

- An open-source LLM judge (GPT-OSS 120B) fine-tuned by Together AI outperforms GPT-5.2 at evaluating model outputs
- Fine-tuning used Direct Preference Optimization on just 5,400 preference pairs
- The fine-tuned judge runs at 15x lower cost and 14x faster inference, with better human preference alignment
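A preference pair for judge fine-tuning is typically a prompt plus a chosen and a rejected judgment, stored one JSON object per line (JSONL). The field names below follow a common chosen/rejected convention; the exact schema Together AI used is an assumption here:

```python
import json

# One illustrative preference pair for training an LLM judge.
# The prompt/chosen/rejected layout is a widespread convention for
# DPO-style datasets, not Together AI's confirmed schema.
pair = {
    "prompt": (
        "Which response better answers the user's question? "
        "Response A: ... Response B: ..."
    ),
    "chosen": "Response A is better: it answers directly and justifies its steps.",
    "rejected": "Both are fine.",
}

line = json.dumps(pair)      # one line of a JSONL training file
restored = json.loads(line)  # round-trips without loss
```

At 5,400 such pairs, the whole training set fits in a single small file, which underlines how light the fine-tuning recipe is compared to pre-training-scale data collection.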

---

## Nauti's Take

Finally concrete proof that open source isn't just 'good enough' but can directly beat closed models—precisely where it matters most: evaluating output quality. 5,400 training samples sounds almost laughably small, but that's exactly the point: efficient fine-tuning instead of brute-force scaling. Anyone still convinced only big vendors can deliver reliable evaluation should take a close look at this.

---


## FAQ

**Q:** What is Fine-tuning open LLM judges to outperform GPT-5.2 about?

**A:** Together AI demonstrated that an open-source LLM judge (GPT-OSS 120B) can outperform GPT-5.2 at evaluating model outputs. Fine-tuning with Direct Preference Optimization on just 5,400 preference pairs was sufficient. The result: 15x lower cost and 14x faster inference, with better human preference alignment.

**Q:** Why does it matter?

**A:** It shows that an open-source judge fine-tuned on just 5,400 preference pairs can beat GPT-5.2 at evaluation while running at 15x lower cost and 14x faster inference, removing the need to rely on a closed model for output evaluation.

**Q:** What are the key takeaways?

**A:** See summary above.

---

## Related Topics

- [open-source](https://news.ainauten.com/en/tag/open-source)
- [ai-safety](https://news.ainauten.com/en/tag/ai-safety)

---

## Sources

- [Fine-tuning open LLM judges to outperform GPT-5.2](https://www.together.ai/blog/fine-tuning-open-llm-judges-to-outperform-gpt-5-2) - Together AI Blog

---

## About This Article

This article is a synthesis of 1 sources, curated and summarized by AInauten News. We aggregate AI news from trusted sources and provide bilingual (German/English) coverage.

**Publisher**: [AInauten](https://www.ainauten.com) | **Site**: [news.ainauten.com](https://news.ainauten.com)

---

*Last Updated: 2026-02-07*
