Nemotron ColEmbed V2: Raising the Bar for Multimodal Retrieval with ViDoRe V3’s Top Model
TL;DR
NVIDIA releases Nemotron ColEmbed V2, a multimodal retrieval model that processes text and images together.
Key Points
- Achieves #1 ranking on the ViDoRe V3 benchmark for visual document retrieval tasks
- Built on a ColBERT-style late-interaction architecture that compares queries and documents through token-level similarities instead of a single embedding per item
- Available as open source under the Apache 2.0 license on Hugging Face
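To make the late-interaction idea concrete, here is a minimal sketch of ColBERT-style MaxSim scoring in plain Python. It is not NVIDIA's implementation: the 2-dimensional toy embeddings, the helper names (`dot`, `maxsim_score`, `mean_pool`), and the use of mean pooling as the single-vector baseline are all assumptions chosen for illustration.

```python
from typing import List, Sequence

Vector = Sequence[float]


def dot(a: Vector, b: Vector) -> float:
    """Plain dot product between two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def maxsim_score(query_tokens: List[Vector], doc_tokens: List[Vector]) -> float:
    """Late-interaction (MaxSim) score: for each query token, take its best
    match among the document tokens, then sum those maxima."""
    return sum(max(dot(q, d) for d in doc_tokens) for q in query_tokens)


def mean_pool(tokens: List[Vector]) -> List[float]:
    """Collapse token vectors into one vector (assumed single-vector baseline)."""
    n = len(tokens)
    return [sum(t[i] for t in tokens) / n for i in range(len(tokens[0]))]


# Hypothetical toy embeddings, not output of any real model.
query = [[1.0, 0.0], [0.0, 1.0]]
doc_a = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]  # contains exact matches for both query tokens
doc_b = [[0.5, 0.5], [0.5, 0.5]]              # only "average" tokens

score_a = maxsim_score(query, doc_a)
score_b = maxsim_score(query, doc_b)

# The pooled baseline sees identical vectors for both documents.
pooled_a = dot(mean_pool(query), mean_pool(doc_a))
pooled_b = dot(mean_pool(query), mean_pool(doc_b))

print("MaxSim:", score_a, score_b)
print("Pooled:", pooled_a, pooled_b)
```

In this toy setup MaxSim ranks `doc_a` above `doc_b`, while the mean-pooled vectors of the two documents are identical and therefore tie. Real models use learned, higher-dimensional embeddings, so the effect is less clear-cut in practice.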
Nauti's Take
NVIDIA delivers solid engineering work here, not marketing fantasy. The ViDoRe benchmark is young (V3 has only recently become established), but #1 is #1. The architecture is particularly interesting: late interaction stores one vector per token, so it costs more storage and query-time compute than single-vector embeddings, but it captures nuances that get lost when a whole page is compressed into one vector.
At ~1.2B parameters, the model is relatively small yet beats larger competitors. Apache 2.0 means no license traps and real usability.
The question remains how well it performs outside benchmark PDFs, but the code is there and anyone can test it.
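For a rough sense of the cost side of late interaction, the back-of-the-envelope sketch below compares index sizes. Every number in it (page count, vectors per page, embedding dimension, 2-byte values) is a hypothetical placeholder and not a published figure for Nemotron ColEmbed V2.

```python
def index_size_bytes(num_pages: int, vectors_per_page: int, dim: int,
                     bytes_per_value: int = 2) -> int:
    """Raw storage for an uncompressed embedding index (no metadata, no ANN overhead)."""
    return num_pages * vectors_per_page * dim * bytes_per_value


# Hypothetical corpus and model settings, chosen only for illustration.
NUM_PAGES = 10_000
DIM = 128
VECTORS_PER_PAGE = 1_000  # late interaction keeps one vector per token or patch

single = index_size_bytes(NUM_PAGES, 1, DIM)
multi = index_size_bytes(NUM_PAGES, VECTORS_PER_PAGE, DIM)
ratio = multi // single

print(f"single-vector index:    {single / 1e6:.2f} MB")
print(f"late-interaction index: {multi / 1e9:.2f} GB")
print(f"ratio: {ratio}x")
```

Under these assumptions the multi-vector index is larger by exactly the number of vectors kept per page. That is why late-interaction deployments often rely on compression or token pruning.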