
Nemotron ColEmbed V2: Raising the Bar for Multimodal Retrieval with ViDoRe V3’s Top Model

February 4, 2026 at 03:00 PM | Updated: Mar 20 | 1 source

TL;DR

NVIDIA releases Nemotron ColEmbed V2, a multimodal retrieval model that processes text and images together.

It achieves the #1 ranking on the ViDoRe V3 benchmark for visual document retrieval tasks.

It is built on a late-interaction architecture (ColBERT) that uses token-level similarities instead of single embeddings.

It is available as open source under the Apache 2.0 license on Hugging Face.

Nauti's Take

NVIDIA delivers solid engineering work here, not marketing fantasy. The ViDoRe benchmark is young (V3 just became established), but #1 is #1. The architecture is particularly interesting: late interaction scales worse than single-vector embeddings, because it stores and compares one vector per token rather than one per document, but it captures nuances that get lost in compressed vectors.
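To make the scaling trade-off concrete, here is a back-of-the-envelope sketch of index size. Every number in it (one million pages, 1,000 token vectors per page, 128 and 1,024 dimensions, 2 bytes per value) is an illustrative assumption, not a published figure for Nemotron ColEmbed V2, and `index_size_bytes` is a hypothetical helper.

```python
def index_size_bytes(num_pages, vectors_per_page, dim, bytes_per_value=2):
    """Raw storage for an embedding index, ignoring compression and metadata."""
    return num_pages * vectors_per_page * dim * bytes_per_value

# All figures below are assumptions chosen for illustration only.
pages = 1_000_000

# Single-vector model: one 1024-dimensional vector per page.
single_vector = index_size_bytes(pages, vectors_per_page=1, dim=1024)

# Late-interaction model: 1000 token vectors of 128 dimensions per page.
late_interaction = index_size_bytes(pages, vectors_per_page=1000, dim=128)

print(f"single-vector:    {single_vector / 1e9:.1f} GB")
print(f"late-interaction: {late_interaction / 1e9:.1f} GB")
print(f"ratio:            {late_interaction // single_vector}x")
```

Under these assumptions the late-interaction index is two orders of magnitude larger, which is the cost paid for keeping token-level detail.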

At ~1.2B parameters the model is relatively small, yet it beats larger competitors. Apache 2.0 means no license traps and real usability.

The open question is how well it performs outside benchmark PDFs, but the code is available and anyone can test it.

Briefing

Retrieval is the backbone of modern RAG systems – better search means better answers. Previous models often treated text and images separately or superficially. Nemotron ColEmbed V2 demonstrates that late-interaction approaches (comparing tokens individually rather than compressing everything into one vector) retrieve visual documents significantly more precisely.
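The difference between the two scoring styles can be sketched in a few lines. This is a generic ColBERT-style MaxSim score in plain Python, not NVIDIA's implementation; the function names and the toy 3-dimensional embeddings are made up for illustration.

```python
import math

def normalize(vec):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def maxsim_score(query_tokens, doc_tokens):
    """Late interaction: each query token picks its best-matching document
    token, and the per-token maxima are summed."""
    q = [normalize(v) for v in query_tokens]
    d = [normalize(v) for v in doc_tokens]
    return sum(max(dot(qv, dv) for dv in d) for qv in q)

def single_vector_score(query_tokens, doc_tokens):
    """Baseline: mean-pool each side into one vector, then cosine similarity."""
    def mean_pool(tokens):
        return [sum(col) / len(tokens) for col in zip(*tokens)]
    return dot(normalize(mean_pool(query_tokens)), normalize(mean_pool(doc_tokens)))

# Toy embeddings (assumed values, one row per token or image patch).
query = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
page_a = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]  # matches both query tokens exactly
page_b = [[1.0, 1.0, 0.0], [0.0, 0.0, 1.0]]                   # only a blended partial match

print("MaxSim        A:", maxsim_score(query, page_a), " B:", maxsim_score(query, page_b))
print("Single-vector A:", single_vector_score(query, page_a), " B:", single_vector_score(query, page_b))
```

In this toy case the pooled vectors of both pages point in the same direction, so the single-vector score cannot separate them, while MaxSim ranks the page with exact token matches higher.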

For companies with PDF, presentation, or scan archives, this could finally mean usable semantic search.
