Phi-4-reasoning-vision and the lessons of training a multimodal reasoning model

TL;DR

We are pleased to announce Phi-4-reasoning-vision-15B, a 15-billion-parameter open-weight multimodal reasoning model, available through Microsoft Foundry, Hugging Face, and GitHub. Phi-4-reasoning-vision-15B is a broadly capable model suited to a wide array of vision-language tasks such as image captioning, asking […] The post Phi-4-reasoning-vision and the lessons of training a multimodal reasoning model appeared first on Microsoft Research.

Nauti's Take

The release of Phi-4-reasoning-vision-15B marks a notable milestone in multimodal AI research. The 15-billion-parameter model showcases impressive vision-language capabilities, but its broader value may lie in the training lessons Microsoft shares alongside it.

Microsoft's decision to release the model with open weights is a welcome move, but the real test lies in practical application. Can Phi-4-reasoning-vision-15B live up to its promise and deliver tangible value in real-world scenarios?

Sources