NVIDIA Nemotron 3 Nano Omni Powers Multimodal Agent Reasoning in a Single Efficient Open Model

NVIDIA’s Nemotron 3 Nano Omni is a new 30B hybrid MoE model that unifies video, audio, image, and text reasoning in a single multimodal system, replacing fragmented model stacks. It improves agentic AI pipelines by enabling a shared perception-to-action loop, reducing orchestration complexity and lowering inference costs while increasing accuracy across benchmarks such as MMLongBench-Doc, OCRBench v2, WorldSense, and VoiceBench. Benchmarks show that it delivers significantly higher throughput than other open omni models at fixed interactivity thresholds: up to 9.2× for video reasoning and 7.4× for multi-document reasoning. The architecture combines Mamba and transformer layers, 3D convolutions, and efficient video sampling to maintain high performance across modalities while supporting long-context, real-world agent workflows. NVIDIA provides fully open weights, datasets, and training recipes, enabling developers to customize and deploy the model across cloud, enterprise, and on-device environments with broad support from major runtimes and platforms.
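Given the broad runtime support mentioned above, a developer would typically interact with the model through a serving layer. Below is a minimal sketch of assembling a multimodal request, assuming the model is hosted behind an OpenAI-compatible chat endpoint (the request shape exposed by runtimes such as vLLM's OpenAI-compatible server). The model identifier and image URL are illustrative placeholders, not confirmed by the source.

```python
import json

# Hypothetical model identifier -- a placeholder, not confirmed by the
# source; use whatever name your serving runtime registers the model under.
MODEL_ID = "nvidia/nemotron-3-nano-omni"

def build_multimodal_request(question: str, image_url: str) -> dict:
    """Assemble an OpenAI-compatible chat payload that mixes text and an
    image in one user turn, so a single model handles both modalities."""
    return {
        "model": MODEL_ID,
        "messages": [
            {
                "role": "user",
                "content": [
                    {"type": "text", "text": question},
                    {"type": "image_url", "image_url": {"url": image_url}},
                ],
            }
        ],
        "max_tokens": 256,
    }

payload = build_multimodal_request(
    "Summarize the chart in this image.",
    "https://example.com/chart.png",  # placeholder URL
)
print(json.dumps(payload, indent=2))
```

Because one model covers every modality, the same payload structure extends to audio or video inputs by adding further content parts, rather than routing each modality to a separate specialized model.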