LocateAnything: Fast and High-Quality Vision-Language Grounding with Parallel Box Decoding

NVIDIA’s LocateAnything is an impressive leap forward in vision-language grounding, delivering both speed and accuracy through its new Parallel Box Decoding (PBD) approach. Instead of generating bounding boxes token by token like traditional VLMs, the model predicts each box as a complete atomic unit, enabling dramatically faster inference while preserving geometric consistency.
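To make the contrast concrete, here is a toy sketch that counts sequential decoding steps under each scheme. It is not NVIDIA's implementation: the function names are hypothetical, and the assumption of exactly four coordinate tokens per box (x1, y1, x2, y2, with no delimiter tokens) is a simplification for illustration.

```python
from typing import List, Tuple

# A box as (x1, y1, x2, y2) in pixel coordinates.
Box = Tuple[int, int, int, int]


def autoregressive_steps(boxes: List[Box], tokens_per_box: int = 4) -> int:
    """Sequential steps if every coordinate is emitted as its own token.

    Assumption: one token per coordinate and no separator tokens.
    """
    return len(boxes) * tokens_per_box


def parallel_box_steps(boxes: List[Box]) -> int:
    """Sequential steps if each box is emitted as a single atomic unit."""
    return len(boxes)


boxes = [(10, 20, 110, 220), (30, 40, 90, 160), (5, 5, 50, 50)]
print(autoregressive_steps(boxes), parallel_box_steps(boxes))  # 12 3
```

The step count is only a rough proxy. Real throughput also depends on per-step cost, batching, and how often the model falls back to autoregressive decoding, so the ratio here should not be read as a speedup prediction.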

The system supports a wide range of localization tasks (document understanding, GUI grounding, dense object detection, OCR, and more) under a unified framework. Its hybrid inference mode switches between fast parallel decoding and slower autoregressive decoding as needed, maintaining both speed and robustness.

Performance numbers are strong: LocateAnything achieves up to 2.5× higher throughput than previous models while also improving high-IoU accuracy across benchmarks like LVIS, COCO, and ScreenSpot-Pro. The gains are supported by a massive training dataset (138M queries and 785M boxes) spanning general object detection, GUI elements, text localization, and more.
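For readers unfamiliar with the metric, "high-IoU accuracy" means predictions are scored against strict intersection-over-union thresholds, so a box must overlap the ground truth tightly to count as correct. The sketch below is a standard IoU computation for axis-aligned boxes; the 0.5 and 0.75 thresholds are common conventions used here as an assumption, not values taken from the LocateAnything evaluation.

```python
from typing import Tuple

# A box as (x1, y1, x2, y2) with x1 < x2 and y1 < y2.
Box = Tuple[float, float, float, float]


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    ax1, ay1, ax2, ay2 = a
    bx1, by1, bx2, by2 = b
    # Width and height of the overlap, clamped to zero when boxes are disjoint.
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0


ground_truth = (0, 0, 100, 100)
prediction = (10, 10, 110, 110)  # shifted by 10 px on both axes
score = iou(ground_truth, prediction)
print(round(score, 3))            # 0.681
print(score >= 0.5, score >= 0.75)  # True False
```

A 10-pixel shift on a 100-pixel box already fails the stricter threshold, which is why gains at high IoU indicate more precise box geometry rather than just more detections.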

Overall, LocateAnything stands out as a fast, scalable, and highly capable grounding model that pushes the speed-accuracy frontier forward in a meaningful way.

Source: research.nvidia.com/labs/lpr/locate-anything/

Code: https://github.com/NVlabs/Eagle/tree/main/Embodied

May 29, 2026
