There is a benchmark you are probably using to pick an open vision-language model for your agent, and it is the wrong one. MMMU — the broad multimodal exam everyone quotes — measures whether a model can understand an image well enough to answer a question about it. That is a fine thing to measure for a chatbot. It is close to irrelevant for an agent, because an agent does not answer questions about the screenshot. It has to act on it. And acting means producing a coordinate: a bounding box around the "Submit" button, a point on the right cell of the invoice, a click target inside a dense professional UI. The skill is called grounding, and it is where the open field separates in ways that the headline scores hide.
Description is not action
The clean way to see the gap is to put two models of the same size next to each other on a grounding benchmark instead of a comprehension one. ScreenSpot-Pro does exactly this: 1,581 instructions across 23 professional applications — Photoshop, AutoCAD, IDEs — where the model must return the location of a target UI element in a high-resolution screenshot. It is unforgiving, because the targets are small and the screens are crowded, which is precisely the regime real desktop agents operate in.
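To make the scoring concrete: grounding benchmarks of this kind are commonly scored by checking whether the predicted click point lands inside the target element's bounding box. A minimal sketch of that metric follows, assuming boxes are given as `(x1, y1, x2, y2)` in pixels. The function names and the sample data are illustrative and are not taken from ScreenSpot-Pro's own harness.

```python
from typing import Sequence, Tuple

Point = Tuple[float, float]
Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2) in pixels; assumed convention


def point_in_box(point: Point, box: Box) -> bool:
    """Return True if the click point falls inside (or on the edge of) the box."""
    x, y = point
    x1, y1, x2, y2 = box
    return x1 <= x <= x2 and y1 <= y <= y2


def grounding_accuracy(predictions: Sequence[Point], targets: Sequence[Box]) -> float:
    """Fraction of predicted clicks that land inside their target box."""
    if len(predictions) != len(targets):
        raise ValueError("predictions and targets must have the same length")
    if not targets:
        return 0.0
    hits = sum(point_in_box(p, b) for p, b in zip(predictions, targets))
    return hits / len(targets)


# Hypothetical example: a 24x24 px toolbar icon on a 4K screenshot.
icon = (3000.0, 40.0, 3024.0, 64.0)
clicks = [(3012.0, 52.0), (3052.0, 52.0)]  # second click lands 40 px to the right
print(grounding_accuracy(clicks, [icon, icon]))  # prints 0.5
```

The second click is off by forty pixels on a screen nearly four thousand pixels wide, and it still counts as a complete miss. The metric is binary, and that is why small targets are so punishing.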
On that benchmark, H Company's Holo1.5 reports its 7B model at 57.94% against the generalist Qwen2.5-VL-7B at 29.00%. Same parameter budget. Roughly double the localization accuracy. Holo1.5 gets there not by being a better all-round model but by being fine-tuned for one thing, GUI localization and UI question-answering, on top of a Qwen2.5-VL base. H Company also reports that its 72B variant pushes localization accuracy into the 80s and beats not just open generalists but specialized systems like UI-TARS and even a closed generalist (Claude Sonnet 4) on these UI tasks. That is the whole argument in one data point: a model can be mediocre at describing a screen and excellent at clicking it, or the reverse.
An agent never asks the screenshot a question. It asks for a coordinate — and a model that scores 70 on MMMU can still miss the button by forty pixels.
The generalists are real, and they ground natively
This does not mean you need a niche model. The strongest open generalists have made grounding a first-class output, not an afterthought. Qwen3-VL was released between September and November 2025 in sizes from 2B up to the 235B-A22B mixture-of-experts flagship, under an Apache-2.0 license, with a native 256K context window expandable toward 1M. It emits 2D and 3D bounding boxes and point coordinates directly, and its larger Instruct variants report some of the strongest open-model scores on general ScreenSpot grounding. For most teams this is the right default: it reads documents and charts as well as anything open, and it can act on what it reads without a second model.
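Turning a model's box into a click takes a little glue code. The sketch below assumes the model replies with a JSON object carrying a `bbox_2d` field of four numbers, and that coordinates may be either absolute pixels or normalized to a fixed grid. Both the field name and the `grid` parameter are assumptions for illustration, so check the model card for the convention your model actually uses.

```python
import json
from typing import Optional, Tuple


def bbox_to_click(
    reply: str,
    image_size: Tuple[int, int],
    grid: Optional[int] = None,
) -> Tuple[int, int]:
    """Convert a JSON box reply into a pixel click point at the box center.

    reply: model output, assumed to look like '{"bbox_2d": [x1, y1, x2, y2]}'.
    image_size: (width, height) of the screenshot in pixels.
    grid: if set, coordinates are assumed to be normalized to 0..grid
          and are rescaled to pixels; if None, they are treated as pixels.
    """
    x1, y1, x2, y2 = json.loads(reply)["bbox_2d"]
    width, height = image_size
    if grid is not None:
        x1, x2 = x1 * width / grid, x2 * width / grid
        y1, y2 = y1 * height / grid, y2 * height / grid
    cx, cy = (x1 + x2) / 2, (y1 + y2) / 2
    # Clamp so a slightly out-of-range box still yields a valid pixel.
    cx = min(max(cx, 0), width - 1)
    cy = min(max(cy, 0), height - 1)
    return round(cx), round(cy)


reply = '{"bbox_2d": [500, 250, 600, 300]}'
print(bbox_to_click(reply, (3840, 2160)))             # read as pixels: (550, 275)
print(bbox_to_click(reply, (3840, 2160), grid=1000))  # read as 0..1000: (2112, 594)
```

The same reply lands in two very different places depending on the coordinate convention. A mismatch here shows up in logs as a grounding failure even when the model located the element correctly, so it is worth ruling out before blaming the model.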
InternVL3.5 (also Apache-2.0, built on a Qwen3 language backbone, released August 2025) is the generalist to reach for when multimodal reasoning and document QA dominate your workload. It reports MMMU around 73.4 at 8B and into the high 70s at its 241B-A28B MoE scale — frontier territory for open weights — and it is evaluated explicitly on agentic and GUI-grounding suites including ScreenSpot and OSWorld-G, so the grounding is measured, not assumed. The honest caveat is the one this whole piece is about: a high MMMU number tells you it will reason well about the page. It does not, by itself, promise pixel-accurate clicks. Check the grounding column separately.
When the click is the product
There is a class of agent where grounding is not one feature among many — it is the entire job. A computer-use agent that drives a browser or a desktop spends most of its turns doing one thing: look at a screenshot, decide where to click. Here the case for a specialist is strongest. The Surfer-H / Holo1 work makes the economic version of the argument: an open-weights action model, tuned for localization, can run a cost-efficient web agent that competes with far larger generalist stacks, because accuracy on the click is what converts into task success and you are not paying frontier-API rates per step. If your agent's failure logs are full of "clicked the wrong element," the fix is rarely a bigger general model. It is a model that was trained to localize.
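A back-of-the-envelope calculation shows why per-click accuracy dominates. Assume, purely for illustration, that a task needs n correct clicks in a row and that each click succeeds independently with probability p, so task success is p raised to the power n. Real agents can retry and recover, and a benchmark score is not a per-step success rate in your application, so read the output as a shape rather than a forecast.

```python
def task_success(per_click_accuracy: float, clicks_needed: int) -> float:
    """Probability that every click in a task succeeds, assuming independence."""
    if not 0.0 <= per_click_accuracy <= 1.0:
        raise ValueError("per_click_accuracy must be between 0 and 1")
    if clicks_needed < 0:
        raise ValueError("clicks_needed must be non-negative")
    return per_click_accuracy ** clicks_needed


# Illustrative per-click accuracies, not measurements from any model.
for p in (0.30, 0.60, 0.90):
    row = ", ".join(f"{n} clicks: {task_success(p, n):.3f}" for n in (1, 3, 5))
    print(f"p={p:.2f} -> {row}")
```

Under this simplified model, a gap that looks moderate on a single click widens quickly as tasks get longer.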
The trade you accept is breadth. A GUI specialist inherits its base model's document and chart ability but is optimized for screens; for a system that must both read a 40-page contract and operate a UI, the cleaner architecture is often a strong generalist (Qwen3-VL) as the default with a specialist invoked only for the localization-critical steps — the same route-by-difficulty logic agents already use for reasoning.
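That routing idea can be sketched in a few lines. Everything here is hypothetical: the `Step` type, the `needs_pixel_precision` flag, and the two model callables stand in for whatever your agent framework provides, and how a step gets flagged as localization-critical is left to the planner.

```python
from dataclasses import dataclass
from typing import Callable, Dict

# A model is any callable taking (screenshot_bytes, instruction) and returning text.
Model = Callable[[bytes, str], str]


@dataclass
class Step:
    instruction: str
    needs_pixel_precision: bool  # hypothetical flag set by the planner


def make_router(generalist: Model, specialist: Model) -> Callable[[bytes, Step], Dict[str, str]]:
    """Build a router: generalist by default, specialist for localization steps."""

    def route(screenshot: bytes, step: Step) -> Dict[str, str]:
        if step.needs_pixel_precision:
            name, model = "specialist", specialist
        else:
            name, model = "generalist", generalist
        return {"model": name, "output": model(screenshot, step.instruction)}

    return route


# Stub models so the sketch runs without any real inference backend.
def fake_generalist(screenshot: bytes, instruction: str) -> str:
    return f"summary for: {instruction}"


def fake_specialist(screenshot: bytes, instruction: str) -> str:
    return '{"point_2d": [812, 431]}'


route = make_router(fake_generalist, fake_specialist)
print(route(b"", Step("Summarize clause 7 of the contract", False))["model"])  # generalist
print(route(b"", Step("Click the Export button", True))["model"])              # specialist
```

Keeping the decision in one small function makes it easy to log which model handled each step, which is the data you need to tell whether the specialist is paying for itself.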
On the edge, grounding still comes first
The on-device tier confirms the pattern rather than breaking it. Moondream 3 (a 9B mixture-of-experts with roughly 2B active parameters, a 32K context, released September 2025) is built around native vision skills (pointing, object detection, bounding boxes, segmentation) exposed as model outputs, not bolted-on prompts. It is small and fast enough for realtime and edge deployment, and it leads with grounding precisely because that is what a small agentic vision model is for. The one thing to read carefully is the license: Moondream 3 ships under a Business Source License with a use grant, not the Apache-2.0 of the Qwen and InternVL lines. That is fine for most products, but worth a lawyer's glance if you are reselling.
The recommendation, by job
- Document RAG / read-and-cite agents: Qwen3-VL as the default — Apache-2.0, long context, strong OCR and charts, native bounding boxes so it can point at the region it cites. Pair it with a visual retriever like ColPali rather than text chunking. Reach for InternVL3.5 when the work is reasoning-heavy.
- Computer-use / UI agents: lead with grounding. A GUI-tuned specialist like Holo1.5 earns its keep when click accuracy on dense, professional UIs decides whether the task completes. Validate on ScreenSpot-Pro or your own UI, not MMMU.
- On-device / realtime: Moondream 3 — grounding-native and small — provided its license fits your distribution.
The meta-point outlasts every version number in this piece. Models churn; the Qwen-VL and InternVL lines will ship a new generation before this sentence is stale. What does not change is the selection criterion. For an agent, evaluate the model on the artifact it must actually produce — a coordinate — and treat the comprehension leaderboard as the marketing it is.
All parameter counts, licenses, context lengths, and benchmark figures above are drawn from the vendors' own model cards, repositories, and technical reports as of 2026-06-25, and are attributed as such; ScreenSpot-Pro figures are H Company's published numbers. No live leaderboard standings are quoted, as they move week to week.



