Give an agent a job inside a piece of software and you immediately face a design choice that determines almost everything else: how does it see, and how does it act? There are two answers in production today, and the temptation is to treat them as a timeline — the clumsy DOM scripts of the past giving way to a model that just looks at the screen like we do. That framing is wrong, and it will lead you to the more expensive, less reliable tool for most of the work you actually have.
Two ways to perceive a screen
The vision or computer-use approach hands the model a screenshot. It reasons over the pixels, decides where to click, and emits a low-level action — move the cursor to a coordinate, click, type a string. Anthropic shipped the first frontier version in October 2024; as its docs put it plainly, Claude "looks at screenshots" and counts how many pixels it needs to move the cursor. OpenAI followed with Operator and the Computer-Using Agent in January 2025, exposed in the API as computer-use-preview. Google added Gemini 2.5 Computer Use in October 2025.
The DOM or browser-automation approach never looks at a picture. It reads the page's HTML or accessibility tree and acts on specific elements by reference — the same way an end-to-end test framework like Playwright does. Microsoft's Playwright MCP is the cleanest statement of the philosophy: it returns a structured accessibility snapshot with a stable [ref=N] handle per element, so the agent interacts deterministically with that button, not with a guess at where the button's pixels are.
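To make the contrast concrete, here is a minimal sketch of the two action shapes. Everything in it is a hypothetical illustration: the `Page` toy model, the `ClickAt` and `ClickRef` classes, and the `"e1"` handle are assumptions for the example, not the API of any computer-use model or of Playwright MCP.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Element:
    ref: str    # stable handle, in the spirit of an accessibility-snapshot ref
    label: str
    x: int      # current on-screen position of the element's centre
    y: int


@dataclass(frozen=True)
class ClickAt:
    """Vision-style action: a raw screen coordinate."""
    x: int
    y: int


@dataclass(frozen=True)
class ClickRef:
    """DOM-style action: a reference to one specific element."""
    ref: str


class Page:
    """Toy page whose elements can be hit by coordinate or by ref."""

    def __init__(self, elements):
        self.elements = list(elements)

    def shift(self, dx, dy):
        # Simulate a layout change that moves every element.
        self.elements = [
            Element(e.ref, e.label, e.x + dx, e.y + dy) for e in self.elements
        ]

    def perform(self, action, tolerance=5):
        """Return the label of the element the action hits, or None on a miss."""
        for e in self.elements:
            if isinstance(action, ClickRef) and action.ref == e.ref:
                return e.label
            if (
                isinstance(action, ClickAt)
                and abs(action.x - e.x) <= tolerance
                and abs(action.y - e.y) <= tolerance
            ):
                return e.label
        return None


page = Page([Element("e1", "Submit", 200, 300)])
by_pixel, by_ref = ClickAt(200, 300), ClickRef("e1")

before = (page.perform(by_pixel), page.perform(by_ref))
page.shift(40, 0)  # e.g. a banner appears and the layout moves
after = (page.perform(by_pixel), page.perform(by_ref))
```

In this toy model both actions hit the button before the shift, and only the ref-based one still hits it afterwards. A real vision agent usually takes a fresh screenshot at every step, so the failure it actually risks is a layout change between observation and action, or a coordinate estimate that misses a small target.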
Why the difference is the whole story
The two architectures fail in opposite places, and where each one fails is what should drive the decision.
A DOM snapshot is small and exact. Playwright MCP's accessibility tree is on the order of kilobytes of structured text — roughly a 10-to-100x reduction in input versus shipping a full screenshot every step — and the element references don't drift when the layout nudges two pixels left. That makes DOM agents cheaper, lower-latency, and more reliable wherever the DOM is clean, which describes most of the web.
Pixels buy you generality at the cost of accuracy. A vision agent works on anything with a screen: a legacy desktop app, a canvas- or WebGL-rendered interface with no meaningful DOM, a Citrix session, a flow that leaves the browser entirely. Nothing else can handle those. But the price shows up on the benchmarks. On OSWorld, a suite of 369 real full-OS tasks, humans complete about 72%, and the strongest agents at launch scored dramatically lower: Anthropic's first computer-use model reached 14.9% in the screenshot-only category (still nearly double the 7.8% of the prior best). OpenAI later reported its CUA at 38.1% on OSWorld against that same ~72% human bar. The numbers climb fast, but the gap is the evidence: driving arbitrary software by pixel coordinate is genuinely hard.
Vision is the only thing that works everywhere, which is exactly why it isn't the right default in the places where something cheaper works better.
The frontier is hybrid, and the products already admit it
The strongest tell against "vision wins" is what the serious tools actually do. The DOM-native frameworks don't reject vision — they keep it as an escape hatch. browser-use (now north of 100k GitHub stars), Browserbase's Stagehand, and Skyvern all build on Playwright and reach for a screenshot only when structure fails them. And from the other direction, Google's newest computer-use model is a vision model that is deliberately scoped to the browser rather than the whole OS — a quiet concession that the unbounded pixel-control problem is still too hard to sell as a general product, while the browser is tractable.
So the production pattern is converging on the obvious synthesis: use the accessibility tree or DOM when it's available and clean, and fall back to pixels only when there's nothing structured to act on.
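The fallback pattern described above can be sketched in a few lines. This is a minimal illustration under stated assumptions: `act_hybrid`, `NoStructuredTarget`, and the two stub functions are hypothetical names invented for the example, and a real system would call a browser-automation layer and a vision model where the stubs are.

```python
class NoStructuredTarget(Exception):
    """Raised when the structured snapshot offers no usable element for a goal."""


def act_hybrid(goal, act_by_structure, act_by_pixels):
    """Try the structured path first; drop to pixels only when it has no target.

    Returns a (path, result) pair so callers can log how often the fallback fires.
    """
    try:
        return ("structure", act_by_structure(goal))
    except NoStructuredTarget:
        return ("pixels", act_by_pixels(goal))


def structured_stub(goal):
    # Hypothetical stand-in: pretend only these goals map to an element ref.
    refs = {"click submit": "e1"}
    if goal not in refs:
        raise NoStructuredTarget(goal)
    return f"clicked ref {refs[goal]}"


def vision_stub(goal):
    # Hypothetical stand-in for a screenshot-driven action.
    return f"clicked by coordinate for {goal!r}"


on_dom = act_hybrid("click submit", structured_stub, vision_stub)
on_canvas = act_hybrid("draw on canvas", structured_stub, vision_stub)
```

Returning which path was taken is a deliberate choice in this sketch: the share of steps that fall through to pixels is a rough signal of how much of the workload really needs the vision model.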
How to choose
- Automating clean web tasks at scale? Start with a DOM/browser agent. It's cheaper per step, faster, and more reliable, and the structured snapshot keeps your token bill sane. This is the right default for most work. Which DOM agent to pick is a separate question, answered by a comparison of the browser-automation stack.
- Crossing into software with no clean DOM or API — desktop apps, canvas UIs, remote desktops? You need computer-use. It's the only thing that works there, and you accept higher cost and lower accuracy as the price of universality.
- Building something durable? Plan for both. Drive by structure where you can, drop to pixels where you must, and treat the vision model as the fallback, not the foundation.
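The three cases above reduce to a small decision rule. The sketch below is one hedged way to write it down; the `Workload` fields and the `choose_approach` function are hypothetical names for illustration, and a real decision would also weigh cost, latency, and accuracy targets.

```python
from dataclasses import dataclass
from enum import Enum


class Approach(Enum):
    DOM = "dom"
    VISION = "vision"
    HYBRID = "hybrid"


@dataclass(frozen=True)
class Workload:
    has_clean_dom: bool              # a usable DOM or accessibility tree exists
    has_unstructured_surfaces: bool  # desktop apps, canvas UIs, remote desktops


def choose_approach(w: Workload) -> Approach:
    """Map the checklist above onto a starting architecture."""
    if w.has_clean_dom and w.has_unstructured_surfaces:
        # Drive by structure where possible, drop to pixels where necessary.
        return Approach.HYBRID
    if w.has_clean_dom:
        return Approach.DOM
    # Nothing structured to act on, so pixels are the only option.
    return Approach.VISION


web_only = choose_approach(Workload(True, False))
desktop_only = choose_approach(Workload(False, True))
mixed = choose_approach(Workload(True, True))
```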
The marketing wants this to be a story about agents finally seeing the screen the way you do. The benchmarks tell a duller, more useful story: seeing pixels is the general solution, and the general solution is rarely the reliable one. Pick reliability where you can afford to, and keep generality in reserve for the screens that give you no other choice.



