Two Ways to Show an Agent a Web Page

A year ago "can the agent use a browser" was a research question. It is now a settled one, and the libraries that settled it have quietly split into two camps that disagree about something more interesting than whether the click lands. They disagree about what a web page is when you hand it to a model.

One camp says a page is a tree. The browser already maintains an accessibility tree: every button, field, and link with a role and a name, the same structure a screen reader walks. Feed the agent that, and it reasons over button "Submit" instead of guessing at a rectangle of pixels. The input is deterministic, it's cheap in tokens, and the agent has little room to hallucinate a control that isn't there, because it's reading the controls that are.

The other camp says a page is an image. Render it, screenshot it, and let a vision model find the thing a human would click. This is slower and more expensive, and it's the only thing that works when the "page" is a <canvas>, a PDF viewer, a video game, or a site engineered specifically so the DOM tells you nothing. The tree can be a lie. The pixels are the ground truth a user actually sees.

Pick your camp and you've picked your failure mode: the agent that can't see what isn't in the markup, or the agent that sees everything and is sure about none of it.

The accessibility-tree camp

▟ microsoft/playwright-mcp

An MCP server that gives any agent a browser through Playwright's accessibility tree instead of screenshots — structured, deterministic, no vision model required

★ 34.1k · TypeScript · microsoft/playwright-mcp

This is the cleanest statement of the tree philosophy, and the fact that it ships as an MCP server is the whole point: you bolt it onto an agent you already have rather than building a new one. The README's pitch is "accessibility tree, not pixel-based input," and the consequence is that the same task costs a fraction of the tokens and reruns against the same structured input every time. The catch is scope: when a site renders its UI to a canvas, the tree is empty and the agent is blind.
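To make the tree idea concrete, here is a toy sketch of what "reading the controls that are" looks like, and what the agent is left with when the UI lives on a canvas. Everything in it is an assumption for illustration: `Node`, `snapshot`, and both example pages are hypothetical, and the output format is not Playwright's actual snapshot format.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Node:
    """One accessibility node: a role, an accessible name, and children."""
    role: str
    name: str = ""
    children: list[Node] = field(default_factory=list)


def snapshot(node: Node, depth: int = 0) -> list[str]:
    """Flatten the tree into indented `role "name"` lines a model can read."""
    label = f'{node.role} "{node.name}"' if node.name else node.role
    lines = ["  " * depth + label]
    for child in node.children:
        lines.extend(snapshot(child, depth + 1))
    return lines


# Hypothetical login form: every control is named and addressable.
form_page = Node("document", "Sign in", [
    Node("textbox", "Email"),
    Node("textbox", "Password"),
    Node("button", "Submit"),
])

# Hypothetical canvas-rendered app: the controls are painted pixels,
# so nothing below the canvas element reaches the tree.
canvas_page = Node("document", "Dashboard", [Node("canvas")])

print("\n".join(snapshot(form_page)))
print("\n".join(snapshot(canvas_page)))
```

The first page prints four lines, ending in `button "Submit"`, which is all the agent needs in order to act. The second prints a document and a bare `canvas`, which is the blindness described above in miniature.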

▟ browserbase/stagehand

A framework with act(), extract(), and agent() that lets you mix natural-language steps with plain Playwright code, and caches resolved actions so repeat runs skip the model

★ 23.2k · TypeScript · browserbase/stagehand

Stagehand is the camp's answer to the real production complaint, which is not capability but predictability. Pure agents drift; deterministic scripts shatter the moment a class name changes. Stagehand's bet is the boring middle: use the model to resolve a step once, cache the resolution, and replay it without inference until the page actually changes. It is the closest thing here to an admission that you do not want an LLM in your hot path more than you have to.
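The resolve-once, replay-until-the-page-changes idea can be sketched in a few lines. This is not Stagehand's implementation, and its real cache keys and invalidation rules may differ. `ActionCache`, `fake_model`, and the snapshot strings are hypothetical stand-ins, and hashing the whole snapshot is just one simple way to decide that "the page actually changed."

```python
from __future__ import annotations

import hashlib
from typing import Callable

# Hypothetical signature: (instruction, page snapshot) -> resolved target.
Resolver = Callable[[str, str], str]


class ActionCache:
    """Resolve an instruction with the model once, then replay the result
    until the page snapshot changes."""

    def __init__(self, resolve: Resolver) -> None:
        self._resolve = resolve
        self._cache: dict[tuple[str, str], str] = {}
        self.model_calls = 0

    def act(self, instruction: str, page_snapshot: str) -> str:
        # Fingerprint the page so any change to it invalidates the cached entry.
        fingerprint = hashlib.sha256(page_snapshot.encode("utf-8")).hexdigest()
        key = (instruction, fingerprint)
        if key not in self._cache:
            self.model_calls += 1
            self._cache[key] = self._resolve(instruction, page_snapshot)
        return self._cache[key]


def fake_model(instruction: str, page_snapshot: str) -> str:
    """Stand-in for an LLM call: picks the first button line in the snapshot."""
    for line in page_snapshot.splitlines():
        if line.strip().startswith("button"):
            return line.strip()
    raise LookupError(f"no target found for {instruction!r}")


cache = ActionCache(fake_model)
page_v1 = 'textbox "Email"\nbutton "Submit"'
page_v2 = 'textbox "Email"\nbutton "Continue"'

first = cache.act("click the submit button", page_v1)   # model runs
second = cache.act("click the submit button", page_v1)  # replayed from cache
third = cache.act("click the submit button", page_v2)   # page changed, model runs again
print(first, second, third, cache.model_calls)
```

Three calls cost two model invocations here. The trade-off sits in the fingerprint: hash too much of the page and every cosmetic change forces a new inference, hash too little and the cache replays a stale target.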

The pixels camp

▟ browser-use/browser-use

Gives a model a real browser action space with persistent tools and recovery loops; pairs DOM extraction with vision and works across LLM providers

★ 99.7k · Python · browser-use/browser-use

The star count — by a wide margin the most of any project here — tells you which camp captured the Python-agent mainstream. browser-use is the maximalist option: a genuine computer action space, recovery loops borrowed from coding agents, and a posture of "make websites accessible for AI agents" by whatever combination of DOM and vision the moment requires. It is the framework you reach for when you want the agent to figure it out, and you accept the bill and the variance that come with that freedom.

▟ Skyvern-AI/skyvern

Drives browsers with vision LLMs instead of XPath selectors, so one workflow runs across sites it has never seen and survives layout changes

★ 21.9k · Python · Skyvern-AI/skyvern

Skyvern is the purest expression of the pixels thesis: no brittle selectors, just visual elements mapped to actions, which is what lets a single workflow generalize across a hundred vendor portals that all do the same thing differently. That generality is the entire value proposition for the unglamorous work — insurance forms, government sites, procurement flows — where the DOM is a different mess on every domain. (Note the AGPL-3.0 license; the tree-camp projects are MIT and Apache-2.0, which matters more than it sounds if you're embedding.)

What the split is really about

The click was never the hard part. All four of these will fill a form. The hard part is trust — when the run fails, can you tell why, and will the fix still hold next week? The tree camp buys you a legible, replayable trace and pays in coverage. The pixel camp buys you universal coverage and pays in a model that is confident about a screenshot you now have to second-guess.
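What "legible" buys you is easiest to see as a diff. The sketch below compares two text snapshots of the same page taken a week apart. The snapshot lines and the `snapshot_diff` helper are hypothetical; only `difflib.SequenceMatcher` comes from the Python standard library.

```python
import difflib


def snapshot_diff(before: list[str], after: list[str]) -> list[str]:
    """Return removed ("- ") and added ("+ ") lines between two snapshots."""
    changes: list[str] = []
    matcher = difflib.SequenceMatcher(a=before, b=after, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag in ("delete", "replace"):
            changes += [f"- {line}" for line in before[i1:i2]]
        if tag in ("insert", "replace"):
            changes += [f"+ {line}" for line in after[j1:j2]]
    return changes


# Hypothetical snapshots of one form, last week and today.
last_week = ['textbox "Email"', 'textbox "Password"', 'button "Submit"']
today = [
    'textbox "Email"',
    'textbox "Password"',
    'button "Continue"',
    'link "Forgot password?"',
]

for line in snapshot_diff(last_week, today):
    print(line)
```

When a run that clicked button "Submit" starts failing, a diff like this names the cause in one line. A failed screenshot run offers no equivalent artifact unless you build one.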

The tell, across all four READMEs, is how much of the engineering has migrated away from acting and toward accountability: caching so the model runs less, recovery loops for when it's wrong, accessibility snapshots you can diff. The action space is solved. The market has moved on to the harder question of whether you can stand behind what the agent did — which, for anything touching a real account or a real dollar, is the only question that was ever going to matter.

Dex Mareno
