There are about a dozen ways to let an AI agent click a button on a website, and most of them are the same way wearing a different jacket. Three projects actually matter in 2026, and the useful thing about comparing them is not the feature checklist. It is that they each answer a more basic question differently: what is a web page, from the agent's point of view? Pixels? A DOM? A tree of named, clickable affordances? Pick wrong and you will pay for it in tokens, flakiness, or both.

The contenders

Python agent that drives a real browser via DOM extraction plus vision

browser-use comes first. The README's pitch is blunt: "Make websites accessible for AI agents." It is the most popular of the three by a wide margin, at around 99k stars, and it is pure Python, which is most of why. It hands the model a distilled view of the DOM, falls back on screenshots when structure is not enough, and lets the agent reason its way through whatever it finds. You give it a goal; it figures out the clicks.

TypeScript SDK adding act/extract/agent primitives on top of Playwright
★ 23k · TypeScript · browserbase/stagehand

Stagehand (~23k stars, TypeScript) calls itself "the SDK for browser agents," and the honest description is that it is Playwright with three AI escape hatches bolted on: act() for a single natural-language action ("click the login button"), extract() for pulling structured data against a Zod schema, and agent() for multi-step autonomy. The point of the design is that you do not have to use any of them. The deterministic Playwright you already trust is right there.

Official MCP server exposing Playwright to any MCP client via the accessibility tree
★ 34k · TypeScript · microsoft/playwright-mcp

Playwright MCP (~34k stars, TypeScript, from Microsoft) is the official Model Context Protocol server for Playwright. It exposes browser control as MCP tools, so any MCP client — Claude, Cursor, your own harness — can drive a browser without you wiring anything. Its defining choice is in the README: "Uses Playwright's accessibility tree, not pixel-based input," and "No vision models needed, operates purely on structured data."
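"Without wiring anything" in practice means one config entry. The sketch below builds the kind of client entry the Playwright MCP README describes, launching the server on demand through npx. Where the resulting JSON lives (which file, which settings pane) depends on your MCP client and is not shown; the server key name "playwright" is just a label chosen here.

```python
import json

# One MCP server entry: the client starts the server on demand via npx.
# The key "playwright" is an arbitrary label; the file this JSON belongs in
# depends on the MCP client you use.
config = {
    "mcpServers": {
        "playwright": {
            "command": "npx",
            "args": ["@playwright/mcp@latest"],
        }
    }
}

print(json.dumps(config, indent=2))
```

Paste the printed JSON into your client's MCP configuration and the browser tools should show up alongside whatever else the client already exposes.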

The one thing that actually decides this

Forget stars. The axis that matters is how each tool represents the page to the model, because that representation is what you pay for and what breaks.

Playwright MCP feeds the model the accessibility tree, the same structured layer screen readers use. A button is a node labeled "Submit," not a 40x18 rectangle the model has to locate in a screenshot. That is much cheaper per step. A vision pass on a full-page screenshot can cost thousands of image tokens; an a11y snapshot of the relevant region is text, often an order of magnitude smaller. It is also repeatable: the same page state yields the same tree, so the model's target does not drift between runs.
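To make the "tree as text" idea concrete, here is a minimal sketch that serializes a toy accessibility tree into indented lines with stable references. The output format is only loosely modeled on what an a11y snapshot looks like, not Playwright MCP's actual format, and both the four-characters-per-token rule of thumb and the screenshot token figure are placeholder assumptions, not measurements.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """One node of a toy accessibility tree: role, accessible name, children."""
    role: str
    name: str = ""
    children: list["Node"] = field(default_factory=list)


def snapshot(root: Node) -> str:
    """Serialize the tree to indented text, numbering nodes in document order."""
    lines: list[str] = []

    def walk(node: Node, depth: int) -> None:
        ref = f"e{len(lines) + 1}"  # stable as long as the tree is unchanged
        label = f' "{node.name}"' if node.name else ""
        lines.append(f'{"  " * depth}- {node.role}{label} [ref={ref}]')
        for child in node.children:
            walk(child, depth + 1)

    walk(root, 0)
    return "\n".join(lines)


def estimate_text_tokens(text: str, chars_per_token: int = 4) -> int:
    """Crude rule of thumb, not a real tokenizer."""
    return max(1, len(text) // chars_per_token)


login_form = Node("form", "Sign in", [
    Node("textbox", "Email"),
    Node("textbox", "Password"),
    Node("button", "Submit"),
])

text = snapshot(login_form)

# Placeholder only: real image cost depends on the model and screenshot size.
ASSUMED_SCREENSHOT_TOKENS = 1500

print(text)
print(estimate_text_tokens(text), "text tokens vs",
      ASSUMED_SCREENSHOT_TOKENS, "assumed image tokens")
```

Run it twice and the snapshot is byte-for-byte identical, which is the repeatability argument in miniature: the model is told to act on a named node with a reference, not on a region of pixels.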

browser-use sits at the other end. It leans on the DOM and, when needed, vision, and it spends tokens to buy resilience. When a site renders something the accessibility tree describes badly — a canvas widget, a custom drag interaction, an unlabeled icon grid — the agent can still look at the thing and reason about it. You pay more per step and you accept more nondeterminism, in exchange for an agent that does not faceplant the first time it meets a UI nobody labeled properly.
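One way to find out which camp your pages fall into is to audit them before choosing. The sketch below is a hypothetical heuristic, not anything browser-use does internally: it measures how many interactive elements lack an accessible name and suggests vision when that share is high or a canvas is present. The role list, the dict shape, and the 0.3 threshold are all assumptions made for illustration.

```python
# Hypothetical audit: how badly would a text-only agent be served by this page?
INTERACTIVE_ROLES = {"button", "link", "textbox", "checkbox", "combobox"}


def unlabeled_fraction(elements: list[dict]) -> float:
    """Share of interactive elements that have no accessible name."""
    interactive = [e for e in elements if e.get("role") in INTERACTIVE_ROLES]
    if not interactive:
        return 0.0
    unlabeled = [e for e in interactive if not e.get("name", "").strip()]
    return len(unlabeled) / len(interactive)


def suggest_representation(elements: list[dict], has_canvas: bool,
                           threshold: float = 0.3) -> str:
    """Suggest 'vision' when structure alone is likely to mislead the model."""
    if has_canvas or unlabeled_fraction(elements) > threshold:
        return "vision"
    return "accessibility-tree"


well_built_form = [
    {"role": "textbox", "name": "Email"},
    {"role": "textbox", "name": "Password"},
    {"role": "button", "name": "Submit"},
]

icon_grid = [
    {"role": "button", "name": ""},
    {"role": "button", "name": ""},
    {"role": "button", "name": ""},
    {"role": "button", "name": "Settings"},
]

print(suggest_representation(well_built_form, has_canvas=False))
print(suggest_representation(icon_grid, has_canvas=False))
print(suggest_representation(well_built_form, has_canvas=True))
```

The numbers matter less than the habit: sample the pages you actually have to automate, and let the share of unlabeled controls tell you whether paying for vision is a necessity or a luxury.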

The accessibility tree is cheaper and more repeatable until the page lies to it; vision is expensive and wobbly until it's the only thing that works.

Stagehand's actual innovation is refusing to pick. Because it is Playwright underneath, the boring 80% of any automation — navigate to URL, fill known form, click known selector — stays as deterministic code with zero model calls. You reach for act() or extract() only at the spots where the page is genuinely unpredictable or the selector keeps changing. The AI is a targeted tool, not the engine. For a long-running production scraper, that is the difference between a few cents and a few dollars per run, and between a job that reruns identically and one that improvises every time.
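The pattern itself is language-agnostic, so here is a minimal Python sketch of "deterministic first, model only on failure." It does not use Stagehand's API: `click`, `act`, `run_step`, and `SelectorNotFound` are hypothetical stand-ins, and the toy page is a set of selectors in which `#login` has been renamed.

```python
from typing import Callable


class SelectorNotFound(Exception):
    """Raised by a deterministic step when its selector matches nothing."""


def run_step(deterministic: Callable[[], None],
             model_fallback: Callable[[], None],
             stats: dict) -> str:
    """Try the scripted step; pay for a model call only if it fails."""
    try:
        deterministic()
        return "deterministic"
    except SelectorNotFound:
        stats["model_calls"] += 1
        model_fallback()
        return "model"


# Toy page: "#login" was renamed, so the scripted selector no longer matches.
page_selectors = {"#email", "#password"}


def click(selector: str) -> Callable[[], None]:
    """Hypothetical deterministic action against the toy page."""
    def step() -> None:
        if selector not in page_selectors:
            raise SelectorNotFound(selector)
    return step


def act(instruction: str) -> Callable[[], None]:
    """Stand-in for a natural-language, model-backed action."""
    def step() -> None:
        pass  # a real implementation would call a model here
    return step


stats = {"model_calls": 0}
results = [
    run_step(click("#email"), act("focus the email field"), stats),
    run_step(click("#password"), act("focus the password field"), stats),
    run_step(click("#login"), act("click the login button"), stats),
]

print(results)  # ['deterministic', 'deterministic', 'model']
print(stats)    # {'model_calls': 1}
```

Three steps, one model call. Scale that ratio to a workflow with fifty steps and it is easy to see where the cents-versus-dollars gap comes from.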

So which one

If you are building inside an MCP client and want a browser with the least wiring and the lowest token floor, Playwright MCP is the obvious default. The a11y-tree approach is the right bet for the large majority of mainstream, well-built sites, and "official Microsoft project" is not nothing when you are choosing a dependency to live with.

If your target sites are hostile, weird, or visually driven (the kind where labels are missing and structure is a suggestion), browser-use earns its token cost. It is also the friendliest option if your stack is Python and you want an agent that just goes.

If you are a TypeScript team automating known workflows that occasionally hit a soft spot, Stagehand is the adult choice. Deterministic where you can be, intelligent only where you must be. The model is a scalpel, not a hammer.

The mistake is treating all three as interchangeable "agent gets a browser" libraries and choosing on star count. They are three different answers to what a page is, and your bill — in dollars and in 3am debugging — is mostly decided by which answer fits the pages you actually have to automate.