Put Skyvern and Browser Use side by side and they look like the same product with different logos. Both take a sentence — "log into this portal and download last month's invoices" — and turn it into real clicks in a real browser. Both are open source, both self-host, both run on whatever frontier model you point them at, and both will demo beautifully on a clean website. So the comparison gets filed as a feature bake-off: stars, integrations, who has a workflow builder. That framing produces a tidy table and the wrong decision, because the two tools disagree about something more fundamental than features. They disagree about how an agent should perceive a web page — and that single choice is the one that shows up on your bill, in your error logs, and in your lawyer's inbox.
Two ways to see a page
Browser Use reads the page. On each step it walks the DOM and accessibility tree, builds a structured, indexed list of the interactive elements (the buttons, links, inputs, and their labels), and hands the model that text. The model picks an element by index and acts. Vision is available as an add-on, but the load-bearing perception is the DOM. This is cheap and it is fast: a serialized element list is a few kilobytes of text, not a megapixel image, and the model reasons over clean structured labels instead of pixels. It is also, by construction, blind to anything the DOM doesn't honestly expose: a canvas element rendering a game, an unlabelled div soup, a widget that only means something visually.
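To make the "indexed list of interactive elements" idea concrete, here is a minimal sketch using only Python's standard library. It is not Browser Use's actual serializer: the set of tags treated as interactive, the label fallback order, and the output format are all assumptions chosen for illustration.

```python
from html.parser import HTMLParser

# Assumption: these tags count as "interactive" in this sketch.
# A real agent would use a much richer rule set.
INTERACTIVE = {"a", "button", "input", "select", "textarea"}
VOID = {"input"}  # interactive tags with no closing tag and no inner text


class ElementIndexer(HTMLParser):
    """Collects (tag, label) pairs for interactive elements in document order."""

    def __init__(self):
        super().__init__()
        self.elements = []  # finished (tag, label) pairs
        self._open = None   # [tag, attrs, text_parts] for the element being read

    def handle_starttag(self, tag, attrs):
        if tag not in INTERACTIVE:
            return
        attrs = dict(attrs)
        if tag in VOID:
            self.elements.append((tag, self._label(attrs, "")))
        else:
            self._open = [tag, attrs, []]

    def handle_data(self, data):
        if self._open is not None:
            self._open[2].append(data)

    def handle_endtag(self, tag):
        if self._open is not None and tag == self._open[0]:
            name, attrs, parts = self._open
            text = " ".join("".join(parts).split())  # collapse whitespace
            self.elements.append((name, self._label(attrs, text)))
            self._open = None

    @staticmethod
    def _label(attrs, text):
        # Hypothetical fallback order: aria-label, visible text, placeholder, name.
        return (attrs.get("aria-label") or text or attrs.get("placeholder")
                or attrs.get("name") or "(unlabelled)")


def serialize(html):
    """Return the indexed element list a model would be asked to choose from."""
    parser = ElementIndexer()
    parser.feed(html)
    parser.close()
    return "\n".join(f"[{i}] <{tag}> {label}"
                     for i, (tag, label) in enumerate(parser.elements))
```

Fed a page containing a `<div onclick="...">` and a `<canvas>` alongside a real button, this sketch lists only the button. The clickable div and the canvas never reach the model, which is the blindness described above in miniature.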
Skyvern looks at the page. On each step it screenshots the viewport and reasons over the pixels with vision LLMs: a "swarm of agents," in its own framing, that comprehends the rendered page and maps what it sees to the action needed, with Playwright underneath. Because it re-looks every step, it doesn't care that the DOM is a disaster or that the layout shifted since yesterday; it cares what the page looks like right now. That is genuine robustness. It is also genuinely expensive: every single action sends a full screenshot to a vision model, and that latency and token cost is multiplied by every step of a long workflow.
Browser Use bets the page is readable. Skyvern bets you have to look at it. Everything else — the cost, the failure mode, the fit — is the interest on that bet.
The bet decides the bill, the failure mode, and the fit
Cost falls straight out of perception. A DOM element list is small and gets smaller as you trim it; a per-step screenshot is large and stays large. On a three-click task nobody notices. On a forty-step insurance intake, Skyvern's robustness is being purchased forty times over in vision tokens. If cost-per-run is your constraint and your target pages are DOM-clean, that math favors Browser Use before you've compared a single other feature.
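The arithmetic behind that claim can be sketched in a few lines. Every number below is a placeholder assumption, not a measurement of either tool: plug in token counts from your own traces and your own model's pricing. The sketch counts only input-side perception tokens and ignores output tokens, caching, and retries.

```python
# All three constants are placeholder assumptions for illustration only.
DOM_TOKENS_PER_STEP = 1_500          # hypothetical serialized element list
SCREENSHOT_TOKENS_PER_STEP = 6_000   # hypothetical screenshot sent to a vision model
USD_PER_MILLION_INPUT_TOKENS = 3.0   # hypothetical model price


def perception_cost(steps, tokens_per_step,
                    usd_per_million=USD_PER_MILLION_INPUT_TOKENS):
    """Input-side perception cost of one run, in USD."""
    return steps * tokens_per_step * usd_per_million / 1_000_000


def compare(steps):
    """Cost of the same run under DOM-first and screenshot-first perception."""
    dom = perception_cost(steps, DOM_TOKENS_PER_STEP)
    vision = perception_cost(steps, SCREENSHOT_TOKENS_PER_STEP)
    return {"steps": steps, "dom_usd": dom, "vision_usd": vision,
            "extra_usd": vision - dom}


if __name__ == "__main__":
    for steps in (3, 40):
        print(compare(steps))
```

Whatever the real per-step figures turn out to be, the gap grows linearly with step count, so the same ratio that is invisible on a three-click task becomes a line item on a forty-step one.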
The failure mode flips, too. Browser Use breaks where the DOM lies — unlabelled controls, canvas, dynamic re-renders that move the indices under it. Skyvern breaks where the render is ambiguous — two visually identical buttons, a vision model that misreads a low-contrast field. Neither is more reliable in the abstract. The honest question is which kind of broken your target sites produce. Scraping well-built marketing and SaaS pages? The DOM is fine; pay for Skyvern's eyes and you're buying insurance against a risk you don't carry. Automating a crusty county-government portal that was last restyled in 2009 and renders half its form in an image map? That's exactly the risk Skyvern's eyes are for.
And that is why the fit diverges. Browser Use is the cleaner general-purpose navigator and scraper — "visit, read, extract" across the open web. Skyvern is built around a narrower, deeper shape: the long, multi-page, structured form — government applications, insurance, onboarding flows — which is why it ships a workflow builder with loops and file parsing, a livestreamed viewport so a human can watch it work, and password-manager integrations (Bitwarden, 1Password). One tool optimizes for breadth of the web; the other for depth of a workflow. The ~100k versus ~22k GitHub-star gap isn't a quality verdict — it's breadth-of-use versus depth-of-use, made visible.
The line nobody reads until it's expensive
Here is the dimension that never makes it into the feature table and should be near the top of it: the license. Browser Use is MIT. Skyvern is AGPL-3.0. If you are running either as an internal tool, this is a non-event. If you are building a closed, hosted product on top of one of them, it is the whole conversation. AGPL's copyleft reaches network use: when users interact over a network with a modified AGPL program, you can be obligated to offer them your corresponding source. Embedding a modified Skyvern as the silent engine inside your proprietary SaaS is therefore a deliberate legal decision, not a pip install. MIT asks you for nothing. Plenty of teams have picked their browser agent on benchmark screenshots and discovered the licensing layer only at the point where it's costly to switch.
So: which one
Don't start from stars or the demo. Start from your pages. If they're reasonably well-built and the job is general navigation or scraping, Browser Use's DOM-first perception is cheaper, faster, and permissively licensed — that's your default. If your pages are visually messy, layout-volatile, or DOM-hostile, and the job is a long structured form you'd otherwise pay a human to fill, Skyvern's look-every-step perception is what you're actually buying, and it earns its token bill there. Both self-host; both are model-agnostic; both will pass the demo. The thing that won't pass quietly is the choice underneath — whether your agent reads the page or looks at it. Pick that first, and the rest of the table stops mattering. If you're weighing the lower-level plumbing instead — a managed browser sandbox versus your own — that's a different axis entirely, and worth keeping separate from this one.
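For teams that like their heuristics written down, the rule of thumb in this section can be sketched as a small function. The field names and the "trial both" fallback are assumptions added for illustration; the article's rule only covers the two clear-cut cases, and nothing here replaces testing on your own pages.

```python
from dataclasses import dataclass


@dataclass
class Job:
    # Hypothetical inputs; you would judge these from your own target sites.
    dom_clean: bool                      # pages expose honest, labelled DOM
    long_structured_form: bool           # multi-page form a human would otherwise fill
    closed_hosted_product: bool = False  # shipping a proprietary hosted service


def recommend(job):
    """Return (tool, notes) following the rule of thumb described above."""
    notes = []
    if job.dom_clean:
        tool = "Browser Use"
        notes.append("DOM-first perception: cheaper per step; MIT licensed")
    elif job.long_structured_form:
        tool = "Skyvern"
        notes.append("screenshot-per-step perception: budget for vision tokens")
        if job.closed_hosted_product:
            notes.append("AGPL-3.0: review network-use obligations before embedding")
    else:
        # Assumption: the article gives no default for messy pages with short jobs.
        tool = "trial both"
        notes.append("messy DOM but short job: measure failure rate and cost on your pages")
    return tool, notes
```

For example, `recommend(Job(dom_clean=False, long_structured_form=True, closed_hosted_product=True))` returns Skyvern along with the licensing note, which is the combination worth raising with counsel before any code is written.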



