The "deep research" feature — point an agent at a hard question, walk away, come back to a long, cited report — was a closed product a year ago. (If you're fuzzy on the category itself, start with what "deep agents" actually are.) Now it's a crowded open-source category, and several of the implementations are good enough to put real work through. The pattern underneath them is almost always the same: a planner decomposes the question into sub-questions, a set of searchers gather sources in parallel, the agent reads and recurses on promising leads, and a writer synthesizes a report with citations. What differs — and what you should choose on — is the stack, where it runs, and how much of that loop is exposed to you as configuration.
Here are seven worth knowing, from the smallest reference to the leaderboard-toppers.
The standalone tools
The most-starred standalone in the category, and the most "just run it." Its Deep Research mode is a recursive tree exploration with configurable depth, breadth, and concurrency — a full run lands around five minutes and well under a dollar on a small reasoning model. If you want a deep-research tool rather than a deep-research framework, start here.
OWL's clever move is lazy browser use: it decides per step whether a cheap tool (search, code execution, an Arxiv or GitHub toolkit) is enough and only spins up a real browser when a page genuinely needs interaction. Browsers are the slowest, most expensive, most failure-prone tool in any research agent — treating them as a last resort is why OWL is both fast and near the top of GAIA.
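To make the last-resort idea concrete, here is a minimal Python sketch. It is not OWL's code: OWL decides per step which tool is enough, whereas this sketch swaps in a simpler fixed fallback order, trying cheap tools first and opening the browser only when none of them returns an answer. The names `run_step`, `search_tool`, and `browser_tool` are hypothetical and exist only for illustration.

```python
from typing import Callable, List, Optional, Tuple

# A tool takes a task description and returns a result, or None if it cannot help.
Tool = Callable[[str], Optional[str]]


def run_step(
    task: str,
    cheap_tools: List[Tuple[str, Tool]],
    browser: Tool,
) -> Tuple[str, Optional[str]]:
    """Try cheap tools in order; fall back to the browser only if all decline."""
    for name, tool in cheap_tools:
        result = tool(task)
        if result is not None:
            return name, result
    # Slowest and most failure-prone tool, so it runs last.
    return "browser", browser(task)


# Hypothetical stand-ins for real tools.
def search_tool(task: str) -> Optional[str]:
    # Pretend plain search can only answer definition-style lookups.
    return "search snippet" if task.startswith("define") else None


def browser_tool(task: str) -> Optional[str]:
    return "rendered page content"


cheap = [("search", search_tool)]
lookup_tool, lookup_result = run_step("define GAIA benchmark", cheap, browser_tool)
form_tool, form_result = run_step("fill in the site's filter form", cheap, browser_tool)
print(lookup_tool, form_tool)
```

In this toy run the definition lookup is served by search and only the form-filling step reaches the browser, which is the cost profile the lazy approach is after.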
The hackable bases
The standout feature is honesty about architecture. The repo keeps its older designs — plan-and-execute, supervisor-researcher multi-agent — in src/legacy/ and shows the current single-loop design beats them on DeepResearch Bench. You get a base you can A/B real architectural choices in, not just a black box that happens to work.
Read this one before you build anything. Its whole behavior is governed by two explicit knobs — depth (how many times it recurses, default 2) and breadth (how many parallel queries per level, default 4). That's the single most important design idea in the category made literal: the breadth-versus-depth tradeoff that controls both your bill and your report quality should be a config value you set, not an emergent property of a prompt.
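A minimal Python sketch of that recursion shows how directly the two knobs set the bill. This is an illustration under stated assumptions, not the repo's code: `research`, `generate_queries`, and `search` are hypothetical names, the model and search calls are replaced by stubs, and breadth is held constant at every level for simplicity (a real implementation may narrow it as it goes deeper).

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass
class Report:
    learnings: List[str] = field(default_factory=list)
    queries_run: int = 0  # a rough proxy for cost


def research(
    topic: str,
    depth: int,
    breadth: int,
    generate_queries: Callable[[str, int], List[str]],
    search: Callable[[str], str],
    report: Optional[Report] = None,
) -> Report:
    """Run `breadth` queries on the topic, then recurse on each finding `depth` levels deep."""
    report = report if report is not None else Report()
    if depth <= 0:
        return report
    for query in generate_queries(topic, breadth)[:breadth]:
        report.queries_run += 1
        learning = search(query)
        report.learnings.append(learning)
        # Each finding becomes the topic of the next, shallower level.
        research(learning, depth - 1, breadth, generate_queries, search, report)
    return report


# Hypothetical stubs standing in for an LLM planner and a search backend.
def fake_queries(topic: str, n: int) -> List[str]:
    return [f"{topic} / q{i}" for i in range(n)]


def fake_search(query: str) -> str:
    return f"finding for {query}"


result = research("open deep research", depth=2, breadth=4,
                  generate_queries=fake_queries, search=fake_search)
print(result.queries_run)  # 20: 4 at the first level, then 4 for each of those 4
```

With constant breadth the query count is breadth + breadth² + … up to `depth` terms, so raising depth from 2 to 3 at breadth 4 takes this sketch from 20 searches to 84. That growth is why the knobs belong in configuration rather than in a prompt.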
The specialists
The privacy pick. Nothing leaves the machine: the model is local and search backends are pluggable (DuckDuckGo, SearXNG, Tavily, Perplexity). Because it ships as a LangGraph Studio graph, the gap-detection loop is inspectable node-by-node — useful when a local model goes off the rails and you need to see where.
Most repos here are scripts; this is an app. It leans on Firecrawl's extract with JSON-schema-validated outputs to turn scraped pages into typed data, and supports reasoning models across providers. Reach for it when the deliverable is a product, not a notebook.
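The idea behind schema-validated extraction can be sketched in a few lines of Python. This is neither Firecrawl's API nor a full JSON Schema validator: it is a hand-rolled check using only the standard library, and the `SCHEMA` fields and the `parse_typed` helper are hypothetical. The point is only that model output gets rejected before it reaches the app unless it parses into the expected typed shape.

```python
import json
from typing import Any, Dict

# Hypothetical record shape the app expects from each scraped page.
SCHEMA: Dict[str, type] = {"title": str, "year": int, "url": str}


def parse_typed(raw: str, schema: Dict[str, type]) -> Dict[str, Any]:
    """Parse a JSON string and check that each schema field is present with the right type."""
    data = json.loads(raw)
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object")
    for key, expected in schema.items():
        if key not in data:
            raise ValueError(f"missing field: {key}")
        value = data[key]
        # bool is a subclass of int in Python, so reject it explicitly;
        # this sketch's schema has no boolean fields.
        if isinstance(value, bool) or not isinstance(value, expected):
            raise ValueError(f"field {key!r} should be {expected.__name__}")
    # Keep only the declared fields so stray keys never reach the app.
    return {key: data[key] for key in schema}


good = parse_typed(
    '{"title": "A report", "year": 2025, "url": "https://example.com", "extra": 1}',
    SCHEMA,
)
print(good)
```

A record with a missing or mistyped field raises `ValueError`, which a caller could use to retry the extraction rather than store bad data.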
The most interesting architectural bet. By having the agent write code to orchestrate its tools, smolagents' open replication reached 55% pass@1 on GAIA's validation set and topped the open leaderboard — against roughly 67% for OpenAI's original. That gap is the clearest published measure of how far open deep-research has come, and how far it still has to go.
If you only do one thing with this list: clone dzhng/deep-research, read all 500 lines, and notice how much of "deep research" is just a recursion with two well-chosen knobs. Then pick the heavier repo that matches your stack. And benchmark before you trust any of them — DeepResearch Bench (100 PhD-level tasks, scored for both report quality and citation support) exists precisely because a confident, well-formatted report is the easiest thing in the world for an agent to fake. For the how, see our guide to evaluating a deep-research agent.



