The hosted versions arrived first: OpenAI and Google both shipped a "Deep Research" button that goes away for a few minutes and comes back with a cited report. The open-source ecosystem answered fast, and within months there were a dozen projects with nearly identical pitches: give it a question, it researches the web, it writes you a report. As with agent frameworks, that surface similarity hides the only decision that matters. These projects do not differ on what they produce. They differ on how the research loop is driven, and one project ran the experiment that shows how much of the result the loop structure accounts for.
Three answers to the same question
GPT Researcher is a pipeline. It does not free-run; it follows a fixed shape. A planner agent decomposes your query into research sub-questions, then execution agents scrape the relevant sources concurrently — parallelized with asyncio — summarize each with source tracking and relevance filtering, and aggregate the lot into a long-form cited report. Search defaults to Tavily, the LLM is provider-agnostic, and a recursive "Deep Research" mode adds a configurable breadth/depth tree when you want to go deeper. It is the most predictable of the three precisely because the loop is wired, not improvised.
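The fixed shape described above (plan, fetch concurrently, summarize with sources, aggregate) can be sketched in a few lines of asyncio. This is a minimal illustration under stated assumptions, not GPT Researcher's actual code: `plan`, `fetch_and_summarize`, `research`, and the `example.com` URLs are hypothetical, and the planner and scraper are stubbed so the block runs offline.

```python
import asyncio
from dataclasses import dataclass


@dataclass
class Finding:
    sub_question: str
    source: str   # kept alongside the summary so the report can cite it
    summary: str


def plan(query: str) -> list[str]:
    # Stub planner (hypothetical): a real one would ask an LLM to decompose the query.
    return [f"{query} background", f"{query} current state", f"{query} open problems"]


async def fetch_and_summarize(sub_question: str) -> Finding:
    # Stub for search + scrape + summarize; the sleep stands in for network I/O.
    await asyncio.sleep(0.01)
    slug = "-".join(sub_question.lower().split())
    return Finding(
        sub_question=sub_question,
        source=f"https://example.com/{slug}",
        summary=f"Summary for '{sub_question}'",
    )


async def research(query: str) -> str:
    sub_questions = plan(query)
    # All sub-questions are fetched concurrently; gather preserves input order.
    findings = await asyncio.gather(*(fetch_and_summarize(q) for q in sub_questions))
    lines = [f"Report: {query}"]
    for i, finding in enumerate(findings, start=1):
        lines.append(f"[{i}] {finding.summary} ({finding.source})")
    return "\n".join(lines)


report = asyncio.run(research("solid-state batteries"))
print(report)
```

The control flow never depends on what a model says at run time: the number of stages and their order are fixed in `research`, which is what makes cost and output shape predictable.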
LangChain's Open Deep Research is a graph. Built on LangGraph, it follows a supervisor pattern: a research supervisor scopes the brief and delegates to research sub-agents, each working in an isolated context window and able to spawn more for depth, before a write phase synthesizes the result. LangChain is explicit that this is not the classic ReAct loop but a reflection-based supervisor pattern. The payoff is twofold. The first is configurability: models are set per role, and search is pluggable across Tavily, native provider web search, and MCP servers. The second is observability, because every node is a step you can trace.
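To make the supervisor idea concrete, here is a plain-Python sketch that deliberately does not use the LangGraph API. Everything in it is an assumption for illustration: the `MODELS` names, the `SubAgent` class, and the rule for when to go deeper are hypothetical stand-ins for what would be LLM decisions in a real system.

```python
from dataclasses import dataclass, field

# Hypothetical per-role model assignment; the names are placeholders.
MODELS = {"supervisor": "model-a", "researcher": "model-b", "writer": "model-c"}


@dataclass
class SubAgent:
    topic: str
    model: str
    context: list[str] = field(default_factory=list)  # private to this agent

    def run(self) -> str:
        # Stub: a real sub-agent would search and reflect in a loop here.
        self.context.append(f"searched: {self.topic}")
        self.context.append(f"notes on: {self.topic}")
        # Only this compressed finding leaves the agent, not its context.
        return f"finding({self.topic})"


def write(brief: str, findings: list[str]) -> str:
    # Stub write phase: synthesizes from findings only.
    return f"{brief}\n" + "\n".join(f"- {f}" for f in findings)


def supervise(brief: str, max_depth: int = 1) -> tuple[str, list[tuple[str, int]]]:
    queue = [(f"{brief} / overview", 0), (f"{brief} / evidence", 0)]
    findings: list[str] = []
    trace: list[tuple[str, int]] = []  # one entry per delegated step
    while queue:
        topic, depth = queue.pop(0)
        agent = SubAgent(topic, MODELS["researcher"])
        findings.append(agent.run())
        trace.append((topic, depth))
        # Stub reflection step: go one level deeper on evidence topics.
        if depth < max_depth and topic.endswith("evidence"):
            queue.append((f"{topic} / primary sources", depth + 1))
    return write(brief, findings), trace


report, trace = supervise("battery recycling")
print(report)
```

Each `SubAgent` starts with an empty `context`, so one topic's scratch work never reaches another agent's window, and `trace` records every delegation as a separate inspectable step.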
Hugging Face's Open Deep Research is an agent that writes code. It is built on smolagents, whose CodeAgent emits its actions as executable Python rather than JSON tool calls. The deep-research example wires a manager CodeAgent to a managed web-browser agent whose tools are a text browser — search, visit, page up/down, find, archive lookup. The agent decides what to do next by writing a Python snippet that calls those tools, runs it, and reads the result.
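The difference between the two action formats is easier to see side by side. The sketch below is a toy, not the smolagents API: `search`, `visit`, `run_json_action`, and `run_code_action` are hypothetical, the "model output" is hard-coded, and emptying `__builtins__` before `exec` is not a real sandbox.

```python
import json

# Toy corpus standing in for the web (hypothetical URLs).
PAGES = {
    "https://example.com/a": "GAIA is a benchmark for general AI assistants.",
    "https://example.com/b": "An unrelated page about cooking.",
}


def search(query: str) -> list[str]:
    # Toy search: returns every known URL regardless of the query.
    return list(PAGES)


def visit(url: str) -> str:
    return PAGES[url]


TOOLS = {"search": search, "visit": visit}


def run_json_action(action_json: str):
    """Execute exactly one tool call described as a JSON object."""
    action = json.loads(action_json)
    return TOOLS[action["tool"]](**action["args"])


def run_code_action(snippet: str) -> dict:
    """Execute a Python snippet that can see only the tools.
    NOTE: this is an illustration, not a safe sandbox."""
    namespace = {"__builtins__": {}, **TOOLS}
    exec(snippet, namespace)
    return namespace


# JSON style: one model turn per tool call; the loop and the filter
# live outside the action format (here, in this driver code).
json_turns = 0
urls = run_json_action('{"tool": "search", "args": {"query": "GAIA benchmark"}}')
json_turns += 1
json_hits = []
for url in urls:
    text = run_json_action(json.dumps({"tool": "visit", "args": {"url": url}}))
    json_turns += 1
    if "GAIA" in text:
        json_hits.append(url)

# Code style: the loop and the branch are part of a single action.
SNIPPET = """
hits = []
for url in search("GAIA benchmark"):
    text = visit(url)
    if "GAIA" in text:
        hits.append(url)
"""
code_turns = 1
code_hits = run_code_action(SNIPPET)["hits"]

print(json_turns, code_turns, json_hits == code_hits)
```

Both paths find the same page, but the JSON path needs one round trip per tool call while the code path expresses the whole search-visit-filter step as one action.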
The three projects produce the same artifact — a cited report. They disagree on who is allowed to improvise the path to it.
The benchmark that settles the argument
Most of this space ships without numbers. GPT Researcher publishes cost and latency from its own runs; LangChain's project reports a mid-tier placing on Deep Research Bench using the RACE metric. Useful, but neither isolates the variable I care about: does the control structure actually change quality, or is it taste?
Hugging Face ran the experiment that answers it. Their Open Deep Research scored 55.15% on the GAIA validation set (OpenAI's hosted Deep Research scored 67.36% on the same set, for scale). Then they did the one thing the other projects didn't: they held the agent fixed and swapped only the action format — from code to JSON tool calls. Performance collapsed to about 33%.
Same model, same tools, same task. Switch the agent from writing code to emitting JSON, and roughly 40% of the score evaporates.
That is the non-obvious result. The 22-point gap does not come from a bigger model or a better prompt; it comes from the loop's control structure. Letting the agent express a multi-step action as a single Python snippet (loop over search results, branch on what it finds, compose tool calls) is more expressive than forcing each step through a JSON envelope. The structure of how the agent acts is doing the work. It is the same lesson the ReAct vs. plan-and-execute vs. Reflexion debate keeps circling, now with a clean number attached.
Choosing by the loop, not the logo
So pick the control structure your problem wants.
If you want a dependable report generator you can drop into a product, with predictable cost, predictable shape, and the least surprising behavior, GPT Researcher's pipeline is the safe default, and its Tavily-backed search and provider-agnostic LLM layer make it easy to slot in. If you intend to customize (swap models per role, add MCP tools, trace every hop, reshape the supervisor), LangChain's Open Deep Research gives you a graph built to be edited, and it inherits the LangGraph observability story. If you want maximum agentic autonomy and are willing to sandbox arbitrary code execution to get it, Hugging Face's Open Deep Research is the one with a published ablation behind it: on GAIA, the only benchmark here that isolates the control structure, it earns its score from its structure.
The mistake is picking by star count. GPT Researcher and smolagents sit within a few hundred stars of each other; LangChain's is younger and smaller. None of that tells you how the loop runs, and the loop is the only thing here you cannot change after you commit.



