Every agent benchmark that matters has the same secret ingredient: a cheap, objective way to know whether the agent was right. SWE-bench has the repo's own unit tests. τ-bench has the final database state. Terminal-bench has a command that either exits zero or doesn't. The oracle is the whole game — without it you can't grade at scale, and you certainly can't put a number on a leaderboard.
Deep research agents break this. Their job is to wander the open web, read a dozen pages, follow a lead, and come back with an answer. There is no test suite for "did you find the right fact," no database to diff. So the field did something clever, and then discovered the cleverness had a cost.
BrowseComp: make it hard to find, but easy to check
OpenAI's BrowseComp (1,266 questions, April 2025) solves the oracle problem by inverting difficulty. Each question is built backwards from a known, obscure answer (a specific person, date, or paper buried behind several hops of cross-referencing) and then phrased so that no single search surfaces it. The answer is hard to find but trivially easy to verify: once you have "the 1997 paper by so-and-so," a comparison against the short reference answer settles it instantly.
That inversion is what makes the benchmark work. It defeats memorization (the model is unlikely to have the fact pre-baked), it forces genuine multi-hop browsing (you have to chase the chain), and it stays cheap to grade (no rubric and no open-ended judging, just a short answer checked against a reference). The result is a spread that other benchmarks would kill for: plain GPT-4o scores about 1.9% even with browsing turned on; OpenAI o1, with no browser but stronger reasoning, reaches about 9.9% by inferring some answers from what it already knows; and Deep Research, purpose-built to search and synthesize, solves roughly 51.5%. The gap between 2% and 51% is not a difficulty knob. It is the benchmark cleanly separating "can search the web well" from "cannot."
The genius of BrowseComp is the asymmetry: the answer is a needle you can't guess, but one you can confirm in a second once you are holding it. Few other constructions are simultaneously hard to fake and cheap to grade.
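To make "cheap to check" concrete, here is a minimal sketch of the simplest possible short-answer check: compare the prediction to the reference after light normalization. The function names `normalize` and `is_correct` are illustrative assumptions, not BrowseComp's actual grader, which may accept equivalent phrasings that this strict comparison would reject.

```python
import re
import unicodedata


def normalize(answer: str) -> str:
    """Lowercase, strip accents and punctuation, collapse whitespace."""
    # Decompose accented characters, then drop the combining marks.
    text = unicodedata.normalize("NFKD", answer)
    text = "".join(ch for ch in text if not unicodedata.combining(ch))
    # Replace punctuation with spaces so "J. Smith" and "J Smith" agree.
    text = re.sub(r"[^\w\s]", " ", text.lower())
    return " ".join(text.split())


def is_correct(predicted: str, reference: str) -> bool:
    """True when the two answers are identical after normalization."""
    return normalize(predicted) == normalize(reference)


print(is_correct("  José  Saramago. ", "jose saramago"))  # True
print(is_correct("1997", "1998"))                          # False
```

The point of the sketch is the cost profile: grading is a constant-time comparison per question, with nothing to run and no rubric to interpret.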
The confound hiding inside "easy to verify"
Here is the cost. When an agent gets a BrowseComp answer right against the live web, you cannot tell why. Did its reasoning chain the clues correctly — or did its search backend just happen to return the exact page on the first query? You are grading the agent and its retriever as one inseparable blob, and the blob's score moves every time Google reranks or a page goes down. Two papers reporting BrowseComp numbers a month apart aren't even measuring the same world.
BrowseComp-Plus attacks exactly this. It keeps a subset of BrowseComp's questions but freezes the world: a curated corpus of roughly 100,000 documents with human-verified evidence pages and deliberately mined hard negatives. Every agent now searches the same fixed library, so the run is reproducible, and, crucially, you can swap the retriever while holding the reasoner constant. That lets you say "the agent reasons well but its retrieval is weak," or the reverse, instead of shipping one number that blames both. It is the same discipline that separates online from offline evaluation: stop letting two things that should be measured apart share one score.
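One way to picture the separation is a sketch that reports, per retriever and reasoner pair, both the headline accuracy and how often the evidence was retrieved at all. Everything here is an illustrative assumption rather than BrowseComp-Plus's own tooling: the `Run` record, the metric names, and the labels `"bm25"`, `"dense"`, and `"agent-a"`.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Run:
    retriever: str            # hypothetical retriever label
    reasoner: str             # hypothetical agent label
    retrieved_ids: frozenset  # documents the retriever returned
    gold_ids: frozenset       # human-verified evidence documents
    correct: bool             # was the final answer right?


def summarize(runs):
    """Group runs by (retriever, reasoner) and split the score in two."""
    buckets = defaultdict(list)
    for run in runs:
        buckets[(run.retriever, run.reasoner)].append(run)

    report = {}
    for key, group in buckets.items():
        # A "hit" means at least one gold evidence document was retrieved.
        hits = [r for r in group if r.retrieved_ids & r.gold_ids]
        report[key] = {
            "accuracy": sum(r.correct for r in group) / len(group),
            "evidence_recall": len(hits) / len(group),
            "accuracy_given_evidence": (
                sum(r.correct for r in hits) / len(hits) if hits else None
            ),
        }
    return report


runs = [
    Run("bm25", "agent-a", frozenset({"d1", "d9"}), frozenset({"d1"}), True),
    Run("bm25", "agent-a", frozenset({"d7"}), frozenset({"d2"}), False),
    Run("dense", "agent-a", frozenset({"d1"}), frozenset({"d1"}), True),
    Run("dense", "agent-a", frozenset({"d2"}), frozenset({"d2"}), False),
]
report = summarize(runs)
for key, metrics in report.items():
    print(key, metrics)
```

In this toy data both configurations score 0.5 on accuracy, yet the diagnosis differs: with `"bm25"` the miss is a retrieval failure (evidence recall 0.5), while with `"dense"` the evidence was always retrieved and the agent still got one question wrong. A single number cannot show that difference.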
The other family: grading the report, not the answer
BrowseComp and its descendant grade a single short answer. But most people who deploy a research agent don't want a fact — they want a report. Grading that is a completely different problem, and it spawned a second benchmark family.
DeepResearch Bench (100 PhD-level tasks across 22 domains) grades the long-form deliverable with two LLM-judge metrics. RACE scores report quality (comprehensiveness, insight and depth, instruction-following, readability) against adaptive criteria, and FACT checks whether the report's claims are actually supported by the citations it provides, not just plausibly worded. Google's FRAMES sits nearby, bundling factuality, retrieval, and multi-hop reasoning into one set. These measure synthesis and grounding. BrowseComp measures retrieval. A model can ace one and flop the other, which is why, exactly as with SWE-bench versus τ-bench, you cannot read a single deep-research number and know what your agent is good at.
This is also where LLM-as-a-judge earns its keep and shows its risks: report-quality grading has no exact-match oracle, so the judge is the oracle, and its biases become the benchmark's biases. FACT exists precisely to anchor the soft judgment to something checkable — does the citation say what the report claims it says.
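The idea of anchoring a soft judgment to a checkable one can be sketched as a citation-support rate: for each (claim, cited source) pair, ask a judge whether the source supports the claim, then report the supported fraction. This is a simplified sketch, not DeepResearch Bench's implementation. The `supports` callable stands in for an LLM judge, and the default `token_overlap_supports` is a crude placeholder invented for this example so that the code runs without a model.

```python
from typing import Callable, Iterable, Tuple


def token_overlap_supports(claim: str, source: str, threshold: float = 0.6) -> bool:
    """Placeholder judge: share of claim tokens that also appear in the source.

    A real setup would swap in a model call here; this heuristic is only
    a stand-in so the example is self-contained.
    """
    claim_tokens = set(claim.lower().split())
    source_tokens = set(source.lower().split())
    if not claim_tokens:
        return False
    return len(claim_tokens & source_tokens) / len(claim_tokens) >= threshold


def citation_accuracy(
    pairs: Iterable[Tuple[str, str]],
    supports: Callable[[str, str], bool] = token_overlap_supports,
) -> float:
    """Fraction of (claim, cited source text) pairs judged as supported."""
    pairs = list(pairs)
    if not pairs:
        return 0.0
    return sum(supports(claim, source) for claim, source in pairs) / len(pairs)


pairs = [
    ("the corpus has 100000 documents",
     "the frozen corpus has 100000 documents in total"),
    ("the model scored 90 percent",
     "no scores are reported here"),
]
print(citation_accuracy(pairs))  # 0.5
```

Keeping the judge behind a single callable makes its bias a swappable component: the same pairs can be rescored with a different judge to see how much of the number belongs to the report and how much to the grader.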
The metric none of them headline
Pick the family that matches your deliverable: BrowseComp / BrowseComp-Plus when the job is finding specific verifiable facts and you care about retrieval reliability; DeepResearch Bench or FRAMES when the job is a synthesized, well-cited report. But notice what every one of these leaderboards measures: accuracy on questions that have answers.
The production failure mode is the opposite. A deep-research agent that can't find the answer and invents a plausible citation is far more dangerous than one that says "I couldn't verify this" — yet an accuracy-only score rewards the confident fabricator, because a guessed answer occasionally hits while an honest abstention always scores zero. The single most useful thing you can add to any of these benchmarks costs almost nothing: a held-out slice of genuinely unanswerable questions, scored on whether the agent abstains. Until a leaderboard does that, it is measuring how often your agent is right and saying nothing about how often it lies when it's stuck — which, for a research tool, is the number that decides whether anyone can trust the output.
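A minimal sketch of that extra slice, assuming a simple record format invented for this example: `reference=None` marks an unanswerable question and `prediction=None` marks an abstention. The strict string equality used for correctness is a simplification, and the two sample agents are made-up data, not benchmark results.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class Item:
    reference: Optional[str]   # None marks a question with no findable answer
    prediction: Optional[str]  # None means the agent abstained


def score(items: Sequence[Item]) -> dict:
    """Report accuracy on answerable items and abstention on the rest."""
    answerable = [i for i in items if i.reference is not None]
    unanswerable = [i for i in items if i.reference is None]
    # An abstention on an answerable item never equals the reference,
    # so it scores zero, exactly as on an accuracy-only leaderboard.
    correct = sum(i.prediction == i.reference for i in answerable)
    abstained = sum(i.prediction is None for i in unanswerable)
    return {
        "accuracy": correct / len(answerable) if answerable else None,
        "abstention_rate": abstained / len(unanswerable) if unanswerable else None,
    }


# The guesser always answers; one of its guesses ("b") happens to hit.
guesser = [
    Item("a", "a"), Item("b", "b"), Item("c", "x"), Item("d", "y"),
    Item(None, "z"), Item(None, "w"),
]
# The honest agent answers only what it verified and abstains otherwise.
honest = [
    Item("a", "a"), Item("b", None), Item("c", None), Item("d", None),
    Item(None, None), Item(None, None),
]
print(score(guesser))  # {'accuracy': 0.5, 'abstention_rate': 0.0}
print(score(honest))   # {'accuracy': 0.25, 'abstention_rate': 1.0}
```

On accuracy alone the guesser wins, 0.5 to 0.25. The second column reverses the ranking: the guesser fabricated an answer to every unanswerable question, and the honest agent fabricated none.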



