A deep research agent does not return an answer. It returns a report — pages of structured, confident, well-cited prose on a question that has no single correct response. That is exactly what makes it useful, and exactly what makes it hard to grade. You cannot diff it against a key, because there isn't one. You cannot run a unit test, because "a good market analysis of solid-state batteries" is not a value that equals true or false. So most teams reach for the only tool that scales — hand the report to an LLM judge, ask "how good is this, 1 to 10," and ship the average.

That number is worse than useless, because it answers the wrong question. It tells you how the report reads. It tells you nothing about whether a single sentence in it is true.

The output is a report, so the evaluation has two axes

The 2026 deep-research benchmarks all converged on the same structural insight: a research report has to be scored on two axes that have almost nothing to do with each other. One is the quality of the writing and reasoning. The other is whether the claims are grounded in sources that actually say what the report says they say. A report can ace the first and fail the second, and when it does, it is the most dangerous artifact a research agent can produce: plausible, comprehensive, fluent, and wrong.

Axis one — report quality. DeepResearch Bench (Du et al., 2025) grades this with a framework called RACE: Reference-based Adaptive Criteria-driven Evaluation. The "adaptive criteria" part is the load-bearing idea. Instead of one fixed checklist applied to every report, RACE generates a task-specific rubric for each question and scores the report against a strong reference along four dimensions — Comprehensiveness, Insight/Depth, Instruction-Following, and Readability. A fixed rubric can't fairly compare a report on tax policy to one on protein folding; a regenerated rubric can. The benchmark itself is 100 PhD-level tasks, split 50 English and 50 Chinese across 22 fields, which is small but deliberately hard — and, like BrowseComp and the broader deep-research benchmark field, built specifically for agents that browse, gather, and synthesize rather than answer from memory.
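To make the "adaptive criteria" idea concrete, here is a minimal Python sketch of rubric-based scoring against a reference. It is not the benchmark's implementation: the `Criterion` type, the function names, the 0-10 judge scale, and the target-over-target-plus-reference normalization are all assumptions made for illustration. In practice the rubric and the weights would be generated per task and the scores would come from a judge model.

```python
from dataclasses import dataclass

# The four dimensions named in the text.
DIMENSIONS = ("comprehensiveness", "insight", "instruction_following", "readability")


@dataclass(frozen=True)
class Criterion:
    """One rubric line, generated per task (hypothetical structure)."""
    dimension: str  # one of DIMENSIONS
    text: str       # what the judge is asked to check
    weight: float   # importance within its dimension


def dimension_scores(rubric, judge_scores):
    """Weighted mean of the judge's scores (assumed 0-10), per dimension."""
    totals, weights = {}, {}
    for criterion, score in zip(rubric, judge_scores):
        d = criterion.dimension
        totals[d] = totals.get(d, 0.0) + criterion.weight * score
        weights[d] = weights.get(d, 0.0) + criterion.weight
    return {d: totals[d] / weights[d] for d in totals}


def overall(dim_scores, dim_weights):
    """Combine dimension scores using task-specific dimension weights."""
    total_weight = sum(dim_weights[d] for d in dim_scores)
    return sum(dim_weights[d] * s for d, s in dim_scores.items()) / total_weight


def relative_to_reference(target, reference):
    """Share of the combined score earned by the target report.

    0.5 means parity with the reference report. This normalization is an
    assumption of the sketch, chosen so scores are comparable across tasks.
    """
    if target + reference == 0:
        return 0.5
    return target / (target + reference)
```

The design point is that both the rubric and the dimension weights are inputs, not constants: a tax-policy task and a protein-folding task each get their own, and only the reference-relative result is compared across tasks.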

Axis two — grounding. This is the one teams skip, and it splits again into two numbers that get conflated constantly:

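Citation accuracy is precision: of the citations the report makes, what fraction actually back the claim they are attached to. Citation coverage is the recall side: how many verifiably supported facts the report surfaced at all, typically reported as a count per task rather than a ratio. They are not the same axis, and the best agents prove it.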

On the DeepResearch Bench leaderboard, Gemini-2.5-Pro Deep Research leads the overall RACE score (48.88) and surfaces the most effective citations — roughly 111 verifiably-supported facts per task. But the highest citation accuracy belongs to Perplexity Deep Research, at 90.24%. One agent finds the most. A different agent is the most trustworthy about what it found. If you collapse those into a single "research quality" score, you erase precisely the distinction a person relying on the report needs most.
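The two grounding numbers can be computed separately from the same set of verified citations. The sketch below assumes each claim-source pair has already been checked by a person or a judge that read the source; the `Citation` type and both function names are hypothetical, and the benchmark's own pipeline may deduplicate and verify differently.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Citation:
    claim: str        # the statement made in the report
    source_url: str   # the source the report attaches to it
    supported: bool   # verdict after actually reading the source


def citation_accuracy(citations):
    """Precision: fraction of citations whose source backs the claim."""
    if not citations:
        return 0.0
    return sum(c.supported for c in citations) / len(citations)


def effective_citations(citations):
    """Coverage: number of distinct supported claim-source pairs."""
    return len({(c.claim, c.source_url) for c in citations if c.supported})
```

Keeping these as two return values, never one, is the point: an agent that cites 200 claims with half of them unsupported and an agent that cites 30 claims with nearly all supported should not land on the same number.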

Build the rubric from how the tool is actually used

Perplexity's DRACO benchmark makes the same two-axis split but sources its tasks differently — from millions of real Deep Research production requests, sampled and then run through a five-stage pipeline that strips personal information, adds scope, filters for objectivity and difficulty, and ends in domain-expert review. Each of its 100 tasks carries a bespoke rubric averaging ~40 criteria across four axes: factual accuracy, breadth and depth, presentation quality, and citation quality. The detail worth stealing isn't the leaderboard (Perplexity, unsurprisingly, reports leading three of four dimensions on its own benchmark — read that with the appropriate eyebrow). It's the construction: rubrics grounded in real queries, written by experts, regenerated per task.

That construction also solves the quiet problem with every fixed benchmark: it rots. A frozen answer key is wrong about a fast-moving topic within months (temporal drift), and the moment a benchmark is public, its tasks leak into the next training run and scores inflate (contamination). Regenerating rubrics per task slows the first; keeping the hardest evaluation human — arena-style pairwise voting, as in DR-Arena and the Deep Research Comparator — sidesteps the second. A deep-research eval is not a thing you build once.
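Pairwise human votes still have to be turned into a ranking. The text does not say how DR-Arena or the Deep Research Comparator aggregate theirs, so the sketch below swaps in a plain Elo update as one common choice; the `k` factor, the starting rating, and the function names are assumptions.

```python
from collections import defaultdict


def expected_score(rating_a, rating_b):
    """Standard Elo expectation that A beats B."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def elo_ratings(votes, k=32.0, start=1000.0):
    """Fold a sequence of (winner, loser) votes into per-agent ratings.

    Elo is order-dependent: the same votes in a different order give
    slightly different ratings, so large arenas often shuffle and average
    or fit a batch model instead.
    """
    ratings = defaultdict(lambda: start)
    for winner, loser in votes:
        delta = k * (1.0 - expected_score(ratings[winner], ratings[loser]))
        ratings[winner] += delta
        ratings[loser] -= delta
    return dict(ratings)
```

Because every vote compares two fresh reports on a fresh question, there is no fixed answer key to go stale or to leak into training data.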

What to actually do on Monday

You will not stand up a 100-task expert benchmark this week. You don't need to. Steal the structure:
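One small-scale reading of that structure, sketched in Python: record quality and grounding as separate fields per task, and gate on grounding before quality is even considered. The `ReportEval` type, the `verdict` function, and every threshold here are hypothetical placeholders to tune on your own data, not values from either benchmark.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ReportEval:
    task_id: str
    quality: float            # rubric score in [0, 1], from a per-task rubric
    citation_accuracy: float  # fraction of checked citations that held up
    supported_facts: int      # count of distinct verified claims


def verdict(result, min_accuracy=0.9, min_facts=10, min_quality=0.6):
    """Gate on grounding first; never average the two axes together.

    All thresholds are illustrative assumptions.
    """
    if result.citation_accuracy < min_accuracy:
        return "fail: grounding (accuracy)"
    if result.supported_facts < min_facts:
        return "fail: grounding (coverage)"
    if result.quality < min_quality:
        return "fail: quality"
    return "pass"
```

A fluent report with half its citations unsupported fails here regardless of how well it reads, which is the behavior a single averaged score cannot give you.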

The temptation with deep research is to trust the artifact because it looks like work. It looks like a report a competent analyst would hand you, so the instinct is to grade it like prose. Grade it like evidence instead. The question is never "is this well-written." The question is "if I act on the third paragraph, will the source hold."