Three benchmarks dominate every "is this agent any good" conversation: SWE-bench, τ-bench, and GAIA. They get arranged like a difficulty ladder — easy, medium, hard — and a team picks the one whose leaderboard flatters their model. That's a category error. They don't measure the same thing at three difficulties. They measure three different things, and a strong score on one is nearly silent about the other two.

What each one actually grades

SWE-bench (Princeton, arXiv 2310.06770) asks one question: can the agent produce a verifiable artifact? You hand it a real codebase and a real GitHub issue, it generates a patch, and the grading is execution-based: the repo's own unit tests, including the fail-to-pass tests the fix is supposed to turn green, either pass or they don't. There's no judge model, no rubric, no partial credit for vibes. The widely used Verified subset, built with OpenAI, is 500 instances that contracted engineers hand-checked to confirm each problem is solvable and fairly graded. It is the most objective of the three precisely because the oracle is a test suite. It is also single-shot and offline: no conversation, no user, no live tools.
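
The all-or-nothing rule is easy to sketch. What follows is a minimal illustration, not the official harness: `is_resolved` and the test names are hypothetical, and it assumes you already hold a per-test pass/fail map from running the repo's suite against the patched code. SWE-bench instances carry two test lists, FAIL_TO_PASS (tests the fix must turn green) and PASS_TO_PASS (tests that must stay green).

```python
from typing import Dict, Iterable


def is_resolved(test_status: Dict[str, bool],
                fail_to_pass: Iterable[str],
                pass_to_pass: Iterable[str]) -> bool:
    """Return True only if every listed test passed after the patch.

    Hypothetical helper for illustration; the real harness also builds
    the environment, applies the patch, and parses test logs.
    """
    required = list(fail_to_pass) + list(pass_to_pass)
    # A test that is missing from the results counts as a failure.
    return all(test_status.get(name, False) for name in required)


# Made-up test names and outcomes.
status = {
    "test_parse_empty": True,     # failed before the patch, passes now
    "test_parse_basic": True,     # passed before, still passes
    "test_parse_unicode": False,  # passed before, broken by the patch
}

fixed_only = is_resolved(status, ["test_parse_empty"], ["test_parse_basic"])
with_regression = is_resolved(
    status, ["test_parse_empty"], ["test_parse_basic", "test_parse_unicode"]
)
print(fixed_only, with_regression)
```

A patch that fixes the issue but breaks one previously passing test is simply unresolved. That is the "no partial credit" property in one line.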

GAIA (Meta AI + Hugging Face, arXiv 2311.12983) asks whether the agent can chain heterogeneous tools — reasoning, web browsing, multimodality — across many steps to land on one unambiguous answer. Its 466 questions are sorted into three levels, from a couple of steps to long-horizon plans of dozens. The signature result is the gap: humans score about 92%; GPT-4 with plugins scored roughly 15% at release. That spread isn't measuring knowledge. It's measuring whether a model can execute a plan across tools without losing the thread.
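
Because each question has a single short answer, scoring can be a near-exact string comparison rather than a judgment call. The sketch below is a deliberately simplified stand-in for that idea, not GAIA's official scorer: the function names and the normalization rules (strip currency and percent signs and thousands separators from numbers, ignore case and surrounding punctuation for strings) are assumptions chosen for illustration.

```python
import re
import string
from typing import Optional


def to_number(text: str) -> Optional[float]:
    """Parse text as a number after dropping $, %, commas, and spaces."""
    try:
        return float(re.sub(r"[$%,\s]", "", text))
    except ValueError:
        return None


def normalize(text: str) -> str:
    """Lowercase, trim surrounding punctuation, collapse whitespace."""
    text = text.strip().lower().strip(string.punctuation)
    return " ".join(text.split())


def quasi_exact_match(prediction: str, truth: str) -> bool:
    """Simplified answer check: numeric equality, else normalized strings."""
    pred_num, truth_num = to_number(prediction), to_number(truth)
    if pred_num is not None and truth_num is not None:
        return pred_num == truth_num
    return normalize(prediction) == normalize(truth)


print(quasi_exact_match("$1,000", "1000"))
print(quasi_exact_match(" Paris. ", "paris"))
print(quasi_exact_match("Lyon", "Paris"))
```

The point of the design is that the difficulty lives in reaching the answer, not in grading it.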

τ-bench (Sierra, arXiv 2406.12045) asks the question the other two can't: can the agent follow a written policy across a multi-turn conversation while driving tools — and do it the same way twice? A simulated user talks to a customer-service agent in a retail or airline domain; the agent has API tools and a policy document; grading compares the final database state to an annotated goal, so saying the right thing isn't enough — the agent has to take the correct, policy-compliant actions.
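
A toy version of that state comparison shows why a polite transcript earns nothing by itself. This is a sketch under assumptions: the `reward` function, the order ID, and the record fields are all hypothetical, and the real benchmark may apply checks beyond the database comparison described here.

```python
from typing import Any, Dict


def reward(final_db: Dict[str, Any], goal_db: Dict[str, Any]) -> int:
    """1 if the database ended in exactly the annotated goal state, else 0."""
    return int(final_db == goal_db)


# Annotated goal: the order is cancelled and refunded to a gift card.
goal = {"orders": {"W123": {"status": "cancelled", "refund": "gift_card"}}}

# The agent apologized nicely but never called the cancellation tool.
talked_only = {"orders": {"W123": {"status": "pending", "refund": None}}}

# The agent cancelled, but refunded to the wrong payment method.
wrong_refund = {"orders": {"W123": {"status": "cancelled", "refund": "credit_card"}}}

# The agent took the policy-compliant actions.
compliant = {"orders": {"W123": {"status": "cancelled", "refund": "gift_card"}}}

print(reward(talked_only, goal), reward(wrong_refund, goal), reward(compliant, goal))
```

Only the last run scores. An almost-right action is graded the same as no action at all.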

The axis nobody else measures

Here is the load-bearing idea, and it's why τ-bench is the one that maps to production.

τ-bench reports pass^k: the probability that a task succeeds in all k of k independent trials. Read that twice, because it's the strict mirror image of the metric you're used to. pass@k rewards getting it right at least once in k tries; pass^k demands getting it right every time. The two coincide at k = 1 and diverge from there: as k climbs, pass@k can only rise and pass^k can only fall, and the fall is steep. The paper's own tables show state-of-the-art function-calling agents dropping below 25% at pass^8 in retail, while their single-run scores looked respectable in the low-to-mid 60s. The airline domain, with its tangle of tier- and cabin-specific rules, scores lower still.
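
To make the two metrics concrete, here is a minimal sketch that computes both from the same trial counts. It assumes each task was run n times with c successes, and uses the combinatorial estimators: C(c, k) / C(n, k) for pass^k, as described in the τ-bench paper, and 1 - C(n - c, k) / C(n, k) for pass@k. The function names and the success counts are made up for illustration.

```python
from math import comb
from typing import List


def pass_hat_k(successes: List[int], n: int, k: int) -> float:
    """pass^k: chance that k trials drawn from the n observed ALL succeeded,
    averaged over tasks. successes[i] is the success count for task i."""
    return sum(comb(c, k) / comb(n, k) for c in successes) / len(successes)


def pass_at_k(successes: List[int], n: int, k: int) -> float:
    """pass@k: chance that AT LEAST ONE of k drawn trials succeeded,
    averaged over tasks."""
    return sum(1 - comb(n - c, k) / comb(n, k) for c in successes) / len(successes)


# Four hypothetical tasks, 8 trials each: rock solid, mostly fine,
# coin flip, and never solved.
counts = [8, 6, 4, 0]
n = 8

for k in (1, 4, 8):
    print(k, round(pass_hat_k(counts, n, k), 4), round(pass_at_k(counts, n, k), 4))
# k=1: both metrics are 0.5625
# k=8: pass^8 = 0.25, pass@8 = 0.75
```

Same agent, same trials: at k = 8 one metric credits it with three tasks out of four and the other with one.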

SWE-bench and GAIA tell you how capable the agent is. pass^k tells you how often it betrays you. In a workflow that needs it right every time, the second number is the only one that matters.

This is the trap in reading agent leaderboards. SWE-bench and GAIA headline single-run, pass@1-style accuracy. That number hides the production failure mode entirely. An agent that resolves 70% of issues on a given run sounds shippable — until you put it in a loop that needs it right on the first try, every customer, every time, and discover its real-world success rate is governed by its worst run, not its best. Capability is necessary; reliability is the wall. Only τ-bench makes you look at the wall.
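
A back-of-the-envelope calculation shows how fast a respectable single-run number erodes. The sketch assumes every run is an independent draw with the same success probability p, which is a simplification: real tasks differ in difficulty, and when failures cluster on a few hard tasks, measured pass^k decays more slowly than p to the power k. Treat this as the pessimistic end of the range, not a prediction.

```python
def all_k_succeed(p: float, k: int) -> float:
    """Probability that k independent runs, each succeeding with
    probability p, all succeed. Independence is an assumption here."""
    return p ** k


for k in (1, 2, 4, 8):
    print(k, round(all_k_succeed(0.70, k), 3))
# 1 0.7
# 2 0.49
# 4 0.24
# 8 0.058
```

Under that assumption, a 70% agent gets eight runs in a row right only about 6% of the time.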

What 2026 did to the numbers

The other reason not to worship a single leaderboard cell: SWE-bench Verified is saturating and contaminated. It's been public long enough to be thoroughly exposed in training data, and audits have flagged grading and test-quality problems in its hardest instances. That's the explicit motivation for a new wave. SWE-bench Pro (Scale AI) rebuilds the task on copyleft and private repos to resist contamination and stretches to long-horizon, multi-file changes; frontier models that clear the 70s–80s on Verified land around 59% there. τ-bench is moving the same way, toward reliability under harder coordination: τ²-bench adds a dual-control telecom domain where the user also holds tools.

So stop asking which benchmark is hardest. Ask which axis you're actually buying. Shipping a coding harness? SWE-bench's verifiable oracle is your signal — but remember the score belongs to the model, not the harness. Building a deep-research agent? GAIA's tool-chaining is the test. Putting an agent in front of customers under a policy? τ-bench, and read the pass^k column, not the pass^1 one. And whatever you build, when you wire up your own eval harness, copy τ-bench's instinct: measure the agent across many runs, because the production question was never "can it" — it was "can it again."
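
In a homegrown harness, that instinct comes down to a loop and two numbers. The sketch below is one way to wire it, under assumptions: `run_trials`, `summarize`, and `stub_agent` are hypothetical names, and the stub stands in for whatever actually runs your agent on a task and returns pass or fail.

```python
from typing import Callable, Dict, List


def run_trials(tasks: List[str],
               run_once: Callable[[str, int], bool],
               n: int) -> Dict[str, List[bool]]:
    """Run every task n times, one seed per trial, and keep each outcome."""
    return {task: [run_once(task, seed) for seed in range(n)] for task in tasks}


def summarize(results: Dict[str, List[bool]]) -> Dict[str, float]:
    """Report average single-run success next to the all-runs-succeed rate."""
    n_tasks = len(results)
    single = sum(sum(runs) / len(runs) for runs in results.values()) / n_tasks
    every = sum(all(runs) for runs in results.values()) / n_tasks
    return {"mean_single_run": single, "all_runs_succeed": every}


def stub_agent(task: str, seed: int) -> bool:
    """Hypothetical stand-in for a real agent: the refund task fails on
    exactly one seed out of eight, everything else succeeds."""
    return not (task == "refund" and seed == 5)


summary = summarize(run_trials(["cancel", "refund"], stub_agent, n=8))
print(summary)
# {'mean_single_run': 0.9375, 'all_runs_succeed': 0.5}
```

A single flaky run barely dents the average and halves the reliability figure. Reporting both side by side keeps the first from hiding the second.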