Pick almost any agent benchmark and you'll find the same hidden assumption: the task is the model's to lose. SWE-bench hands it a whole repository. Terminal-Bench hands it a shell. GAIA, OSWorld, WebArena: all of them put every lever inside the agent's reach and then ask whether it pulls them in the right order. That is a real and useful thing to measure. It is also not the job most deployed agents actually have. A support agent doesn't own the customer's router. A booking agent can't click the confirmation in the user's inbox. The single most important thing the τ-bench family measures is the one almost everything else assumes away: what happens when the agent does not control the whole world.

τ-bench: keep the user in the loop, grade the end state

τ-bench — the name is Tool-Agent-User, from Yao, Shinn, Razavi and Narasimhan at Sierra in 2024 — sets up two domains, retail and airline, that look like customer service because they are. Each domain ships three things: a database of mock customers, orders and reservations; a set of API tools the agent calls to read and write that database; and a written policy document of rules the agent must obey ("you may only refund within the window," "you must verify identity before changing a booking"). Facing the agent is a user — itself an LLM — that holds a goal and dribbles it out the way real customers do, one grudging detail at a time.

The grading is the sharp part. τ-bench does not score whether the conversation sounded helpful. It runs the dialogue to its end and compares the final database state against an annotated goal state. A warm, fluent agent that cheerfully issues a refund it wasn't allowed to issue fails, flatly. This is what makes the benchmark honest: of the two questions "was the agent nice" and "did the agent do the correct thing under the rules," it scores only the second.
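The end-state check described above can be sketched in a few lines. This is a minimal illustration, not the benchmark's actual harness: the `refund_order` tool, the database layout, and `grade_end_state` are all hypothetical names invented for the example.

```python
from copy import deepcopy


def refund_order(db, order_id):
    """Hypothetical write tool: mark one order as refunded."""
    db["orders"][order_id]["status"] = "refunded"


def grade_end_state(initial_db, agent_actions, goal_db):
    """Replay the agent's write actions on a copy of the database and
    pass only if the final state equals the annotated goal state."""
    db = deepcopy(initial_db)
    for tool, kwargs in agent_actions:
        tool(db, **kwargs)
    return db == goal_db


# Assumed toy data: the policy allows a refund for A1 but not for B2.
initial = {"orders": {"A1": {"status": "delivered"},
                      "B2": {"status": "delivered"}}}
goal = {"orders": {"A1": {"status": "refunded"},
                   "B2": {"status": "delivered"}}}

correct = [(refund_order, {"order_id": "A1"})]
overeager = [(refund_order, {"order_id": "A1"}),
             (refund_order, {"order_id": "B2"})]

print(grade_end_state(initial, correct, goal))    # True
print(grade_end_state(initial, overeager, goal))  # False
```

The grader never looks at the transcript, so the overeager agent fails no matter how pleasant its wording was.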

The 2024 numbers landed hard. State-of-the-art function-calling agents solved well under half of the tasks in the harder airline domain on a single attempt, and τ-bench introduced pass^k precisely because single attempts flatter agents. pass^k asks whether all k independent runs of the same task succeed; on retail, the leading agent's pass^8 fell below roughly 25%. The agent wasn't getting dumber across those eight runs. It simply couldn't reproduce its own correct trajectory, and only repeated, end-state-graded runs make that visible.
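To make pass^k concrete, here is a small sketch of one way to estimate it from n recorded trials per task: for a task with c successes, the chance that k trials drawn without replacement all succeed is C(c, k) / C(n, k), averaged over tasks. The function name and the success counts are invented for illustration, and whether this matches any particular harness exactly is an assumption.

```python
from math import comb


def pass_hat_k(successes_per_task, n, k):
    """Estimate pass^k: the probability that k of the n recorded trials,
    drawn without replacement, ALL succeed, averaged over tasks."""
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    per_task = [comb(c, k) / comb(n, k) for c in successes_per_task]
    return sum(per_task) / len(per_task)


# Hypothetical results: four tasks, eight trials each, successes per task.
successes = [8, 6, 4, 0]

print(pass_hat_k(successes, n=8, k=1))  # 0.5625
print(pass_hat_k(successes, n=8, k=8))  # 0.25
```

In this toy data the single-attempt score looks respectable, while the all-eight-runs score credits only the one task the agent solves every time.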

τ-bench's quiet thesis: the hard part of a customer agent isn't knowing the answer, it's reliably doing the allowed thing while a human feeds you the problem sideways.

τ²-bench: now give the user hands

The 2025 follow-up, τ²-bench (Barres and colleagues, again Sierra), changes exactly one structural thing, and it's the right one. In the original, only the agent could act on the environment; the user could only talk. τ²-bench makes it dual control: the simulated user gets their own tools and can change shared state too. The paper models this formally as a Dec-POMDP, a decentralized partially observable Markov decision process in which two actors share one dynamic world and neither sees all of it. Its flagship domain is telecom troubleshooting, where the customer has to actually do things to their own device: read a code off the router, toggle a setting, reboot, report back. The agent can no longer reach in and fix the problem. It has to talk a human into fixing it, in the right order, without watching their hands.
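A toy sketch of dual control may help: one shared world, two disjoint tool sets, and a success condition that needs both actors. Every name here (`enable_line`, `toggle_airplane_mode`, `call_tool`, the state keys) is hypothetical and chosen for illustration; it is not the benchmark's API.

```python
def enable_line(world):
    """Agent-side tool (hypothetical): activate the customer's line."""
    world["line_active"] = True


def toggle_airplane_mode(world):
    """User-side tool (hypothetical): flip a setting on the user's phone."""
    world["airplane_mode"] = not world["airplane_mode"]


# Each actor may only call its own tools against the shared state.
TOOLS = {
    "agent": {"enable_line": enable_line},
    "user": {"toggle_airplane_mode": toggle_airplane_mode},
}


def call_tool(actor, name, world):
    if name not in TOOLS[actor]:
        raise PermissionError(f"{actor} has no tool named {name!r}")
    TOOLS[actor][name](world)


def has_service(world):
    return world["line_active"] and not world["airplane_mode"]


world = {"line_active": False, "airplane_mode": True}

call_tool("agent", "enable_line", world)
agent_alone = has_service(world)          # still no service

try:
    call_tool("agent", "toggle_airplane_mode", world)
    agent_blocked = False
except PermissionError:
    agent_blocked = True                  # the agent cannot reach the device

call_tool("user", "toggle_airplane_mode", world)
after_user = has_service(world)           # service restored

print(agent_alone, agent_blocked, after_user)  # False True True
```

The agent's own action is necessary but not sufficient; the task completes only once the user performs the step the agent can merely ask for.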

To keep the new tasks trustworthy, τ²-bench generates them compositionally — atomic scenarios with explicit initialization, solution, and assertion functions, combined and then kept only when a deterministic check confirms they're solvable. That's how you scale task count without quietly smuggling in unverifiable cases.
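The compose-then-verify idea can be sketched as follows. This is a simplified illustration under stated assumptions: the `Atomic` dataclass, the three toy faults, and the `is_solvable` filter are hypothetical, and the third fault deliberately ships a broken reference solution so the filter has something to reject.

```python
from dataclasses import dataclass
from itertools import combinations
from typing import Callable, Dict

State = Dict[str, bool]


@dataclass(frozen=True)
class Atomic:
    name: str
    init: Callable[[State], None]    # injects the fault
    solve: Callable[[State], None]   # reference solution
    check: Callable[[State], bool]   # assertion on the final state


def healthy() -> State:
    return {"airplane_mode": False, "data_enabled": True, "sim_seated": True}


ATOMS = [
    Atomic("airplane_mode",
           lambda s: s.update(airplane_mode=True),
           lambda s: s.update(airplane_mode=False),
           lambda s: not s["airplane_mode"]),
    Atomic("data_off",
           lambda s: s.update(data_enabled=False),
           lambda s: s.update(data_enabled=True),
           lambda s: s["data_enabled"]),
    # Deliberately broken: the reference solution does not fix the fault.
    Atomic("bad_sim",
           lambda s: s.update(sim_seated=False),
           lambda s: None,
           lambda s: s["sim_seated"]),
]


def is_solvable(atoms) -> bool:
    """Deterministic check: inject every fault, apply every reference
    solution, then require every assertion to hold."""
    state = healthy()
    for atom in atoms:
        atom.init(state)
    for atom in atoms:
        atom.solve(state)
    return all(atom.check(state) for atom in atoms)


def compose(atoms, size):
    """Combine atomic scenarios and keep only the verified combinations."""
    return [combo for combo in combinations(atoms, size) if is_solvable(combo)]


kept = compose(ATOMS, 2)
print([tuple(a.name for a in combo) for combo in kept])
# [('airplane_mode', 'data_off')]
```

Of the three possible pairs, the two that include the unverifiable fault are dropped, which is the point of checking solvability before a task enters the set.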

And then the result that should reframe how you read every agent leaderboard. Sierra reports a frontier model — GPT-4.1 — falling from roughly 74% to roughly 34% pass@1 as the same competence is forced through the dual-control setting, with finer ablations attributing an 18–25% chunk of the drop specifically to the user-must-act requirement. (Treat the headline gap and the isolated ablation as two different cuts of the same effect; live leaderboard numbers for newer models run higher and shouldn't be pinned to the paper.) The model didn't lose 40 points of intelligence between those two columns. What it lost was coordination — the ability to make a second, semi-autonomous actor it can only nudge with language perform the steps it would have trivially performed itself.

Why this is the benchmark to watch

The pattern across the two papers is one idea sharpening. τ-bench stopped treating the task as a solo exercise and put the agent in front of a user who controls the information. τ²-bench took the next step and gave that user control of some of the actions. Each move strips away a little of the omnipotence that other benchmarks grant by default, and each move costs frontier agents real points, not because the tasks got more clever, but because the agent's competence now has to survive a trip through someone else.

That trip is the actual job, and it's why, among the benchmarks that claim to predict production, the τ-style ones earn the most weight for anything user-facing. In production your agent is almost never holding every lever: the user has to click the magic link, read the code off the screen, confirm the charge, run the command you suggested. An agent that aces a solo benchmark and then can't walk a frustrated human through three steps is not a hypothetical; it's the most common way agents fail once they meet real users. If your agent talks to people and changes real state, this is the shape of eval to copy: end-state-graded so politeness can't paper over wrong actions, policy-bound so rule-following is scored, user-in-the-loop (ideally with simulated users) so coordination is tested, and reported as pass^k so you see reliability rather than a lucky single run. Pair it with trajectory-level grading when how the agent got there matters, and run it online where the users are real.

The uncomfortable summary the τ-bench line keeps proving: capability is what a model can do when the whole task is in its hands, and that's what benchmarks are built to flatter. Coordination is what it can do when half the task is in someone else's — and that's what your customers actually buy.