Here is the number that should reframe how you think about agents. On τ-bench, the agent benchmark built by Sierra to mimic real customer-service work, GPT-4o solves a retail task about 61% of the time. Run the same task eight times, though, and the chance it succeeds on all eight drops below 25%.
Those are not two ways of describing the same model. They are two different questions, and the AI industry has spent three years answering the easy one.
The demo metric and the deployment metric
The easy question is pass@1: did the agent complete the task on a single attempt? Its forgiving cousin, pass@k, asks only whether it succeeded at least once in k tries. These are the numbers in the launch post and on the leaderboard. They reward a model that gets lucky on a good run, which is exactly the run you screenshot for the demo.
Production asks the opposite question. A customer who books the wrong flight, or a pipeline that corrupts a record, does not care that the agent can succeed; they care whether it succeeds this time, and the time after that. Sierra named the metric for this — pass^k, the probability that all k independent attempts succeed — precisely because average-case scores hide the worst case that ships to users.
A 61% model that is only 25% consistent isn't a 61% model. It's a coin you have to flip eight times and pray.
The gap between pass@1 and pass^k is unreliability, quantified. And it is why a model that looks production-ready in a notebook falls apart the moment it runs unattended a thousand times a day.
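To make the two questions concrete, here is a minimal sketch of both metrics, assuming you have recorded n trials of one task with c successes. The combinatorial estimators are the standard unbiased ones; the trial counts are made up for illustration and are not τ-bench data.

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Chance that at least one of k trials, drawn from n trials with c successes, passes."""
    return 1.0 - comb(n - c, k) / comb(n, k)

def pass_hat_k(n: int, c: int, k: int) -> float:
    """Chance that all k trials, drawn from n trials with c successes, pass (pass^k)."""
    return comb(c, k) / comb(n, k)

# Hypothetical task: 8 recorded trials, 5 of them successful.
n, c = 8, 5
for k in (1, 2, 4, 8):
    print(k, round(pass_at_k(n, c, k), 3), round(pass_hat_k(n, c, k), 3))
```

At k = 1 the two numbers are identical. As k grows, pass@k climbs toward 1 while pass^k falls toward 0, and the benchmark score is the average of either metric across tasks.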
Why the math is worse than you've heard
The standard talking point is the compounding one: if each step is 95% reliable, twenty steps in a row succeed only 0.95²⁰ ≈ 36% of the time. It is a good intuition pump, and it is too generous.
That formula assumes errors are independent — that step 12 failing tells you nothing about step 13. Real agents violate this constantly. A value hallucinated in step 1 gets written into the running context and then cited as fact by every step after it; Drew Breunig calls this "context poisoning," and it is only one of four ways long contexts rot (alongside distraction, confusion, and clash). Once a wrong fact is in the agent's working memory, the per-step failure rate stops being constant and starts climbing. The clean 0.95ᴺ curve is the optimistic bound. The real one bends down faster.
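A rough sketch of the difference, under a loudly stated assumption: the 5% relative growth in failure rate per step is an invented stand-in for a poisoned context, not a measured value. Only the 0.95 per-step figure and the 20 steps come from the text above.

```python
def chain_success(step_success: list[float]) -> float:
    """Probability that every step in a chain succeeds."""
    p = 1.0
    for s in step_success:
        p *= s
    return p

N = 20

# Independent errors: the same 95% success rate at every step.
independent = chain_success([0.95] * N)

# Assumed drift: each step's failure rate is 5% higher (relative) than the
# previous step's. The 1.05 factor is illustrative only.
fail = 0.05
drifting_rates = []
for _ in range(N):
    drifting_rates.append(1.0 - fail)
    fail = min(1.0, fail * 1.05)
drifting = chain_success(drifting_rates)

print(f"independent: {independent:.3f}")
print(f"drifting:    {drifting:.3f}")
```

Whatever drift factor you pick above 1, the drifting chain finishes below the independent one, which is the sense in which 0.95ᴺ is the optimistic bound.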
A 2025 paper put a name on the shape: agent success has something like a half-life, decaying roughly exponentially with task length under a constant per-step failure hazard. METR measured the same wall from the other side: the task length today's agents can complete doubles about every seven months, but the horizon at 80% reliability is far shorter than at 50%. Demand more dependability and the set of things an agent can actually finish shrinks fast. The benchmarks echo it: on WebArena's 812 real web tasks, the best GPT-4 agent in the original evaluation completed 14.4% of them, against 78.2% for humans.
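The half-life framing can be worked through in a few lines. This is a sketch of the constant-hazard model only, not of METR's fitting procedure, and the 60-minute half-life is a hypothetical number chosen for round arithmetic.

```python
from math import log

def success_at(length: float, half_life: float) -> float:
    """Success probability under a constant hazard: it halves every half_life."""
    return 0.5 ** (length / half_life)

def horizon(target: float, half_life: float) -> float:
    """Longest task length that still succeeds with probability >= target."""
    return half_life * (log(target) / log(0.5))

T50 = 60.0  # hypothetical 50% horizon, in minutes

for target in (0.5, 0.8, 0.99):
    print(f"{target:.0%} reliable up to {horizon(target, T50):.1f} min")
```

Under this model the 80% horizon is always about a third of the 50% horizon (roughly 19 minutes here), and the 99% horizon is under 2% of it, regardless of which half-life you plug in.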
The argument the labs are having
If compounding is the disease, what's the cure? The two best-funded agent labs disagree in public, which is the most useful thing about the debate.
Anthropic published a multi-agent research system (an Opus lead delegating to Sonnet subagents) that beat single-agent Opus by 90.2% on their internal eval, with multi-agent runs burning roughly 15× the tokens of an ordinary chat interaction. Cognition, the team behind Devin, published a post titled, flatly, "Don't Build Multi-Agents," arguing that parallel subagents make conflicting decisions that compound into incoherent output.
They are both right, and the reconciliation is a single question: is the work parallelizable and read-only? Fanning out to gather information — search, read, summarize — parallelizes cleanly, because subagents don't have to agree on a shared mutable state. Writing code, or executing a transaction, does not; there the subagents' implicit decisions collide, and Cognition's "share the full trace, stay single-threaded" wins. Ask "are my steps tightly coupled and side-effecting?" before you reach for an org chart of agents.
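One way to picture that split, as a toy sketch rather than either lab's architecture: `research` and `write_report` are hypothetical placeholders for a read-only subagent and a side-effecting writer, and the thread pool stands in for whatever fan-out mechanism you actually use.

```python
from concurrent.futures import ThreadPoolExecutor

def research(topic: str) -> str:
    """Hypothetical read-only subagent: gathers notes, touches no shared state."""
    return f"notes on {topic}"

def gather(topics: list[str]) -> list[str]:
    # Read-only work fans out safely; map() returns results in input order.
    with ThreadPoolExecutor(max_workers=4) as pool:
        return list(pool.map(research, topics))

def write_report(notes: list[str]) -> str:
    # Side-effecting work stays on a single thread that sees every note,
    # so no two writers can make conflicting decisions.
    return "\n".join(notes)

report = write_report(gather(["pricing", "refund policy", "shipping"]))
print(report)
```

The parallel half only reads; the one step that produces output runs alone, with the full set of notes in hand.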
Borrow the fix from distributed systems
The reliability wins that actually ship are not new ML tricks. They are the same primitives that have kept flaky distributed systems running for a decade, applied to a flaky model:
- Checkpoint after every step. Durable-execution engines like Temporal persist state after each LLM call and tool result, so a failure resumes from the last good point instead of restarting — and you retry the failed step, with backoff, not the whole run.
- Make actions idempotent. A retry must not double-charge a card or double-send an email. Idempotency keys turn "ran twice" into a no-op.
- Keep the tool set small. Anthropic names bloated tool sets a top failure mode: "if a human engineer can't definitively say which tool should be used, an AI agent can't be expected to do better."
- Verify before you act, and gate the irreversible. A cheap validation step catches a poisoned value before it propagates; a human approval catches the action you can't take back.
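The first two primitives fit in a short sketch. This is not Temporal's API: a JSON file stands in for a durable-execution engine, and `run_with_checkpoints`, `Payments`, and the `order-42` key are all hypothetical names invented for the example.

```python
import json
import tempfile
import time
from pathlib import Path

def run_with_checkpoints(steps, state_path: Path, max_retries=3, base_delay=0.01):
    """Run (name, fn) steps in order, persisting results so a rerun resumes."""
    state = json.loads(state_path.read_text()) if state_path.exists() else {}
    for name, fn in steps:
        if name in state:  # finished in an earlier run: skip it
            continue
        for attempt in range(max_retries):
            try:
                state[name] = fn(state)
                break
            except Exception:
                if attempt == max_retries - 1:
                    raise
                time.sleep(base_delay * 2 ** attempt)  # exponential backoff
        state_path.write_text(json.dumps(state))  # checkpoint after each step
    return state

class Payments:
    """Hypothetical side-effecting tool guarded by idempotency keys."""
    def __init__(self):
        self.charges = {}

    def charge(self, key: str, amount: int) -> str:
        # A repeated key returns the first result instead of charging again.
        if key not in self.charges:
            self.charges[key] = f"charged {amount}"
        return self.charges[key]

payments = Payments()
calls = {"n": 0}

def flaky_charge(state):
    calls["n"] += 1
    result = payments.charge("order-42", 500)  # the side effect lands first
    if calls["n"] == 1:
        raise TimeoutError("simulated timeout after the charge went through")
    return result

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "state.json"
    steps = [("lookup", lambda s: "order-42"), ("charge", flaky_charge)]
    final = run_with_checkpoints(steps, path)
    rerun = run_with_checkpoints(steps, path)  # resumes from the checkpoint

print(final, calls["n"], len(payments.charges))
```

The charge step runs twice because its first attempt times out after the side effect, yet the card is charged once, and the second call to `run_with_checkpoints` executes nothing because every step is already recorded.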
None of this makes the model reliable. That is the point. You don't fix a flaky component by staring at it harder — you assume it fails and build a system that survives the failure. The agents that work in production are not the ones running the smartest model. They are the ones that treat the smartest model like a network call that might time out.



