Two agents post the same score on your eval. One you'd put in front of a customer; the other you wouldn't trust to file an expense report unsupervised. The number didn't lie — you just measured the wrong thing. The single most useful habit in agent evaluation is knowing which of two questions your metric is answering: can this agent ever solve the task, or will it solve the task every time. Those are different questions, they have different metrics, and the distance between them is exactly where production failures hide.

The two metrics, and why they point opposite directions

pass@k comes from code generation. In the 2021 Codex paper, Chen et al. defined it as the probability that at least one of k sampled solutions passes the tests, and estimated it without bias by drawing n ≥ k samples per problem and counting how many pass, which keeps the variance down compared with drawing exactly k. The defining property: pass@k increases with k. Draw more samples, and your odds of getting one right can only go up, climbing toward 1 as long as a single sample has any chance of passing. It measures a ceiling: what the model is capable of when something downstream gets to pick the winner. That's the right metric for best-of-n with a verifier: generate ten patches, run the unit tests, keep the one that's green.
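The estimator is short enough to write out. A minimal sketch, assuming you have drawn n samples for one problem and c of them passed; the function name and the example counts are illustrative, not taken from the paper's code.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate for one problem: n samples drawn, c passed."""
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("need 0 <= c <= n and 1 <= k <= n")
    # comb(n - c, k) / comb(n, k) is the chance that a random size-k subset
    # of the n samples contains no passing sample. math.comb returns 0 when
    # k > n - c, so the estimate is exactly 1.0 in that case.
    return 1.0 - comb(n - c, k) / comb(n, k)


if __name__ == "__main__":
    # Hypothetical counts: 20 samples, 3 of which pass the tests.
    for k in (1, 5, 10, 20):
        print(f"pass@{k} = {pass_at_k(20, 3, k):.3f}")
```

With the counts fixed, the printed values rise as k grows, which is the "ceiling" behavior described above. A benchmark score is this quantity averaged over problems.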

pass^k flips the quantifier. Introduced for agents in τ-bench (Yao et al., 2024), it's the probability that all k trials of a task succeed. The defining property is the mirror image: pass^k decreases with k. The more times you ask, the more chances to slip, so the number falls toward 0. It measures a floor — consistency, the thing you actually depend on when no human is standing by to retry.
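pass^k can be estimated the same way from n ≥ k trials of one task. The sketch below assumes the combinatorial form that mirrors the pass@k estimator (the chance that a random size-k subset of the trials is all successes); the function name and the counts are illustrative.

```python
from math import comb


def pass_hat_k(n: int, c: int, k: int) -> float:
    """Estimate of pass^k for one task: n trials run, c succeeded."""
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("need 0 <= c <= n and 1 <= k <= n")
    # Fraction of size-k subsets of the n trials made up only of successes.
    # math.comb returns 0 when k > c, so a single failure forces pass^n to 0.
    return comb(c, k) / comb(n, k)


if __name__ == "__main__":
    # Hypothetical counts: 8 trials of one task, 6 successes.
    for k in (1, 2, 4, 8):
        print(f"pass^{k} = {pass_hat_k(8, 6, k):.3f}")
```

At k = 1 this equals the plain success rate c / n, the same as pass@1, so the two metrics start from the same point and separate as k grows.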

pass@k measures whether an agent can. pass^k measures whether it always does. A leaderboard usually reports the first; production runs on the second.

τ-bench's own numbers make the gap concrete. State-of-the-art function-calling agents solved under half of the benchmark's tasks on a single attempt, and pass^8 in the retail domain (succeed on the same task eight times running) fell below 25%. The agent wasn't getting dumber. It just couldn't reproduce its own correct trajectory, and pass^k is the metric that makes that visible. If you'd reported pass@8 instead, the number would have gone up with k, and the same agent would have looked like it was improving.

Why "demos great, ships broken" is a math fact

The reason this gap is structural, not a quirk of one benchmark, is that agent tasks are chains. A real task is a sequence of steps, and many of them have to all go right. If each step succeeds independently with probability p, the whole n-step task succeeds about p^n of the time. That decays fast: 99% per step is 90% over ten steps and 37% over a hundred; 95% per step is ~60% over ten and under 8% over fifty. High single-step accuracy and low end-to-end reliability are not in tension — one implies the other once the chain gets long. This is the quantitative spine of why agents fail in production.
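The arithmetic is easy to check, and easy to invert. A small sketch under the same independence assumption; both function names are made up for illustration.

```python
def chain_success(p: float, n: int) -> float:
    """End-to-end success rate of n independent steps, each succeeding with probability p."""
    return p ** n


def required_step_reliability(target: float, n: int) -> float:
    """Per-step success rate needed to reach `target` end-to-end over n steps."""
    return target ** (1.0 / n)


if __name__ == "__main__":
    for p, n in [(0.99, 10), (0.99, 100), (0.95, 10), (0.95, 50)]:
        print(f"p={p}, n={n}: {chain_success(p, n):.3f}")
    # Inverse question: what does each step need for 90% over 100 steps?
    print(f"needed per step: {required_step_reliability(0.90, 100):.4f}")
```

The inverse is the sobering half: under this model, 90% end-to-end over a hundred steps needs each step to succeed more than 99.8% of the time.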

The clean p^n model is the intuition; the evidence is better than the model. Toby Ord's 2025 analysis fit METR's long-task data with a strikingly simple curve — a constant chance of failing during each minute a human would spend on the task — which makes success decay exponentially with task length, giving every model a characteristic half-life. For Claude 3.7 Sonnet that half-life is about 59 minutes: a task a human would take an hour on lands near 50% success, a two-hour task near 25%, a four-hour task near 6%. Same model, same capability, three wildly different reliabilities — the only variable is how long the chain is. METR frames the same thing from the other side: a model's 80%-reliability time horizon is several times shorter than its 50% horizon, which is why the headline benchmark number flatters work you'd never actually trust it to do unattended.
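The half-life picture reduces to one formula. A sketch assuming a pure constant-hazard (exponential) model with the 59-minute figure quoted above; the function names are illustrative, and the 80% horizon it prints is what this simplified model implies, not a measured METR number.

```python
import math


def success_at(minutes: float, half_life: float) -> float:
    """Success probability on a task a human would take `minutes` to do."""
    return 0.5 ** (minutes / half_life)


def horizon(reliability: float, half_life: float) -> float:
    """Human task length (minutes) at which success drops to `reliability`."""
    if not 0.0 < reliability < 1.0:
        raise ValueError("reliability must be strictly between 0 and 1")
    return half_life * math.log(reliability) / math.log(0.5)


if __name__ == "__main__":
    half_life = 59.0  # minutes, the Claude 3.7 Sonnet figure quoted in the text
    for minutes in (60, 120, 240):
        print(f"{minutes:>3} min task: {success_at(minutes, half_life):.2f}")
    print(f"50% horizon: {horizon(0.5, half_life):.0f} min")
    print(f"80% horizon: {horizon(0.8, half_life):.0f} min")
```

Under these assumptions the 80% horizon comes out at roughly a third of the 50% horizon, which is consistent with "several times shorter".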

One honest caveat keeps this from being a parlor trick. Real step failures aren't independent: an early mistake corrupts the context and makes later ones more likely, which is worse than the product rule predicts, while a model that notices and recovers does better than it predicts. METR cites that self-correction as one driver of rising horizons over time. So treat p^n as the baseline intuition and Ord's half-life as the empirically fitted reality. The direction is the same either way: reliability falls with length, and a single pass@1 on a short task is structurally blind to it.

What to actually do

None of this means pass@k is wrong. When you genuinely have a verifier and can deploy best-of-n — code with unit tests, retrieval with a re-ranker, a tool call you can validate before committing — pass@k is the correct metric, because you really do only need one good draw. The error is using it as a stand-in for unattended reliability, where you get one draw and have to live with it.

So measure both, and stop shipping a single number:
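One way to report both from the same set of runs, as a minimal sketch: it assumes you have repeated every task at least k times and recorded a pass/fail outcome per trial. The function name, task names, and outcomes are hypothetical.

```python
from math import comb


def reliability_report(outcomes: dict[str, list[bool]], k: int) -> dict[str, float]:
    """Average pass@1, pass@k and pass^k over tasks, given per-trial outcomes."""
    if not outcomes:
        raise ValueError("need at least one task")
    at_1 = at_k = hat_k = 0.0
    for task, trials in outcomes.items():
        n, c = len(trials), sum(trials)
        if not 1 <= k <= n:
            raise ValueError(f"task {task!r} has {n} trials, need 1 <= k <= n")
        at_1 += c / n                              # single-attempt success rate
        at_k += 1.0 - comb(n - c, k) / comb(n, k)  # at least one of k succeeds
        hat_k += comb(c, k) / comb(n, k)           # all k succeed
    m = len(outcomes)
    return {"pass@1": at_1 / m, f"pass@{k}": at_k / m, f"pass^{k}": hat_k / m}


if __name__ == "__main__":
    # Hypothetical results: three tasks, eight trials each.
    runs = {
        "refund": [True] * 8,
        "exchange": [True, True, True, False, True, True, False, True],
        "cancel": [False, True, False, False, True, False, False, False],
    }
    for name, value in reliability_report(runs, k=4).items():
        print(f"{name}: {value:.3f}")
```

Reporting the three numbers side by side shows the spread directly: on data like this, pass@4 sits well above pass@1 while pass^4 sits well below it.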

The uncomfortable summary: capability is what a model can do on a good day, and it's what benchmarks are built to flatter. Reliability is what it does on every day, including the long ones, and it's the only thing your users experience. pass@k and pass^k are the two halves of that sentence. Ship the wrong one and the agent will look ready right up until the moment it isn't.