You add a test case for the agent. Green. You push a one-line change to a comment somewhere else in the repo and the same case comes back red. You re-run the job and it's green again. The muscle memory from fifteen years of unit testing takes over: the test is flaky, quarantine it, deal with it later.

That instinct is the actual bug. Not in the test — in the mental model you brought to it.

A deterministic function has a return value. You assert on it, and the assertion is a fact. A non-deterministic agent does not have a return value for a task; it has a distribution over behaviors — the same reason writing the eval before the prompt works and reading a single output does not — which tool it calls, in what order, whether it recovers from a bad result, what it finally says. On any given task the honest quantity is not "does it pass" but "how often does it pass." A single CI run draws one sample from that distribution and reads it as a verdict. When two samples disagree, nothing broke. You just measured a rate with a sample of size one and got surprised that it moved.

A flaky agent test is almost never a broken test. It's a correct measurement, taken wrong.

You can't cheaply make the randomness go away#

The tempting escape is to remove the non-determinism at the source — set temperature to 0 and pretend you're back in unit-test land. It doesn't work, and the reason is worth internalizing because it tells you why the statistical approach is mandatory rather than optional.

Temperature 0 only makes the token-sampling step greedy. It does nothing about the layer underneath. As Thinking Machines laid out in their analysis of why LLM inference isn't deterministic, the real culprit is batch invariance: the GPU reduction kernels behind normalization, matrix multiplication, and attention produce subtly different floating-point results depending on the batch they happen to be computed in. On a shared endpoint your request is batched with strangers' requests, and the batch composition changes run to run — so the same prompt at temperature 0 can still diverge. It is fixable: batch-invariant kernels get you 1,000 bit-identical runs, and the SGLang team shipped a deterministic mode built on the same idea. But it costs — reported overheads land in the 30–60% throughput range — and, decisively, it is not a knob the hosted APIs most agents call expose to you. You do not get to turn the randomness off. So you have to test through it.

Move the assertion from a value to an interval#

If the quantity is a rate, gate on the rate. Run each case you care about k times and assert on a confidence bound over the results, not on a single outcome. This sounds expensive and philosophical until you look at how fast the uncertainty actually collapses.

The recent work on randomness in agentic evals put numbers on it: the 95% confidence interval around a measured pass rate shrinks from 14.1% at a single run to 2.97% at three runs, and only reaches 0.56% out around 28 runs. Read that curve carefully, because it's the whole strategy. A single-run result carries a ±14% error bar — which is why a 2-to-3-point "improvement" from a prompt tweak is, in that same research, frequently indistinguishable from noise. Three runs erase most of that. Past five or ten, you're spending a lot of tokens to shave off tenths of a percent. The correct default is not "run once and quarantine the flakes" and not "run it fifty times to be safe" — it's k ≥ 3, and gate on the lower bound.

Gate on the metric that matches the stakes#

Once you're running k times you have to decide what "pass" means across those runs, and this is where most teams quietly pick the flattering answer. There are two metrics and they are not interchangeable:

A demo runs once and gets to be lucky. A production agent runs on every user and has to not be unlucky. That's pass^k, and it is brutally lower than the pass@k number on the same system. This is also why the hard part of evaluating a multi-agent system is compounding: each additional hop multiplies another per-step pass rate into the product. Gate reliability-critical flows on pass^k — or, more simply, on the worst run you observed — and treat the pretty pass@k as a ceiling, not a promise. The gap between the two is the exact size of the reliability problem you'd otherwise ship.

The token bill is the real objection, and sequential testing is the answer#

Running every case k times multiplies your eval cost by k, and for a large suite that's a real number on an invoice. The naive fix — cut k — throws away the certainty you just bought. The right fix is to make k adaptive: sample a case until its confidence bound clears the pass/fail threshold, then stop. A case that's clearly good or clearly broken resolves in two or three runs; only the genuinely borderline cases spend the full budget, which is exactly where you want to spend it — and it composes cleanly with generating the cases themselves by testing the agent against simulated users. This is old statistics — sequential testing — and agent-eval harnesses have started building on it. AgentAssay, a recent regression-testing framework aimed squarely at non-deterministic agent workflows, reports 78–100% cost reductions against fixed-sample testing while holding its statistical guarantees.

None of this makes your agent deterministic, and that's the point. The flake you were about to quarantine was never noise to be silenced. It was the first, low-resolution frame of the one number you most need before you ship: how often this thing actually works. Turn the assertion into a confidence bound and the flake stops being an embarrassment in your CI log and starts being the measurement it always was.