You did the responsible thing. You wrote evals for your agent, wired them into GitHub Actions, and gated the merge on a green run. For a few days it felt like maturity. Then a pull request that touched a comment — not a prompt, not a tool, a comment — came back red. You re-ran it. Green. You added a retry. Congratulations: you have built a flaky test, and unlike a flaky unit test, no amount of retrying will fix it, because the flakiness isn't a race condition. It's the model.
The mistake is upstream of the YAML. You imported a contract from software testing — assert that this input produces that output, fail the build if it doesn't — into a system that doesn't honor contracts. A unit test works because a deterministic function has one right answer to assert against. An eval scores a stochastic system, so its pass rate is a sample from a distribution, not a fixed value. Gate a merge on a single sample and you're gating on noise.
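To make "a sample from a distribution" concrete, here is a minimal simulation. It assumes a hypothetical suite of 50 cases and an agent whose true per-case pass probability is fixed at 0.9; both numbers are invented for illustration, and real cases are neither independent nor equally hard.

```python
import random

def run_suite(rng, n_cases=50, true_pass_rate=0.9):
    """Simulate one eval run: each case passes independently with a fixed probability."""
    passes = sum(rng.random() < true_pass_rate for _ in range(n_cases))
    return passes / n_cases

# Twenty "re-runs" of the same suite against the same, unchanged agent.
rng = random.Random(0)  # seeded only so the sketch is reproducible
rates = [run_suite(rng) for _ in range(20)]

print(f"lowest run:  {min(rates):.2f}")
print(f"highest run: {max(rates):.2f}")
```

Nothing about the simulated agent changes between runs, yet the reported pass rate does. A hard threshold placed anywhere inside that spread fails some runs and passes others.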
An eval is a measurement, not an assertion
The tempting defense is "I set temperature to 0, so it's deterministic." It isn't. As Thinking Machines laid out in detail, the dominant source of nondeterminism in production inference isn't sampling — it's that the GPU kernels aren't batch-invariant. The same prompt takes a slightly different floating-point path to the logits depending on what else is in the batch with it, and under greedy decoding that can flip a token, which cascades. Your eval ran on a busy endpoint at 2pm and a quiet one at 2am and got two different numbers, and neither run was wrong. They were two draws.
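The mechanism underneath is ordinary floating-point arithmetic: addition is not associative, so summing the same terms in a different order can change the last bits of the result. The sketch below is a toy in plain Python, not a GPU kernel, and the two-token "logits" are made up so that the candidates are nearly tied.

```python
# Same three terms, two summation orders.
left_to_right = (0.1 + 0.2) + 0.3
right_to_left = 0.1 + (0.2 + 0.3)

print(left_to_right == right_to_left)  # False

def greedy_pick(logits):
    """Greedy decoding: index of the largest logit (ties go to the first index)."""
    return max(range(len(logits)), key=lambda i: logits[i])

# Hypothetical near-tie: token 0 has a fixed logit, token 1's logit is the
# sum above, computed in whichever order the reduction happened to use.
competitor = 0.6
print(greedy_pick([competitor, left_to_right]))   # 1
print(greedy_pick([competitor, right_to_left]))   # 0
```

A difference in the sixteenth decimal place is enough to change the argmax when two tokens are close, and every later token is conditioned on that choice.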
Once you accept the output is a draw, "did it pass?" is the wrong question. The right one is "did the score move more than the noise?" — and that's statistics, not a boolean.
This is the whole reframe, and Anthropic wrote the playbook for it in Adding Error Bars to Evals: treat an evaluation as an experiment, report a standard error, and test the difference between two runs rather than eyeballing two pass rates. A CI gate that ignores the error bar will reject good PRs on a downward wobble and wave through real regressions hidden inside the noise floor. The better harnesses already concede this in their config — promptfoo's GitHub Action exposes repeat (run each case N times) and repeat-min-pass (require K of N) precisely because, in its own words, LLM eval outputs are non-deterministic and random grader variance has to be tolerated, not retried away.
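One common way to put that into practice is a paired comparison: score the same cases in both runs, take the per-case difference, and compare the mean difference to its standard error. This is a minimal sketch, not the method of any particular harness; the function names, the 1.96 cutoff (roughly a 95% interval under a normal approximation), and the ten-case data are all illustrative assumptions.

```python
import math
from statistics import mean, stdev

def standard_error(scores):
    """Standard error of the mean of per-case scores (0/1 or graded)."""
    return stdev(scores) / math.sqrt(len(scores))

def paired_difference(baseline, candidate):
    """Mean per-case difference (candidate - baseline) and its standard error.

    Both lists must score the same cases in the same order.
    """
    if len(baseline) != len(candidate):
        raise ValueError("runs must score the same cases")
    diffs = [c - b for b, c in zip(baseline, candidate)]
    return mean(diffs), standard_error(diffs)

def moved_more_than_noise(delta, se, z=1.96):
    """True if the difference is larger than z standard errors."""
    return abs(delta) > z * se

# Hypothetical per-case pass/fail results for two runs over the same ten cases.
baseline = [1, 1, 0, 1, 1, 1, 0, 1, 1, 1]   # 80% pass
candidate = [1, 0, 0, 1, 1, 1, 0, 1, 1, 1]  # 70% pass

delta, se = paired_difference(baseline, candidate)
print(f"delta={delta:+.2f}, se={se:.2f}, real change? {moved_more_than_noise(delta, se)}")
```

On these ten cases a ten-point drop is one flipped case, and the standard error is about as large as the drop itself, so the comparison cannot distinguish it from noise. With a sample this small the normal approximation is rough; the point is the shape of the check, not the exact cutoff.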
Tier the suite so the cheap checks gate every PR
The fix isn't to abandon CI; it's to stop running one undifferentiated suite. Hamel Husain's widely cited evals framework sorts checks into three levels, and the load-bearing insight is that cost dictates cadence. Level 1 is assertions, the deterministic, LLM-free checks: does the output parse as JSON, match the schema, contain the required citation, satisfy the tool-call contract, exact-match the golden answer? These cost next to nothing and run in milliseconds, so they gate every commit. This is the same split promptfoo draws between deterministic assertions (contains, is-json, equals) and model-graded ones (llm-rubric), and the tier DeepEval exposes as pytest assertions you can actually fail a build on.
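A Level 1 check needs nothing beyond the standard library. The sketch below assumes a hypothetical output contract, a JSON object with an `answer` and a non-empty `citations` list; swap in whatever schema your agent actually promises.

```python
import json

# Hypothetical contract for illustration: the agent must return these keys.
REQUIRED_KEYS = {"answer", "citations"}

def check_output(raw: str) -> list:
    """Return a list of failure messages; an empty list means the output passed."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        return [f"not valid JSON: {exc.msg}"]
    if not isinstance(data, dict):
        return ["top-level value is not an object"]

    failures = []
    missing = REQUIRED_KEYS - set(data)
    if missing:
        failures.append(f"missing keys: {sorted(missing)}")
    if not data.get("citations"):
        failures.append("no citation present")
    return failures

print(check_output('{"answer": "42", "citations": ["doc-1"]}'))  # []
print(check_output('{"answer": "42"}'))
```

Because these checks are deterministic given the output, a failure here is a real contract violation rather than grader variance, which is what makes them safe to assert on in a per-commit gate.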
Level 2, the LLM-as-judge sweep, is where teams blow their budget by running it per-PR. A model-graded check makes a second inference call to grade the first, so the judge suite roughly doubles the API spend and the wall-clock time per case. Multiply by hundreds of cases across a few models and it's real money and real CI minutes on every push. So you move it off the critical path: run the judge sweep nightly, or only when a PR is labeled for merge, and send it through your provider's Batch API for the roughly 50% async discount. The PR waits on Level 1; Level 2 reports overnight.
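The arithmetic is simple enough to sketch. Every number here is a placeholder (300 cases, 3 models, 2 cents per call), and the model assumes a judge call costs the same as the call it grades, which is rarely exactly true.

```python
def sweep_cost_cents(cases, models, cents_per_call, judged=True, batch_discount=0.0):
    """Rough cost of one sweep, in cents.

    Assumes one generation call per case per model, plus one judge call
    of equal price when judged=True.
    """
    calls_per_case = 2 if judged else 1
    total_calls = cases * models * calls_per_case
    return total_calls * cents_per_call * (1 - batch_discount)

# Placeholder figures for illustration only.
unjudged = sweep_cost_cents(300, 3, 2, judged=False)
judged = sweep_cost_cents(300, 3, 2)
batched = sweep_cost_cents(300, 3, 2, batch_discount=0.5)

print(f"no judge:       ${unjudged / 100:.2f} per run")
print(f"with judge:     ${judged / 100:.2f} per run")
print(f"judge, batched: ${batched / 100:.2f} per run")
```

Per run the figures look small. The difference shows up when you multiply by pushes per day: a nightly batched sweep pays once, while a per-PR sweep pays on every push and every re-run.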
Gate on the delta, not the pass rate
When the judge sweep does run, don't assert an absolute threshold — compare to a pinned baseline. Score the PR's branch against your golden set, score main against the same set, and fail only if the number dropped by more than a tolerance you chose with the error bar in mind. That converts "is 0.84 good?" (unanswerable) into "is 0.84 worse than main's 0.87 by more than noise?" (answerable). Braintrust's eval-action ships this as a GitHub Action that posts a per-case improved/regressed diff right on the pull request — the same baseline-comparison move eval platforms are converging on. This is also the cleanest division of labor with online evals: the baseline-delta gate is your pre-merge tripwire; production scoring catches the failures you never thought to put in the golden set.
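As a rough sketch of such a gate, assuming per-case scores for main and for the branch over the same golden set: the function name, the 0.02 floor, and the 1.96 multiplier are choices made for illustration, not values taken from any tool named above.

```python
import math
from statistics import mean, stdev

def delta_gate(baseline, candidate, min_tolerance=0.02, z=1.96):
    """Fail only if the candidate dropped by more than the allowed amount.

    The allowed drop is the larger of a fixed floor and z standard errors
    of the paired per-case difference.
    """
    if len(baseline) != len(candidate):
        raise ValueError("baseline and candidate must score the same cases")
    diffs = [c - b for b, c in zip(baseline, candidate)]
    delta = mean(diffs)
    se = stdev(diffs) / math.sqrt(len(diffs))
    allowed_drop = max(min_tolerance, z * se)
    return {"delta": delta, "se": se, "allowed_drop": allowed_drop,
            "passed": delta >= -allowed_drop}

# Hypothetical 100-case golden set: main passes 87, the branch passes 84,
# and the three lost cases are ones main got right.
main_scores = [1] * 87 + [0] * 13
branch_scores = [1] * 84 + [0] * 16

result = delta_gate(main_scores, branch_scores)
print(f"delta={result['delta']:+.3f}, allowed drop={result['allowed_drop']:.3f}, "
      f"passed={result['passed']}")
```

With these invented numbers the three-point drop sits inside the allowed band, so the gate passes; a ten-point drop on the same set would not. In CI, the job would exit non-zero when `passed` is false.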
Your eval set is code, and it rots
The last trap is the quietest. A green gate is only as honest as the dataset behind it, and datasets decay. They leak into training data — the next model has effectively seen your test. They drift from what production actually sends. And they overfit, because every time something broke you added the case that caught it, until the set measures yesterday's bugs and nothing else. Worse, the judge prompt is itself an unreviewed program deciding which of your releases ship. So version the golden set like source, review the judge prompt in PRs, and re-baseline on purpose, with a note in the commit. A CI that's green against a contaminated set is worse than no CI: it tells you you're safe while measuring nothing.
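One lightweight way to make re-baselining deliberate is to store a fingerprint of the golden set and the judge prompt next to the baseline score, and refuse to compare against a baseline whose fingerprint no longer matches. This is a sketch under assumptions: the case structure and the prompt text are hypothetical, and it requires the cases to be JSON-serializable.

```python
import hashlib
import json

def fingerprint(golden_cases, judge_prompt):
    """Stable SHA-256 over the golden set and the judge prompt.

    sort_keys makes the hash independent of dict key order, so only a
    real content change produces a new fingerprint.
    """
    payload = json.dumps(
        {"cases": golden_cases, "judge_prompt": judge_prompt},
        sort_keys=True,
        ensure_ascii=False,
    ).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()

# Hypothetical golden set and judge prompt, for illustration only.
cases = [{"input": "What is the refund window?", "expected": "30 days"}]
prompt_v1 = "Grade the answer as PASS or FAIL against the expected value."
prompt_v2 = prompt_v1 + " Be strict about units."

baseline_fp = fingerprint(cases, prompt_v1)
current_fp = fingerprint(cases, prompt_v2)

if current_fp != baseline_fp:
    print("judge prompt or golden set changed: re-baseline before comparing")
```

The hash does not tell you whether the set has leaked or drifted. It only guarantees that a change to the cases or the judge cannot silently shift the number you are gating on.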
None of this is the deterministic CI you know, and pretending otherwise is what produces the flaky gate. Continuous integration for agents is continuous measurement — a control chart with a baseline, not a tripwire that asserts a boolean. You don't ship when the test passes. You ship when the number holds.



