Here is a failure mode you only see once. You tweak a system prompt to fix one annoying behavior, the change looks obviously correct, you ship it, and three days later support is on fire because the agent quietly stopped calling a tool it used to call on 8% of conversations. The diff was four words. Nothing in your dashboards moved fast enough to stop you.

The reflex is to reach for the playbook you already trust: feature-flag it, A/B test it, watch the metric. That playbook is built for changes whose effect is a single observable event — a click, a conversion, a latency number. It quietly fails for agents, and it's worth being precise about why, because the fix follows directly from the reason.

The unit of variance is a trajectory

When you A/B test a button, each user produces one clean datapoint and the metric is dense. When you change an agent, each session produces a trajectory — a branching sequence of model calls, tool invocations, recovered errors, and a final outcome — and the outcome you actually care about (ticket resolved, task completed) is sparse and noisy. A change that breaks 5% of trajectories barely dents an aggregate success rate, and reaching statistical significance on that dent can take weeks of traffic. By then the broken behavior has been live the whole time.
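To put rough numbers on that, here is a back-of-the-envelope sketch using the standard two-proportion sample-size approximation. The 70% baseline success rate, the 5% significance level, and the 80% power are illustrative assumptions, not measurements from any real system, and it assumes every broken trajectory would otherwise have succeeded.

```python
from math import ceil, sqrt
from statistics import NormalDist


def sessions_per_arm(p_base: float, p_new: float,
                     alpha: float = 0.05, power: float = 0.80) -> int:
    """Approximate sessions needed per arm to detect p_base -> p_new
    with a two-sided two-proportion z-test (normal approximation)."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_power = NormalDist().inv_cdf(power)
    p_bar = (p_base + p_new) / 2
    numerator = (
        z_alpha * sqrt(2 * p_bar * (1 - p_bar))
        + z_power * sqrt(p_base * (1 - p_base) + p_new * (1 - p_new))
    )
    return ceil(numerator ** 2 / (p_base - p_new) ** 2)


# Assumed numbers: 70% of sessions succeed today, and the change breaks
# 5% of the trajectories that would otherwise have succeeded.
baseline = 0.70
degraded = baseline * (1 - 0.05)
needed = sessions_per_arm(baseline, degraded)
print(f"success rate {baseline:.3f} -> {degraded:.3f}, "
      f"about {needed} sessions per arm")
```

Under these assumptions the requirement comes out at a few thousand completed sessions in each arm. How many days or weeks that takes depends on your traffic and on how small a slice the candidate is allowed to see.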

So the first rule is counterintuitive: the gate cannot be online. Online metrics are a backstop you keep watching, not the thing that decides whether a change ships. The decision has to be made before traffic ever sees the candidate, which means it has to be made offline, against examples you control.

An outcome metric is a smoke detector, not a seatbelt. It tells you the building is already burning. The gate has to stop you before you light the match.

Shadow mode for an agent is not shadow mode for a model

"Run it in shadow" is the standard answer, and for a classifier it's exactly right: send live traffic to both the old and new model, log both predictions, compare, never act on the shadow. You can do that because a prediction is inert.

An agent's output is not inert. It sends the email, files the refund, writes the row. You cannot dual-run a candidate against live traffic when "running" means taking the action — you'd take every action twice. So shadow mode for agents has to mean something different: replay recorded production traces against the candidate prompt or model, and score how the candidate's decisions diverge from what actually happened. Same inputs, same tool results played back from the log, no real side effects — a trace replay, not a live shadow. This is the piece teams import most carelessly from MLOps, and it's the piece that's genuinely different here.
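A minimal sketch of trace replay, assuming a hypothetical trace format and an agent exposed as a plain callable that takes the user input and a tools object. The names `RecordedTrace`, `ReplayTools`, and `first_divergence` are illustrative and do not come from any particular library. The replay layer answers tool calls from the log and stops as soon as the candidate asks for something the log cannot answer.

```python
from dataclasses import dataclass, field
from typing import Callable, Optional


@dataclass
class ToolCall:
    name: str
    args: dict
    result: str  # what the real tool returned in production


@dataclass
class RecordedTrace:
    user_input: str
    calls: list = field(default_factory=list)  # list of ToolCall, in order


class Diverged(Exception):
    """Raised when the candidate makes a call the log cannot answer."""


class ReplayTools:
    """Serves logged tool results; never touches a real system."""

    def __init__(self, trace: RecordedTrace):
        self._trace = trace
        self.step = 0  # number of logged calls matched so far

    def call(self, name: str, args: dict) -> str:
        logged = (self._trace.calls[self.step]
                  if self.step < len(self._trace.calls) else None)
        if logged is None or (logged.name, logged.args) != (name, args):
            raise Diverged(f"step {self.step}: candidate called {name}")
        self.step += 1
        return logged.result


def first_divergence(trace: RecordedTrace,
                     agent: Callable[[str, ReplayTools], None]) -> Optional[int]:
    """Index of the first step where the candidate leaves the recorded
    path, or None if it reproduces the path exactly."""
    tools = ReplayTools(trace)
    try:
        agent(trace.user_input, tools)
    except Diverged:
        return tools.step
    # Stopping early, i.e. skipping a logged call, is also a divergence.
    return tools.step if tools.step < len(trace.calls) else None


trace = RecordedTrace(
    user_input="Refund order 1042",
    calls=[
        ToolCall("lookup_order", {"id": 1042}, "status=delivered"),
        ToolCall("issue_refund", {"id": 1042}, "ok"),
    ],
)


def old_agent(text: str, tools: ReplayTools) -> None:
    tools.call("lookup_order", {"id": 1042})
    tools.call("issue_refund", {"id": 1042})


def candidate_agent(text: str, tools: ReplayTools) -> None:
    tools.call("lookup_order", {"id": 1042})
    # The candidate quietly stops calling issue_refund.


print(first_divergence(trace, old_agent))
print(first_divergence(trace, candidate_agent))
```

Halting at the first divergence is a deliberate simplification: once the candidate leaves the recorded path there is no logged result to play back, so anything past that point would need a simulated tool or a human look at the case.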

Score the path, not just the answer

Replay gives you a candidate trajectory for each recorded case. The temptation is to grade only the final answer, because final answers are easy to compare. Resist it. An agent reaches a right-looking answer through a wrong path all the time — it picks the wrong tool and gets lucky, makes three redundant calls a cheaper path would avoid, or recovers from an error in a way that won't recover next time. Trajectory evaluation — grading tool choice and the decision sequence, not just the output — surfaces a meaningfully larger share of regressions than output-only scoring does. Output-only grading is comforting precisely because it hides the failures that bite later.
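The difference between the two grading styles can be sketched in a few lines. This is an illustration under simplifying assumptions: a trajectory is reduced to an ordered list of tool names plus a final answer, answers are compared by exact string match (real graders usually use a rubric or a model), and path similarity is approximated with `difflib.SequenceMatcher` from the standard library. The `Trajectory` type and `score` function are hypothetical names.

```python
from dataclasses import dataclass
from difflib import SequenceMatcher


@dataclass
class Trajectory:
    tools: list   # tool names in call order
    answer: str


def score(reference: Trajectory, candidate: Trajectory) -> dict:
    """Grade the final answer and the path separately."""
    answer_ok = candidate.answer.strip() == reference.answer.strip()
    # Ratio in [0, 1]; 1.0 means the same tools in the same order.
    path_similarity = SequenceMatcher(
        None, reference.tools, candidate.tools
    ).ratio()
    # Calls beyond the reference length hint at redundant work.
    extra_calls = max(0, len(candidate.tools) - len(reference.tools))
    return {
        "answer_ok": answer_ok,
        "path_similarity": path_similarity,
        "extra_calls": extra_calls,
    }


reference = Trajectory(
    tools=["lookup_order", "check_policy", "issue_refund"],
    answer="Refund issued.",
)
candidate = Trajectory(
    tools=["lookup_order", "lookup_order", "lookup_order", "issue_refund"],
    answer="Refund issued.",
)

result = score(reference, candidate)
print(result)
```

Here the answers match, so an output-only grader would pass the case, while the path score falls below 1.0 and one extra call is flagged: the candidate skipped the policy check and repeated the lookup.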

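In practice this is a CI gate, and it looks ordinary once built: replay the golden set of recorded traces against the candidate, score each trajectory, and fail the build when the scores regress.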
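One way such a gate could look, as a minimal sketch rather than a reference pipeline: it assumes each golden case has already been replayed and reduced to a path score between 0 and 1, for both the current baseline and the candidate. The case IDs, the thresholds, and the `gate` function are hypothetical.

```python
from statistics import mean

# Hypothetical thresholds; tune them against your own golden set.
MIN_MEAN_PATH_SCORE = 0.90
MAX_REGRESSED_CASES = 0


def gate(baseline: dict, candidate: dict, tolerance: float = 0.05):
    """Pass only if the candidate keeps the mean score up and no golden
    case drops by more than `tolerance` versus the baseline.
    A case missing from the candidate results counts as a score of 0."""
    regressed = sorted(
        case for case, base in baseline.items()
        if candidate.get(case, 0.0) < base - tolerance
    )
    ok = (
        mean(candidate.values()) >= MIN_MEAN_PATH_SCORE
        and len(regressed) <= MAX_REGRESSED_CASES
    )
    return ok, regressed


baseline_scores = {
    "refund-001": 1.0,
    "refund-002": 1.0,
    "address-change-001": 0.95,
}
candidate_scores = {
    "refund-001": 1.0,
    "refund-002": 0.4,   # the candidate left the recorded path here
    "address-change-001": 0.95,
}

ok, regressed = gate(baseline_scores, candidate_scores)
exit_code = 0 if ok else 1
print("PASS" if ok else f"FAIL, regressed cases: {regressed}")
# In a real CI job, finish with sys.exit(exit_code) so the build fails.
```

Checking per-case regressions alongside the mean matters for the reason given earlier: a change that breaks a small share of trajectories can leave the average almost untouched.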

Only after that gate does the canary earn its place: route a small slice of live traffic to the new version, watch the online metrics and cost, and promote if nothing degrades. The canary is the backstop that catches the distribution shift your golden set didn't anticipate — real, valuable, and the last line, not the first.
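Routing the canary slice can be as simple as hashing a session identifier into a bucket. The sketch below assumes a string session ID is available at routing time; the `in_canary` function and the salt value are hypothetical.

```python
import hashlib


def in_canary(session_id: str, percent: float, salt: str = "prompt-v2") -> bool:
    """Deterministically place about `percent` percent of sessions in
    the canary. The same session always gets the same answer."""
    digest = hashlib.sha256(f"{salt}:{session_id}".encode()).digest()
    bucket = int.from_bytes(digest[:8], "big") % 10_000  # 0..9999
    return bucket < percent * 100


sessions = [f"session-{i}" for i in range(20_000)]
share = sum(in_canary(s, percent=5) for s in sessions) / len(sessions)
print(f"canary share: {share:.2%}")
```

Assigning by session rather than by request keeps a whole trajectory on one version, so a conversation never switches prompts partway through. Changing the salt for each rollout reshuffles which sessions land in the slice.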

None of this is exotic tooling. Langfuse, LangSmith, Braintrust and similar platforms support scoring a candidate against a stored dataset from CI today. The thing that's hard to import is the mental model: an agent change is a behavioral change, the gate for it is offline, the replay is a trace replay because the agent acts on the world, and the score is on the path. Get those four right and the four-word prompt edit stops being a thing you find out about from your support queue.