A model you depend on is being deprecated. OpenAI's deprecations page gives GA snapshots six months and preview models as little as two weeks; Anthropic commits to sixty days before retiring a public model. The clock is the most common reason anyone rolls out a new LLM at all — not ambition, just the floor falling out from under the old one. So you wire the new model behind the same progressive-delivery rig you'd use for any service change: canary a few percent, watch the dashboard, promote if it's green. A week later it's all promoted, the dashboard never blinked, and your support queue is full of users saying the assistant got dumber.
The rig did exactly what it was built to do. That's the problem.
The failure mode has no error code
Progressive delivery inherited one load-bearing assumption from the web services it was designed for: a bad release announces itself. It throws a 500, it spikes p99, it pegs a CPU, it crashes a pod. Your canary controller — Argo Rollouts, Spinnaker, Flagger — is a machine for noticing those announcements. It runs an analysis template that queries Prometheus for the error rate every thirty seconds and aborts the rollout if the number crosses a line.
An LLM regression makes no announcement. The worse model returns HTTP 200, inside its latency budget, with grammatical, confident prose. It just happens to hallucinate a field, drop a tool call, or answer in the wrong register. There is no exception to catch, no status code to count, no latency cliff to alarm on.
Your canary is watching the door for an intruder who comes in through a 200.
This is not a tuning problem you fix with a tighter threshold. The signals the controller knows how to read are structurally silent on the failure you care about, so it will promote a model that got measurably worse and report success while it does it. Google's SRE Workbook says the quiet part directly: "the representativeness of a canary is tightly connected to the metrics chosen for evaluation." Choose error rate as your metric and you've built a canary that's representative of nothing about model quality.
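To make the blind spot concrete, here is a minimal sketch of an infrastructure-only gate. The `Response` type, the thresholds, and the synthetic traffic are all illustrative assumptions, not any controller's real API; the `correct` field stands in for ground truth the controller never gets to see.

```python
from dataclasses import dataclass


@dataclass
class Response:
    status: int        # HTTP status code
    latency_ms: float  # end-to-end latency
    correct: bool      # ground truth, invisible to the canary controller


def error_rate(responses):
    """Fraction of responses with a 5xx status."""
    return sum(r.status >= 500 for r in responses) / len(responses)


def p99_latency(responses):
    """Approximate 99th-percentile latency by sorted index."""
    ordered = sorted(r.latency_ms for r in responses)
    return ordered[min(len(ordered) - 1, int(0.99 * len(ordered)))]


def infra_gate(responses, max_error_rate=0.01, max_p99_ms=2000.0):
    """Pass if error rate and p99 latency are within (assumed) thresholds."""
    return (error_rate(responses) <= max_error_rate
            and p99_latency(responses) <= max_p99_ms)


def accuracy(responses):
    """Share of correct answers; the gate above never looks at this."""
    return sum(r.correct for r in responses) / len(responses)


# Synthetic traffic: every response is a fast HTTP 200 in both arms.
incumbent = [Response(200, 800.0, i % 10 != 0) for i in range(1000)]  # 90% correct
candidate = [Response(200, 750.0, i % 10 < 7) for i in range(1000)]   # 70% correct

print(infra_gate(candidate), accuracy(incumbent), accuracy(candidate))
```

The candidate passes the gate while answering correctly far less often than the incumbent, because nothing the gate reads depends on `correct`.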
You have to manufacture the missing signal
The fix isn't a fancier rollout tool. It's admitting that for a stochastic system you have to generate the quality signal and feed it into the same promote/rollback decision. That signal is an online evaluation: an LLM-as-judge or a guardrail/heuristic metric, run on a sample of the canary's live traffic. LangSmith, Langfuse, Arize Phoenix, and Braintrust all ship the same shape: scorers that run asynchronously against the production stream and write a number back next to the trace. Sample 1–10% — a judge call roughly doubles your spend per scored request, and you don't need every request to see a trend — and make that score, not the 5xx rate, the thing your rollback watches.
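A rough sketch of that sampling loop, under stated assumptions: `judge` is a hypothetical stand-in for a real LLM-as-judge call, traces are plain `(trace_id, prompt, answer)` tuples, and the sampling decision hashes the trace id so it is reproducible. None of this mirrors a specific vendor's SDK.

```python
import hashlib


def in_sample(trace_id: str, rate: float) -> bool:
    """Deterministically select roughly `rate` of traces by hashing the id."""
    digest = hashlib.sha256(trace_id.encode("utf-8")).digest()
    fraction = int.from_bytes(digest[:8], "big") / 2**64  # uniform in [0, 1)
    return fraction < rate


def judge(prompt: str, answer: str) -> float:
    """Hypothetical placeholder for an LLM-as-judge call; score in [0, 1]."""
    return 1.0 if answer.strip() else 0.0


def score_stream(traces, rate=0.05):
    """Score only the sampled traces; return the list of scores."""
    return [judge(prompt, answer)
            for trace_id, prompt, answer in traces
            if in_sample(trace_id, rate)]


def eval_overhead(rate, judge_cost_ratio=1.0):
    """Added spend as a fraction of baseline spend.

    judge_cost_ratio is the cost of one judge call relative to one served
    request; 1.0 matches the 'roughly doubles' estimate for a scored request.
    """
    return rate * judge_cost_ratio


traces = [(f"trace-{i}", "question", "answer") for i in range(10_000)]
scores = score_stream(traces, rate=0.05)
print(len(scores), eval_overhead(0.05))
```

If a judge call costs about as much as the request it scores, sampling 5% adds about 5% to total spend, which is why the sample rate is the main cost lever.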
This is also where the canary connects to the gate you (should) already have before the merge. The offline eval suite that runs in CI proves the candidate is plausibly good on a frozen golden set; the online eval proves it's actually good on this week's real traffic. They're the same instrument pointed at two different populations, and you want both, because the golden set is always a little stale and production never stops drifting.
Shadow and canary answer different questions
The instinct is to order the rollout stages by risk — shadow is "safe," canary is "riskier," full rollout is "riskiest" — and ramp up your nerve as you go. That framing hides the actual reason you run all of them. The rungs differ by what signal each one can physically produce.
A shadow deployment mirrors real production inputs to the candidate and throws its answers away — no user ever sees them. That's worth a lot: you get the candidate's behavior on the messy, real distribution of inputs, at exactly zero user risk, and you can diff it against the incumbent offline. But it can never give you a user-outcome signal, because outcomes require a user, and there isn't one. Shadow answers "does it work on real inputs?" and is constitutionally incapable of answering "does it work for real users?"
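The shape of a shadow handler can be sketched as follows. The two model functions are hypothetical stubs, and the similarity measure is a plain string diff chosen only for illustration; a real setup would likely call the candidate off the request path and use a task-specific comparison.

```python
import difflib


def incumbent_model(prompt: str) -> str:
    """Hypothetical stub for the model currently serving users."""
    return f"Answer: {prompt.upper()}"


def candidate_model(prompt: str) -> str:
    """Hypothetical stub for the model under evaluation."""
    return f"Answer: {prompt.title()}"


shadow_log = []  # offline record of incumbent-vs-candidate outputs


def handle_request(prompt: str) -> str:
    """Serve the incumbent; run the candidate on the same input and log it."""
    served = incumbent_model(prompt)
    try:
        shadow = candidate_model(prompt)
        similarity = difflib.SequenceMatcher(None, served, shadow).ratio()
        shadow_log.append({"prompt": prompt, "served": served,
                           "shadow": shadow, "similarity": similarity})
    except Exception as exc:
        # A candidate failure must never reach the user; record it instead.
        shadow_log.append({"prompt": prompt, "served": served,
                           "error": repr(exc)})
    return served  # the candidate's answer is never returned
```

Note what the log cannot contain: there is no field for what the user did next, because no user ever saw the shadow answer.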
That second question is the entire reason the canary exists. The canary is the first rung where a real person acts on the candidate's output, so it's the first rung that yields an outcome: a thumbs-down, an abandoned session, a completed task, a refund. You pay for that signal by exposing a bounded slice of users to a model you're not yet sure of. So the sequence isn't timid-to-brave; it's output signal, then outcome signal. They are two different instruments, run in the order that spends user risk only once the cheaper instrument has already cleared the candidate. Blue-green, by contrast, gives you no pre-cutover signal at all; its single virtue is an instant router flip back to the old model, which is the rollback you reach for once the canary you ran first tells you to abort.
Two agent-specific traps
First: bucket deterministically. When you graduate to an A/B test, assign each user to the old or new model by hashing a stable id — user or session — the way GrowthBook and Statsig do, so the same person lands in the same bucket on every request. Random per-request assignment is fine for a stateless web page and poison for an agent: it flips the model mid-conversation, which changes the tool-calling style and the memory the agent carries forward, and your "experiment" is now measuring incoherence, not the model.
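A minimal sketch of deterministic bucketing, assuming a SHA-256 hash over an experiment name plus a stable id. The function name and parameters are illustrative, and this is not the exact hashing scheme GrowthBook or Statsig uses, only the general idea.

```python
import hashlib


def assign_arm(stable_id: str, experiment: str,
               candidate_share: float = 0.1) -> str:
    """Map a stable user or session id to an arm, identically on every call."""
    # Salting with the experiment name keeps buckets independent
    # across experiments that reuse the same ids.
    key = f"{experiment}:{stable_id}".encode("utf-8")
    digest = hashlib.sha256(key).digest()
    fraction = int.from_bytes(digest[:8], "big") / 2**64  # uniform in [0, 1)
    return "candidate" if fraction < candidate_share else "incumbent"


# Every turn of the same session resolves to the same model.
arms = {assign_arm("session-8f3a", "model-upgrade") for _ in range(5)}
print(arms)
```

Because the arm is a pure function of the id, no assignment table is needed, and a conversation cannot switch models between turns unless the experiment name or share changes.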
Second: promote on a delta, not a number. The online score is a sample from a distribution (the same nondeterminism that makes evals noisy makes this score noisy), so "the canary scored 0.84" tells you nothing on its own. Compare it to the incumbent on the same live population and promote only if it didn't drop by more than a tolerance you set in advance, with enough traffic per arm that the difference clears the noise. This is a significance test, and the mature canary tools already treat it as one: Kayenta, the canary analysis engine built for Spinnaker (Argo Rollouts can also query it as a metric provider), decides promote-vs-abort by running a Mann-Whitney U test on the baseline-versus-canary distributions, not by thresholding a single reading.
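One way to sketch a delta-based decision is below. It is a standard-library approximation under stated assumptions, not Kayenta's implementation: the Mann-Whitney U test uses the normal approximation with a tie correction and no continuity correction, and `tolerance`, `alpha`, `min_samples`, and the three-way `promote`/`hold`/`abort` outcome are illustrative choices.

```python
import math
from statistics import mean


def average_ranks(values):
    """1-based ranks, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return ranks


def mann_whitney_lower_p(canary, baseline):
    """One-sided p-value that canary scores tend to be lower than baseline."""
    n1, n2 = len(canary), len(baseline)
    combined = list(canary) + list(baseline)
    ranks = average_ranks(combined)
    u1 = sum(ranks[:n1]) - n1 * (n1 + 1) / 2  # pairs where canary outranks baseline
    n = n1 + n2
    counts = {}
    for v in combined:
        counts[v] = counts.get(v, 0) + 1
    tie_term = sum(t**3 - t for t in counts.values())
    variance = n1 * n2 / 12 * ((n + 1) - tie_term / (n * (n - 1)))
    if variance <= 0:
        return 1.0  # all scores identical: no evidence of a drop
    z = (u1 - n1 * n2 / 2) / math.sqrt(variance)
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))  # P(Z <= z)


def decide(canary, baseline, tolerance=0.02, alpha=0.05, min_samples=30):
    """Return 'promote', 'hold' (keep sampling), or 'abort'."""
    if min(len(canary), len(baseline)) < min_samples:
        return "hold"  # not enough traffic per arm to clear the noise
    if mean(canary) - mean(baseline) >= -tolerance:
        return "promote"
    p = mann_whitney_lower_p(canary, baseline)
    return "abort" if p < alpha else "hold"
```

The key design choice is that both arms are scored on the same live population during the same window, so drift in the traffic affects baseline and canary alike and cancels out of the comparison.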
Rolling out a new LLM looks like a deploy and is actually an experiment. The deploy tooling is mature and you should absolutely use it — managed by a flag layer like LaunchDarkly's AI Configs so you can flip models without shipping code. But the part that keeps you safe isn't the traffic-shifting machinery. It's the quality signal you bolt onto it, sampled from live traffic, compared to a baseline, deciding the one thing your error-rate dashboard will never tell you: not did it crash, but did it get worse.



