Here is a bug report that should not exist. A team bumps their agent from one model version to the next — same vendor, a strictly better model, nothing changes in the prompt — and a week later two things are true that nobody shipped: the monthly bill is up double digits, and the agent has started giving shorter, shallower answers to the exact prompts it used to nail. No diff to point at. The commit that "caused" it changed one string: the model name.
This is the migration nobody plans for, because it doesn't look like a change. It looks like standing still while the ground moves. And it moves for a specific, avoidable reason.
The same prompt is not the same program#
Your prompt is not a spec the model obeys — it's a program you compiled against a particular interpreter. Every "always do X," every few-shot example, every tool description was tuned, consciously or not, to the old model's habits: how verbose it ran, how deeply it reasoned by default, how it phrased a tool call, where it drew a refusal. Swap the interpreter and the same source behaves differently, because the defaults it assumed are gone.
The vendors say this out loud. OpenAI's own migration guidance is to treat a new model as a family to tune for, not a drop-in replacement — it even ships a Prompt Optimizer and tells you to start from a fresh baseline rather than port every rule from the old stack. Anthropic's migration guide opens the same way. When both frontier labs preface an upgrade with "re-tune, don't reuse," that's not boilerplate. It's the whole warning.
Three things move when you touch nothing#
Be concrete about what shifts under an unchanged prompt.
Reasoning-effort defaults. The knob that controls how hard the model thinks per call has a default, and the default changes between versions. GPT-5.5 moved its default effort to medium; transitioning from a version that defaulted high, the same prompt now reasons less. Anthropic's Sonnet 4.6 defaults to high effort where Sonnet 4.5 had no effort parameter at all. So depending on which way you're migrating, your agent silently gets shallower or slower — and you never typed the change.
Token counts. Tokenization is not stable across versions. Anthropic warns that the same text produces a higher token count on Opus 4.7 than on 4.6, and advises re-running count_tokens on a representative sample instead of applying a blanket multiplier. An identical prompt therefore costs more, adds latency, and sits closer to its context ceiling on the new model — the Sonnet 5 tokenizer tax is the same phenomenon with a price tag. Your budgets were computed against the old tokenizer; they're now wrong.
Format and refusal calibration. How reliably the model emits your JSON schema, how it names tools, which edge cases it refuses — all of it was tuned against the old model. None of it is guaranteed to survive the jump, and the failures are quiet: a schema that conforms 99% of the time instead of 100%, a tool the model now skips on the margin.
A model upgrade is the only change that ships itself. You approve the version string; the behavior change rides along for free, and you find out from your metrics.
Freeze the baseline before you switch — or you can't prove anything#
Here's the operational core, and it's a sequencing rule, not a tool: capture your eval baseline on the OLD model first. The instant you cut over, the old model's behavior on your workload is gone, and every "did this regress?" question becomes an argument about memory.
So do it in order. Freeze a golden set — a versioned collection of representative cases with the behavior you want, 30 to 50 to start, grown from every incident. Score it on the current model now, while you still can, and that becomes your baseline. Run the candidate model against the identical frozen set, and diff behavior — tool choice, format conformance, refusal edges, cost per case — not just whether the final answer looks right. This is the same discipline as shipping any agent change behind an eval gate and the same care migrating embedding models demands, applied to the generation model. Only after the gate passes do you canary: route 5–10% of live traffic to the new model, watch error rate, latency, and cost for a day or two, step to 50%, then cut over. The canary catches the distribution shift your golden set didn't predict; the offline gate is what decides whether you canary at all.
The clock is not optional#
You might be tempted to skip all this by simply not migrating. That option is closing. In February 2026, OpenAI forced a hard API cutover of GPT-4o, GPT-4.1, and o4-mini — no six-month legacy snapshot, calls returning deprecation errors — and it has scheduled GPT-5 variants for deprecation on July 23, 2026. Model availability is now a supply-chain dependency with an expiry date stamped on it. "Stay on the model that works" is a plan, but it's a plan with a deadline, and the deadline is set by someone else.
The migration you do on your schedule is a Tuesday. The one you do the week your model 404s is an incident. The difference between them is entirely whether you froze the baseline while you still had it.



