Somewhere in your continuous-integration pipeline, a number went down this week. The regression test that grades your agent's answers — a strong model reading each output against a rubric and returning a score — came back lower than last week. The obvious reading is that something you shipped made the agent worse. Before you start bisecting commits, sit with a more uncomfortable possibility: the thing that changed might not be your agent. It might be the judge.

LLM-as-a-judge has become the default way teams score agents at scale, for a good reason — human grading doesn't fit inside a CI run, and a capable model applying a rubric is fast, cheap, and consistent enough to gate a release. The move that makes it work is to treat the judge's grade as ground truth. That move is also the bug. Ground truth is supposed to be fixed. This ground truth is a model behind an API, and it moves.

The undeclared floating dependency

Every dependency your build relies on is pinned somewhere — a lockfile, a container digest, a version constraint. The judge model is the one dependency most eval suites never write down. You call gpt-whatever or claude-whatever, the provider ships a new snapshot or adjusts its tuning, and the grades on identical content shift. Nobody edited your rubric. Nobody touched your prompt. The measuring instrument recalibrated itself and didn't tell you.

A drop in an LLM-judge score is ambiguous by construction: it could mean your system regressed, or it could mean the judge did. The same alarm fires for both, and the alarm can't tell you which.

This is worse than ordinary flakiness, because it's directional and it's slow. A judge that drifts a little stricter over a quarter will quietly make every system look like it's decaying, and you'll go looking for a regression that was never in your code.

"Just pin the version" is half an answer

The first instinct is correct and insufficient: pin the judge. Freeze the exact model snapshot, freeze the scoring prompt, freeze the rubric, and version the three together so a score is reproducible. Do this. For as long as the pin holds, it converts silent drift into no drift, which is a real gain: any score delta you see is now attributable to your system rather than to the judge.
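One way to version the three pieces together is to hash them into a single fingerprint and store it next to every score. The sketch below is a minimal illustration, not a prescribed implementation: `JudgeSpec`, `fingerprint`, `record_score`, and the snapshot ID are hypothetical names invented for this example.

```python
import hashlib
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class JudgeSpec:
    """Everything that determines how the judge grades (hypothetical structure)."""
    model_snapshot: str   # an exact, dated snapshot ID, never a floating alias
    scoring_prompt: str
    rubric: str

    def fingerprint(self) -> str:
        # Hash all three fields together so a change to any one of them
        # produces a different judge version.
        payload = json.dumps(asdict(self), sort_keys=True).encode("utf-8")
        return hashlib.sha256(payload).hexdigest()[:12]


def record_score(spec: JudgeSpec, example_id: str, score: float) -> dict:
    # Store the fingerprint with each score so two scores are only compared
    # when the same judge version produced them.
    return {"example_id": example_id, "score": score, "judge": spec.fingerprint()}


spec = JudgeSpec(
    model_snapshot="example-judge-2025-01-01",  # placeholder, not a real model ID
    scoring_prompt="Grade the answer against the rubric. Return a score from 1 to 5.",
    rubric="5 = fully correct and complete; 1 = wrong or off-topic.",
)
row = record_score(spec, example_id="case-001", score=4.0)
```

With the fingerprint stored on every row, a comparison across two different fingerprints is visibly a comparison across two different instruments.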

The catch is the expiry date. Every major lab now retires old model versions on a schedule; a pinned judge is a countdown. When the snapshot you pinned is deprecated, you're forced onto a new one, and the new judge does not grade like the old judge. Pinning doesn't remove the migration — it collects all the drift you avoided and hands it to you at once, on the provider's timetable, as a hard re-baselining event. Better to face it deliberately than to absorb it silently, but "pin it" is a way to schedule the problem, not solve it. And it does nothing for the question you actually have on the day the number moves: which thing moved?

Make it an attribution problem

The more honest framing comes from a 2026 paper with an apt title, "Who Drifted: the System or the Judge?". Its point is that stability is the wrong goal; attribution is the right one. You will never stop the judge from changing. What you can do is always know when it did.

The mechanism is almost boring, which is why it works. Keep a fixed set of examples with trusted human labels: an anchor set. At a steady cadence, interleaved with your live evals, have the current judge re-score that anchor set. Then run an anytime-valid test on the gap between the judge's grades and the human labels. The "anytime-valid" part matters, because you're peeking at this continuously, and a naïve significance test rechecked on every run would cry wolf. A guard-window rule turns the result into a verdict: none (nothing moved), system (your agent moved), or judge (the instrument moved). When your product score drops but the anchor set still matches its human labels, the regression is real. When the anchor set itself slips against the humans, the judge drifted, and you stop bisecting commits that were never the problem.
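As a rough illustration of the anchor-set check, the sketch below uses one simple anytime-valid construction: a betting-style test whose "wealth" grows each time the judge disagrees with a human label. It is not the paper's procedure. The class and function names, the fixed bet size, and the `min_agreement` threshold are assumptions made for this example, and the verdict function is a simplified stand-in that does not reproduce the guard-window rule.

```python
from dataclasses import dataclass


@dataclass
class AnchorDriftMonitor:
    """Sequential test of H0: judge-vs-human agreement rate >= min_agreement."""
    min_agreement: float   # assumed: lowest agreement rate still counted as "no drift"
    alpha: float = 0.05    # bound on the false-alarm probability under H0
    bet: float = 0.5       # fixed bet size; must lie in (0, 1 / (1 - min_agreement))
    wealth: float = 1.0
    flagged: bool = False

    def __post_init__(self) -> None:
        if not 0.0 < self.min_agreement < 1.0:
            raise ValueError("min_agreement must be strictly between 0 and 1")
        if not 0.0 < self.bet < 1.0 / (1.0 - self.min_agreement):
            raise ValueError("bet is too large; wealth could go negative")

    def update(self, judge_agrees_with_human: bool) -> bool:
        x = 1.0 if judge_agrees_with_human else 0.0
        # Under H0 this factor has expectation <= 1, so wealth is a nonnegative
        # supermartingale. By Ville's inequality it reaches 1/alpha with
        # probability at most alpha, no matter how often we look.
        self.wealth *= 1.0 - self.bet * (x - self.min_agreement)
        if self.wealth >= 1.0 / self.alpha:
            self.flagged = True   # latch: stays flagged until the judge is re-audited
        return self.flagged


def attribute(product_score_dropped: bool, judge_flagged: bool) -> str:
    # Simplified verdict: if the instrument is flagged, the product score
    # cannot be trusted, so the judge takes precedence.
    if judge_flagged:
        return "judge"
    if product_score_dropped:
        return "system"
    return "none"


monitor = AnchorDriftMonitor(min_agreement=0.9)
for agrees in [True, True, False, True, True]:
    monitor.update(agrees)
verdict = attribute(product_score_dropped=True, judge_flagged=monitor.flagged)
```

A fixed bet keeps the sketch short; adaptive bets can detect small shifts faster at the cost of more machinery. The test is one-sided by design, since the goal here is only to catch the judge falling away from the human labels.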

The cost is the honest part: you have to build and maintain a human-labeled anchor set. That's the tax on trusting an automated judge, and it's the tax most pipelines skip. A survey of the field makes the gap explicit — the literature obsesses over what the judge says about the system, and almost never subjects the judge itself to scrutiny for accuracy, stability, and bias. Tooling is catching up: platforms like Braintrust and Langfuse now version rubrics and prompts and trace every judge call, which is the infrastructure this requires.

The one idea worth keeping: the judge is an instrument, and instruments need a calibration standard. A scale you never check against a known weight isn't measuring — it's guessing with a decimal point. A judge you haven't re-audited against humans since the day you deployed it is doing the same thing. The number it prints looks exactly as authoritative either way. That's the danger.