The appeal of using one language model to grade another is that it dissolves the most expensive part of evaluation. No annotators, no rubric meetings, no week of turnaround — you write a prompt, pass it two answers, and read back a winner. It feels like measurement. It is closer to asking a confident stranger who skimmed the question.
The problem is not that LLM judges are unreliable in some vague way. It is that their unreliability is systematic and now well measured, which means it has direction, magnitude, and (this is the part most teams miss) a specific address. Three biases are large enough to flip a leaderboard, and they do not live in the same place. Treating them as one problem is why most "we use an LLM judge" pipelines quietly lie to the people reading their dashboards.
Position bias lives in the prompt
Show a judge two answers and it favors a slot, not just a response. In the MT-Bench work that established the LLM-as-a-judge baseline, Zheng and colleagues found that GPT-4 returned the same verdict only about 65% of the time when they swapped which answer came first. In roughly a third of comparisons, the verdict depended on serialization order rather than on the answers.
This bias is the lucky one because it lives entirely in the prompt. The judge never sees "answer A" and "answer B" as abstract objects; it sees a token sequence in which one of them is physically first. So the fix is mechanical: run the comparison in both orders and count a win only when the same answer wins both times; otherwise call it a tie. That is the swap-and-average protocol from the same paper, and it keeps a purely positional preference from ever registering as a win. The cost is exactly one extra inference call per comparison: you pay 2x to make the number honest. For an eval you will rerun a thousand times, that is the cheapest correctness you will ever buy.
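The both-orders rule is small enough to sketch in a few lines. This is a minimal illustration, not the paper's code: the `Judge` callable and its `"first"` / `"second"` / `"tie"` return values are assumptions made for the example, standing in for whatever wraps your actual model call.

```python
from typing import Callable

# Hypothetical judge interface: (question, first_answer, second_answer)
# -> "first", "second", or "tie". Replace with your own model wrapper.
Judge = Callable[[str, str, str], str]


def debiased_verdict(judge: Judge, question: str, answer_a: str, answer_b: str) -> str:
    """Judge in both orders; return "A", "B", or "tie"."""
    forward = judge(question, answer_a, answer_b)   # A occupies the first slot
    backward = judge(question, answer_b, answer_a)  # B occupies the first slot
    if forward == "first" and backward == "second":
        return "A"  # A won from both slots
    if forward == "second" and backward == "first":
        return "B"  # B won from both slots
    return "tie"    # verdict followed the slot, or the judge itself said tie


def always_first(question: str, first: str, second: str) -> str:
    """Toy judge with pure position bias: always picks the first slot."""
    return "first"


def prefers_longer(question: str, first: str, second: str) -> str:
    """Toy judge whose verdict depends on content, not on slot."""
    return "first" if len(first) > len(second) else "second"
```

With these toy judges, `always_first` can only ever produce ties, while `prefers_longer` gives the same winner in both orders. Note what that second case shows: swapping cancels the slot, not whatever else the judge happens to like.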
Verbosity bias lives in the model's preferences
Now hold quality equal and make one of the answers longer. The judge tends to prefer the longer one. This is not a prompt-ordering quirk you can swap away, because the preference is in the model, not the layout.
The cleanest evidence is what it took to correct it. AlpacaEval's length-controlled win rate does not ask the judge to be fair about length — it can't be made to. Instead it statistically regresses length out of the score after the fact. The payoff is the tell: doing so raised AlpacaEval's correlation with human Chatbot Arena rankings from 0.94 to 0.98. The length signal was real, it was corrupting the metric, and the only thing that removed it was changing the measurement, not the prompt.
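To make "regress length out" concrete, here is a deliberately simplified sketch. It is not AlpacaEval's implementation, which fits a richer model; the function name, the `tanh` squashing, and the `scale` parameter are assumptions chosen for illustration. The idea is to fit a logistic model of the judge's verdicts with a length-difference term, then read off the predicted win rate at zero length difference.

```python
import math
from typing import Sequence


def _sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def length_controlled_win_rate(
    wins: Sequence[int],
    length_diffs: Sequence[float],
    scale: float = 100.0,
    steps: int = 2000,
    lr: float = 0.5,
) -> float:
    """Fit P(win) = sigmoid(theta + beta * tanh(diff / scale)) by gradient
    descent and return the prediction at diff = 0 (lengths held equal)."""
    if not wins or len(wins) != len(length_diffs):
        raise ValueError("need one length difference per verdict")
    xs = [math.tanh(d / scale) for d in length_diffs]
    n = len(wins)
    theta = beta = 0.0
    for _ in range(steps):
        grad_theta = grad_beta = 0.0
        for won, x in zip(wins, xs):
            err = _sigmoid(theta + beta * x) - won  # gradient of the log-loss
            grad_theta += err
            grad_beta += err * x
        theta -= lr * grad_theta / n
        beta -= lr * grad_beta / n
    return _sigmoid(theta)  # the length term is switched off at diff = 0


# Made-up verdicts: the candidate wins 8 of 10 when it is 200 tokens longer
# and 1 of 5 when it is 200 tokens shorter.
wins = [1] * 8 + [0] * 2 + [1] * 1 + [0] * 4
length_diffs = [200.0] * 10 + [-200.0] * 5
raw = sum(wins) / len(wins)
controlled = length_controlled_win_rate(wins, length_diffs)
```

On this toy data the raw win rate is 0.6, but the candidate's edge is entirely explained by being longer more often, so the controlled estimate should land near 0.5. The judge prompt never changed; only the measurement did.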
A prompt-level bias gets a prompt-level fix. A preference baked into the weights does not — you change what you measure, or you change who measures it.
Formatting bias is the same shape. The "From Lists to Emojis" study showed preference models, GPT-4 included, can be pushed around by bullet points, bold, and emojis independent of content — exploitable enough to inflate alignment-benchmark rankings on style alone. You normalize formatting before judging; you do not instruct it away.
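A rough sketch of what normalizing before judging can look like. The regular expressions below are heuristics picked for this example, not a complete Markdown or emoji parser, and `normalize_formatting` is a hypothetical helper name. They will miss some styling and can occasionally touch literal asterisks in content.

```python
import re

# Heuristic patterns (assumptions for illustration, not exhaustive).
_EMOJI = re.compile("[\U0001F300-\U0001FAFF\u2600-\u27BF\uFE0F]")
_HEADING = re.compile(r"^[ \t]*#{1,6}[ \t]+", re.MULTILINE)
_BULLET = re.compile(r"^[ \t]*(?:[-*+\u2022]|\d+[.)])[ \t]+", re.MULTILINE)
_EMPHASIS = re.compile(r"(\*\*|__|\*)(.+?)\1")


def normalize_formatting(text: str) -> str:
    """Strip emojis, heading and list markers, and bold/italic markup,
    then collapse whitespace so only the words reach the judge."""
    text = _EMOJI.sub("", text)
    text = _HEADING.sub("", text)
    text = _BULLET.sub("", text)
    text = _EMPHASIS.sub(r"\2", text)
    lines = (re.sub(r"[ \t]+", " ", line).strip() for line in text.splitlines())
    return "\n".join(line for line in lines if line)


plain = "Use a cache.\nInvalidate on write."
styled = "**Use a cache.** \U0001F680\n- Invalidate on write. \u2705"
```

Both answers go through the same function before the judge sees them, so `styled` and `plain` above reduce to identical text and style alone can no longer separate them.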
Self-preference lives in the weights
The deepest one: a judge scores its own outputs higher. Zheng's paper put rough numbers on it — GPT-4 favored itself by about 10% in win rate, Claude-v1 by about 25%. The interesting question is why, and two papers from 2024 converge on an uncomfortable answer.
Wataoka and colleagues found judges assign higher scores to lower-perplexity text — text that is more predictable under their own distribution, regardless of who actually wrote it. A model's own generations are, by construction, low-perplexity to that model. And Panickssery and colleagues showed the strength of a model's self-preference is linearly correlated with how well it can recognize its own outputs — and that altering self-recognition via fine-tuning shifts the self-preference with it. That is a causal-looking link, and it explains why no instruction fixes this: the judge is not being vain, it is rewarding familiarity, and its own writing is the most familiar text in the world to it.
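Perplexity here is just the exponentiated average negative log-probability per token. A minimal sketch, assuming you can obtain per-token natural-log probabilities for a piece of text under the judge model (not every API exposes them); the numbers in the example are made up.

```python
import math
from typing import Sequence


def perplexity(token_logprobs: Sequence[float]) -> float:
    """exp of the mean negative log-probability; lower means the text is
    more predictable to the model that produced the log-probs."""
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


# Made-up log-probs: the judge's own text tends to look predictable to it,
# another model's text less so.
own_text = [math.log(0.5)] * 4
other_text = [math.log(0.25)] * 4
```

Logging this number next to each judge score is one way to check whether your judge's preferences track familiarity rather than quality.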
The asymmetry is the whole point
Lay the three side by side and a single rule falls out — the rule that should govern how you treat any judge bias you read about next. Ask where it lives.
- In the prompt? (Position.) A prompt-level intervention removes it cleanly. Swap and average.
- In the model's preferences or weights? (Verbosity, formatting, self-preference.) No prompt removes it. You either change the measurement (regress length out, normalize formatting) or change the judge (a different model family that cannot recognize the candidate, and that you never let grade its own kind).
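The "change the judge" branch can be enforced in code rather than by convention. A minimal sketch: the model names and the `family_of` mapping are hypothetical, and deciding what counts as a family (shared base model, shared training data, same vendor) is a judgment call this snippet does not make for you.

```python
from typing import List, Mapping, Sequence


def eligible_judges(
    candidate: str,
    judges: Sequence[str],
    family_of: Mapping[str, str],
) -> List[str]:
    """Return the judges that do not share a model family with the candidate.
    Raises KeyError for a model missing from the mapping, so an unlabeled
    model cannot silently slip through."""
    candidate_family = family_of[candidate]
    return [judge for judge in judges if family_of[judge] != candidate_family]


# Hypothetical model names and family labels, for illustration only.
family_of = {"acme-small": "acme", "acme-large": "acme", "zenith-1": "zenith"}
allowed = eligible_judges("acme-small", ["acme-large", "zenith-1"], family_of)
```

Here only `zenith-1` survives as a judge for `acme-small`; the larger sibling is excluded even though it is a different model.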
This is also why the catalogs keep growing — "Justice or Prejudice?" now enumerates twelve distinct judge biases and scores models on a robustness rate against them. The list is useful, but the taxonomy that matters for a working pipeline has only two rows: biases you can fix with a better prompt, and biases you can only fix with a better experiment. The first row is short. Budget accordingly: swap your orders, control your lengths, and never, ever let a model be the judge of itself. And once you have corrected the score, ask the harder question the number still hides — whether you should be grading the final answer or the whole trajectory that produced it.



