LLM-as-a-Judge: How to Build an Eval That Doesn't Quietly Lie to You

Somewhere in your CI pipeline there is a number. It is between 0 and 1, it went up after last week's prompt change, and a teammate put it on a slide. The number came from a model grading your model. It feels like measurement. It is closer to asking a colleague who never reads the rubric to rate the work of a colleague who never reads the rubric — quickly, at scale, with total confidence.

LLM-as-a-judge is the most useful evaluation technique the field has, and the most quietly misused. The usefulness is real: human grading does not scale, and reference-overlap metrics like BLEU or ROUGE measure shared n-grams, not whether the answer is good. A capable model reading an answer against a rubric correlates with human judgment far better than any n-gram score. The canonical study here, MT-Bench, found that a strong judge (GPT-4) agrees with human raters more than 80% of the time, which is roughly how often two humans agree with each other. At that point the judge isn't a weak proxy. It's a peer.

The misuse is in what people skip on the way to the number.

You are not measuring your system. You are measuring an agreement.

Here is the reframing that changes how you build evals: an LLM judge does not score your model. It reports how well your model's output agrees with a second model's opinion of what a good answer looks like. Everything that's wrong with the second model is now baked into your metric, silently, in a direction you can predict.

The same MT-Bench paper that reported the cheerful 80% also catalogued the ways judges fail, and the failures are not random noise. They are biases — systematic, signed errors:

Position bias. Show a judge two answers and it tends to prefer the one it saw first. In the study, even the best judge (GPT-4) gave a consistent verdict under a simple A/B order swap only about 65% of the time. The rest of the time, the position decided the winner, not the content.
Verbosity bias. Longer answers score higher, independent of quality. A judge mistakes thoroughness-shaped text for thoroughness.
Self-enhancement bias. A judge favors text that looks like its own. GPT-4 preferred its own answers at a roughly 10% higher win rate; Claude-v1 favored itself by about 25%. If you grade a model's output with the same model family, you have built a flattering mirror and called it a ruler.

An unvalidated judge gives you a number with a confidence interval nobody computed. It is not measurement. It is a vibe with decimal places.
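Two of these biases can be probed from verdicts you already have. The sketch below assumes you log each pairwise judgment with both answers and the winner; the `Verdict` record and `bias_probe` helper are illustrative names, not part of any library. If answers are assigned to positions at random, a first-position win rate far from 0.5 points at position bias. A high longer-answer win rate is only suggestive, since longer answers are sometimes genuinely better, so compare it with the same rate in your human labels.

```python
from dataclasses import dataclass


@dataclass
class Verdict:
    """One logged pairwise judgment. Field names are illustrative."""
    answer_a: str  # the answer shown first
    answer_b: str  # the answer shown second
    winner: str    # "A", "B", or "tie"


def bias_probe(verdicts):
    """Return (first_position_win_rate, longer_answer_win_rate) over decided verdicts."""
    decided = [v for v in verdicts if v.winner in ("A", "B")]
    if not decided:
        raise ValueError("no decided verdicts to analyse")

    first_wins = sum(v.winner == "A" for v in decided)

    # Only pairs with unequal lengths say anything about verbosity.
    unequal = [v for v in decided if len(v.answer_a) != len(v.answer_b)]
    longer_wins = sum(
        (v.winner == "A") == (len(v.answer_a) > len(v.answer_b)) for v in unequal
    )
    longer_rate = longer_wins / len(unequal) if unequal else float("nan")
    return first_wins / len(decided), longer_rate
```

Character count is a crude stand-in for length here; a token count from your own tokenizer would be a reasonable swap.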

The one step almost everyone skips: validate the validator

If you take one thing from this: calibrate the judge against humans before you trust it. The technique is embarrassingly cheap. Hand-grade 30 to 50 examples yourself. Run the judge on the same set. Measure how often they agree. That agreement rate is the trust you're allowed to place in the automated score — no more.
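As a minimal sketch of that calibration step, assuming you have two parallel lists of categorical labels (one from the human, one from the judge): the function below reports raw agreement, a Wilson 95% interval around it, and Cohen's kappa, which discounts the agreement that chance alone would produce. The name `agreement_report` and the label values are hypothetical.

```python
import math
from collections import Counter


def agreement_report(human, judge, z=1.96):
    """Raw agreement, its Wilson interval, and Cohen's kappa for two label lists."""
    if len(human) != len(judge) or not human:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(human)
    p = sum(h == j for h, j in zip(human, judge)) / n

    # Wilson score interval: better behaved than the normal approximation at small n.
    denom = 1 + z**2 / n
    centre = (p + z**2 / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom

    # Cohen's kappa: observed agreement corrected for chance agreement.
    h_counts, j_counts = Counter(human), Counter(judge)
    p_chance = sum(h_counts[c] * j_counts[c] for c in h_counts) / n**2
    kappa = (p - p_chance) / (1 - p_chance) if p_chance < 1 else float("nan")

    return {"agreement": p, "ci": (centre - half, centre + half), "kappa": kappa}
```

The interval is the part worth staring at. With 40 hand labels and 34 agreements, the point estimate is 85% but the Wilson interval runs from roughly 0.71 to 0.93, so a few dozen labels buy a sanity check, not a precise calibration.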

This isn't a hand-wave; it's the core of the research. G-Eval earned its keep precisely by reporting its correlation with human labels (Spearman 0.514 on summarization, well above prior metrics) rather than asserting it. The UIST paper "Who Validates the Validators?" builds a whole workflow, EvalGen, around the uncomfortable finding that you can't even write good criteria until you've looked at outputs: judge alignment is iterative, not a prompt you get right once. And the study "Justice or Prejudice?" enumerates twelve distinct judge biases, which is twelve more than most teams check for.
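If your judge emits numeric scores rather than labels, the analogous check is a rank correlation against human scores, the kind of figure G-Eval reports. Here is a dependency-free sketch of Spearman's rho (the Pearson correlation of the two rank vectors, with tied values sharing an average rank). The function names are my own; in practice `scipy.stats.spearmanr` computes the same statistic.

```python
import math


def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1  # sorted positions i..j are 0-based
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman(xs, ys):
    """Spearman's rho between two equal-length score sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length sequences with at least 2 items")
    rx, ry = average_ranks(xs), average_ranks(ys)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = math.sqrt(sum((a - mx) ** 2 for a in rx))
    sy = math.sqrt(sum((b - my) ** 2 for b in ry))
    if sx == 0 or sy == 0:
        raise ValueError("correlation is undefined when one side is constant")
    return cov / (sx * sy)
```

The constant-side error is worth keeping: a judge that hands out the same score to every example has no ranking to correlate, and that is a finding in itself.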

A judge you have not validated is not a measurement instrument. It's a model you are anthropomorphizing into a measurement instrument because the output has a number in it.

Prefer pairwise, but know what it costs

Two ways to ask the judge. Pointwise: "rate this answer 1–10." Pairwise: "here are two answers, which is better?" Reach for pairwise. Absolute scores drift across runs and cluster in the 7–9 band, so genuine differences disappear into rounding. Pairwise preferences track humans more closely and give you a stable ranking.

But pairwise has its own knife edge. A 2025 protocol study found pairwise preferences flipped in ~35% of cases when an irrelevant distractor feature was introduced, versus ~9% for pointwise. Pairwise is more accurate and more manipulable. Order-swapping will not remove distractor sensitivity, but it does cancel the one bias you can cancel mechanically, so you pay the position-bias tax deliberately: run every comparison twice with the answers swapped, and only count a win that holds in both orders. Anything that flips is a tie. It doubles your judge calls and it is not optional.
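The swap rule is small enough to sketch in full. This assumes a hypothetical `judge` callable that takes the question and two answers in display order and returns "first", "second", or "tie"; wire in whatever client you actually use. Counting a tie as half a win in `win_rate` is my assumption, not something the rule itself dictates.

```python
from typing import Callable

# Hypothetical signature: (question, first_answer, second_answer) -> "first" | "second" | "tie"
Judge = Callable[[str, str, str], str]


def debiased_compare(judge: Judge, question: str, answer_a: str, answer_b: str) -> str:
    """Return "A", "B", or "tie"; a win counts only if it survives the order swap."""
    forward = judge(question, answer_a, answer_b)   # A shown first
    backward = judge(question, answer_b, answer_a)  # B shown first
    if forward == "first" and backward == "second":
        return "A"
    if forward == "second" and backward == "first":
        return "B"
    return "tie"  # flipped verdicts, or a tie in either order


def win_rate(results: list[str]) -> float:
    """Share of comparisons won by A, counting each tie as half a win (an assumption)."""
    if not results:
        raise ValueError("no results")
    return (results.count("A") + 0.5 * results.count("tie")) / len(results)
```

A judge that always picks whatever it sees first produces nothing but ties under this rule, which is exactly the behavior you want from the metric when the judge is contributing no information.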

What this means for your pipeline

The tools are mature — DeepEval ships a G-Eval implementation, promptfoo has llm-rubric, OpenAI Evals and Braintrust's autoevals offer model-graded scorers, and the observability platforms wire judges into production traces. None of that saves you from the work, because the work isn't the plumbing. It's the rubric and the calibration.

So, the short version of a defensible LLM-as-judge eval: write a rubric specific enough that two humans would grade the same way; validate the judge against a few dozen human labels and report that agreement number next to every score; use pairwise comparison with order-swapping; and use a different model family to judge than the one you're grading, so you're not scoring your own handwriting. Do that and the judge becomes what the benchmark-theater crowd keeps pretending it already is — a measurement instead of a performance.

Skip it, and you'll keep shipping the number on the slide. It will keep going up. And you will have no idea whether anything got better.

Frequently asked

What is LLM-as-a-judge?

It's the practice of using a language model to score or compare the outputs of another model (or the same one), instead of a human rater or an exact-match metric. The judge is given the input, the output, and a rubric, and returns a score or a preferred answer. It scales evaluation cheaply but inherits all the biases of the model doing the judging.

Is pairwise comparison better than scoring each answer 1 to 10?

Usually, yes. Absolute (pointwise) scores drift over time and bunch up in the 7-to-9 range, so small real differences vanish. Pairwise comparison (show the judge two answers and ask which is better) tracks human preference more closely. The catch is that pairwise is more manipulable: a 2025 protocol study found preferences flipped in about 35% of cases when an irrelevant distractor feature was added, versus about 9% for pointwise scores. Always swap the answer order and only count a win that survives both orderings.

How do I know if my LLM judge is any good?

Validate it against humans before you trust it. Have a person grade 30 to 50 examples by hand, then measure how often the judge agrees with those labels. If agreement is near the human-to-human rate (the MT-Bench work put a strong judge above 80%), the judge is usable for that task; if it's low, fix the rubric or pick a different model. An unvalidated judge produces numbers that look like measurement but are just a second model's untested opinion.

Priya Sundaram
