A coding agent you're evaluating gets the test suite green. The final answer is correct, the eval passes, the score goes on a slide. What the score doesn't tell you is that the agent wrote the feature, broke an unrelated module, noticed nothing, then got lucky because the test for that module was already skipped. The trajectory was broken. The output was fine. Your judge looked only at the output.

This is the seam where LLM-as-a-judge starts to fail, quietly. For a single LLM call — summarize this, classify that — the output is the work, and grading the output is the right move. An agent is a process: a sequence of tool calls, intermediate artifacts, and decisions, any one of which can be wrong while the final answer still looks plausible. Grade only the last line and you've thrown away everything before it.

Output-grading gives you a sparse, late, misleading signal

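The reframing is small and changes everything: LLM-as-a-judge evaluates the output; Agent-as-a-judge evaluates the process. For a multi-step agent, output-only grading has three defects, and they compound: the signal is sparse (one score for an entire trajectory), late (it arrives only after the run has finished), and misleading (it cannot say which step went wrong, or whether the right answer came from the right process).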

A final-answer judge can't tell the agent that reasoned correctly from the one that guessed correctly. To that judge they are the same agent. They are not.

The inverse failure is just as common: the agent fails a perfectly evaluable subtask — never saved the checkpoint, never wrote the file it claimed to — but the summary reads plausibly and the output-judge waves it through. The checkable thing went unchecked.
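A minimal sketch of that gap, under stated assumptions: the `Step` record, the `write_file` tool name, and the keyword-matching `output_judge` are all hypothetical stand-ins, not any real framework's API. The point is only that a check over the trajectory can see a failed write that a check over the summary cannot.

```python
from dataclasses import dataclass, field


@dataclass
class Step:
    """One tool call in a hypothetical agent trajectory."""
    tool: str
    args: dict = field(default_factory=dict)
    ok: bool = True  # whether the tool call itself succeeded


def output_judge(summary: str) -> bool:
    """Stand-in for an output-only judge: it sees nothing but the final summary."""
    return "saved checkpoint" in summary.lower()


def process_check(trajectory: list[Step], required_path: str) -> bool:
    """Pass only if some successful step actually wrote the required file."""
    return any(
        s.tool == "write_file" and s.ok and s.args.get("path") == required_path
        for s in trajectory
    )


# Hypothetical run: training ran, but the checkpoint write failed.
trajectory = [
    Step("run_python", {"script": "train.py"}),
    Step("write_file", {"path": "model.ckpt"}, ok=False),
]
summary = "Trained the model and saved checkpoint to model.ckpt."

output_verdict = output_judge(summary)                   # fooled by the summary
process_verdict = process_check(trajectory, "model.ckpt")  # sees the failed write
print(output_verdict, process_verdict)
```

In a real setup the output judge would be a model call rather than a string match, but the blind spot is the same: the summary is the agent's claim, and the trajectory is the evidence.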

What Agent-as-a-Judge actually adds

The 2024 paper Agent-as-a-Judge: Evaluate Agents with Agents (Zhuge et al., Meta AI and KAUST) is the cleanest statement of the alternative — an "organic extension" of LLM-as-a-judge. Instead of one model scoring one answer, an agentic judge inspects the whole trajectory: reading intermediate files, reconstructing what happened, and checking it against a structured set of requirements.

To test it they built DevAI, a benchmark of 55 realistic AI-development tasks annotated with 365 hierarchical user requirements. The hierarchy is the point: each task isn't one pass/fail but a tree of checkable requirements — exactly the intermediate ground truth a process-judge needs. They ran three open-source agents (MetaGPT, GPT-Pilot, OpenHands) and judged them at the requirement level instead of the finish line.
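To make "a tree of checkable requirements" concrete, here is a small sketch. The `Requirement` class, the example requirement texts, and the rule that a requirement only counts when its prerequisites are also met are illustrative assumptions, not DevAI's actual schema or scoring code.

```python
from dataclasses import dataclass, field


@dataclass
class Requirement:
    rid: str
    text: str
    depends_on: list[str] = field(default_factory=list)  # prerequisite ids


def score_requirements(reqs: list[Requirement], checked: dict[str, bool]) -> dict[str, bool]:
    """Mark a requirement as met only if its own check passed AND every
    prerequisite is met. Assumes the dependency graph has no cycles."""
    by_id = {r.rid: r for r in reqs}
    met: dict[str, bool] = {}

    def is_met(rid: str) -> bool:
        if rid not in met:
            own = checked.get(rid, False)
            met[rid] = own and all(is_met(d) for d in by_id[rid].depends_on)
        return met[rid]

    for r in reqs:
        is_met(r.rid)
    return met


# Hypothetical task with four requirements.
reqs = [
    Requirement("R0", "Load the dataset"),
    Requirement("R1", "Train the model", depends_on=["R0"]),
    Requirement("R2", "Save a checkpoint", depends_on=["R1"]),
    Requirement("R3", "Write the evaluation report", depends_on=["R1"]),
]
# Per-requirement verdicts from a judge: training failed, later artifacts exist anyway.
checked = {"R0": True, "R1": False, "R2": True, "R3": True}

met = score_requirements(reqs, checked)
naive_score = sum(checked.values()) / len(checked)  # ignores dependencies
strict_score = sum(met.values()) / len(met)         # respects dependencies
print(naive_score, strict_score)
```

The two scores differ because a checkpoint and a report that exist without a successful training run are not evidence of good work. A single pass/fail on the final output would hide that distinction entirely.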

The discrimination improved: Agent-as-a-Judge agreed with human expert evaluators about 90% of the time, against roughly 70% for LLM-as-a-judge on the same tasks, per the paper's reporting. Twenty points is the difference between an eval you trust and one you argue with.
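For readers who want to run the same comparison on their own data, agreement here can be computed as the fraction of requirement-level verdicts on which a judge matches the human label. The sketch below uses made-up toy labels, chosen only to mirror the rounded 90% and 70% figures; they are not data from the paper, and `agreement_rate` is a hypothetical helper name.

```python
def agreement_rate(judge: list[bool], human: list[bool]) -> float:
    """Fraction of verdicts where the judge matches the human label."""
    if not judge or len(judge) != len(human):
        raise ValueError("need two non-empty label lists of equal length")
    return sum(j == h for j, h in zip(judge, human)) / len(judge)


# Toy labels for ten requirements (illustrative only).
human       = [True, True, False, True, False, False, True, True, False, True]
agent_judge = [True, True, False, True, False, False, True, True, False, False]  # 1 mismatch
llm_judge   = [True, True, True,  True, False, True,  True, True, False, False]  # 3 mismatches

agent_rate = agreement_rate(agent_judge, human)
llm_rate = agreement_rate(llm_judge, human)
print(agent_rate, llm_rate)
```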

It was also cheap relative to the thing it replaces. Human evaluation of DevAI took three evaluators a self-reported ~86.5 hours, costing about $1,297 at a $15/hour expert wage; Agent-as-a-Judge did it in about two hours for $30.58 in API calls — roughly a 97% cut in both. But read that for what it is: Agent-as-a-Judge versus humans, not versus LLM-as-a-judge. Against an LLM judge, the agentic judge is the more expensive option.
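The arithmetic behind those figures is easy to check. The sketch below uses only the numbers quoted above and treats "about two hours" as exactly 2.0, which is an approximation.

```python
human_hours = 86.5       # self-reported total across three evaluators
wage_per_hour = 15.0     # expert wage used in the comparison
human_cost = human_hours * wage_per_hour

judge_hours = 2.0        # "about two hours" (approximation)
judge_cost = 30.58       # API spend in dollars

time_cut = 1 - judge_hours / human_hours
cost_cut = 1 - judge_cost / human_cost

print(f"human cost: ${human_cost:.2f}")
print(f"time saved: {time_cut:.1%}, cost saved: {cost_cut:.1%}")
```

Both reductions land between 97% and 98%, consistent with the rounded figure in the text.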


The catch nobody puts on the slide

Step-level evaluation is a better signal. It's also a bigger bill and a second system to maintain.

You need intermediate ground truth. DevAI's 365 requirements didn't write themselves — they're hand-authored annotations. A dense per-step signal needs a rubric or label for each step you grade, and authoring those is the real cost: output-grading needs one reference answer, process-grading needs a structured map of the whole task.

The judge agent can be wrong. Everything that makes LLM judges unreliable (position bias, verbosity bias, self-enhancement, brittleness to prompt phrasing, all catalogued in the paper Justice or Prejudice? Quantifying Biases in LLM-as-a-Judge) doesn't vanish when the judge becomes an agent. It multiplies across every step, and a 90% agreement rate is also a 10% disagreement rate. You still validate the judge against humans, now with more surface to cover.
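A back-of-the-envelope illustration of how error accumulates across steps, under two loud assumptions: that each verdict is right independently with the same probability, and that the 90% aggregate agreement rate can be read as a per-verdict accuracy. Neither is established by the paper. The step count of 7 is simply 365 requirements divided by 55 tasks, rounded up.

```python
def p_all_correct(per_verdict_accuracy: float, n_verdicts: int) -> float:
    """Probability that every one of n independent verdicts is correct."""
    return per_verdict_accuracy ** n_verdicts


avg_requirements = round(365 / 55)         # about 6.6 per task, rounded to 7
clean_task = p_all_correct(0.9, avg_requirements)
print(f"{clean_task:.2f}")                 # chance a task is judged with no errors
```

Under these assumptions, fewer than half of tasks would be judged without a single wrong verdict. Real errors are likely correlated, so treat this as a reason to audit per-requirement verdicts, not as a prediction.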

You've added a second agent to debug. The judge is now an agentic system with its own tool calls and failure modes. When it disagrees with you, "is the agent wrong or is the judge wrong?" gets genuinely hard — because both are agents.

So the tradeoff is not "better eval, full stop." It's signal density versus evaluation cost and complexity — a dense, located, early signal that doubles as a reward signal for self-improvement, paid for in annotation labor, judge cost, and a second system to keep honest.

The decision rule

Start with LLM-as-a-judge on outputs. It's cheaper, it's one system, and for many agent tasks the final answer carries enough signal. Build the rubric, validate the judge against a few dozen human labels, and ship it.

Graduate to trajectory evaluation when output-grading stops discriminating — when good and bad agents post the same final-answer score, when you can't tell the lucky run from the correct one, or when you need to know which step broke so you can fix it or reward it. That's when the sparse signal has run out of resolution and step-level ground truth starts to pay for itself.
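One way to turn "output-grading stops discriminating" into something measurable is to have a human read a small sample of trajectories and count how often an output-level pass hides a broken process. The sketch below assumes such an audited sample exists; `AuditedRun`, `lucky_pass_rate`, and the 20% threshold are hypothetical choices, not a recommendation from the paper.

```python
from dataclasses import dataclass


@dataclass
class AuditedRun:
    output_pass: bool  # verdict from the output-only judge
    process_ok: bool   # verdict from a human who read the trajectory


def lucky_pass_rate(runs: list[AuditedRun]) -> float:
    """Among runs the output judge passed, the fraction whose process a human rejected."""
    passed = [r for r in runs if r.output_pass]
    if not passed:
        return 0.0
    return sum(not r.process_ok for r in passed) / len(passed)


def should_graduate(runs: list[AuditedRun], threshold: float = 0.2) -> bool:
    """Suggest trajectory evaluation once lucky passes reach the threshold."""
    return lucky_pass_rate(runs) >= threshold


# Toy audit: 8 output passes, 2 of them with a broken process, plus 2 output failures.
audit = (
    [AuditedRun(True, True)] * 6
    + [AuditedRun(True, False)] * 2
    + [AuditedRun(False, False)] * 2
)
rate = lucky_pass_rate(audit)
decision = should_graduate(audit)
print(rate, decision)
```

The threshold is a judgment call that depends on how costly an undetected bad trajectory is for your use case.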

Until that moment, a trajectory judge is an expensive way to learn what a cheaper one already told you. After it, it's the only thing that tells good process from a good guess — and for an agent, that difference is the whole game.