The first eval people write for an agent is almost always wrong, and it's wrong in an instructive way. You build an agent that books a flight. You record the run you consider perfect: it called search_flights, then get_seat_map, then book. You save that as the golden trajectory and grade future runs by exact match. It feels rigorous. Then a perfectly good run goes straight to get_seat_map and skips search_flights because the user already named the flight, and your eval marks it failed.
That's the core mistake, and it comes from importing a habit that doesn't transfer. Evaluating an LLM means grading one output: was the answer correct, grounded, safe. Evaluating an agent means grading two things — the outcome and the trajectory, the sequence of tool calls it made along the way. Both Google's Vertex AI eval service and LangSmith split agent metrics into exactly these two categories, because the path is now part of what can be right or wrong. The flight got booked (outcome fine) but maybe the agent called a paid search API eleven times to do it (trajectory bad). You cannot see that failure in the final answer.
Why the golden trajectory is the wrong ruler
The reason exact-match fails isn't a tuning problem you can fix with a better golden path. It's structural: most tasks have many correct trajectories. Tools that don't depend on each other can run in any order. Some runs need a clarifying lookup that others can skip because the context already had the answer. A retry after a transient 500 is not a defect. The space of valid paths is large and the space of paths you thought to record is one. Grade against that one, and you spend your week explaining why your "failing" agent is actually fine.
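To make the mismatch concrete, here is a minimal sketch of an exact-match grader and three runs that all end in a booked flight. The tool names come from the flight example above; the `exact_match` helper and the run lists are hypothetical, made up purely for illustration.

```python
# Tool names are taken from the flight-booking example; everything else
# (the helper, the sample runs) is a hypothetical illustration.
GOLDEN = ["search_flights", "get_seat_map", "book"]


def exact_match(run: list[str], golden: list[str] = GOLDEN) -> bool:
    """The brittle grader: pass only if the run retraces the recorded path."""
    return run == golden


# Three runs that all end with the flight booked.
recorded = ["search_flights", "get_seat_map", "book"]
skipped_lookup = ["get_seat_map", "book"]  # context already named the flight
retried = ["search_flights", "search_flights", "get_seat_map", "book"]  # retry after a transient 500

results = {
    "recorded": exact_match(recorded),
    "skipped_lookup": exact_match(skipped_lookup),
    "retried": exact_match(retried),
}
print(results)
```

Only the recorded run passes. The other two are marked failed even though nothing about them is wrong, which is the structural problem rather than a bad choice of golden path.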
The benchmarks the field actually trusts have already moved away from path-matching. τ-bench, Sierra's tool-agent-user benchmark, is the cleanest example: its reward function compares the final database state to an annotated goal state and deliberately ignores how the agent got there. Did the customer's order end up cancelled and refunded? Pass. The route is the agent's business. (τ-bench also contributes the metric worth stealing for reliability: pass^k, the probability that all k independent attempts at the same task succeed. An agent that books the flight 7 times out of 8 is not one you can put in production.)
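Both ideas fit in a few lines. The sketch below is not τ-bench's code: `outcome_passed` is a simplified, hypothetical state comparison that only checks the fields named in the goal, and `pass_hat_k` estimates pass^k for a single task as the fraction of k-sized subsets of the n recorded trials in which every trial succeeded, C(c, k) / C(n, k). The field names are assumptions for illustration.

```python
from math import comb


def outcome_passed(final_state: dict, goal_state: dict) -> bool:
    """Pass if every field named in the goal has the expected final value.

    Simplification: fields that the goal does not mention are ignored.
    """
    return all(final_state.get(key) == want for key, want in goal_state.items())


def pass_hat_k(n: int, c: int, k: int) -> float:
    """Estimate pass^k for one task from n trials, c of which succeeded.

    Counts the share of k-subsets of the trials in which all k succeeded.
    """
    if not (0 <= c <= n and 1 <= k <= n):
        raise ValueError("need 0 <= c <= n and 1 <= k <= n")
    return comb(c, k) / comb(n, k)


# Hypothetical goal: the order is cancelled and refunded, by whatever route.
goal = {"order_status": "cancelled", "refund_issued": True}
final = {"order_status": "cancelled", "refund_issued": True, "notes": "via chat"}
print(outcome_passed(final, goal))  # True

# The "7 out of 8" agent from the text.
print(pass_hat_k(n=8, c=7, k=1))  # 0.875
print(pass_hat_k(n=8, c=7, k=4))  # 0.5
print(pass_hat_k(n=8, c=7, k=8))  # 0.0
```

The same agent that looks like 87.5% on a single attempt drops to a coin flip when you require four clean runs in a row, which is the number that matters if the agent serves the same request type all day.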
Grade invariants, then the outcome
So if not the exact path, what? The move that scales is to stop asserting which path and start asserting what must be true of any path. Invariants — properties that hold regardless of the route the agent chose:
- It never called the destructive tool without confirmation. delete_account, issue_refund, send_email should never appear before the approval step. This is a subset check: assert the agent stayed within the allowed set.
- It never put a secret or PII into a tool argument. Scan the arguments, not just the final text, for the API key or the customer's SSN leaking into an outbound call.
- Every tool call had schema-valid arguments. Not "did it pick the right tool" but "did it call the tool correctly" — the parameter-extraction failure mode that Arize Phoenix's function-call eval checks directly.
- It converged within N steps and didn't loop, thrash, or call a tool the task never needed.
These are assertions, not similarity scores, and that's the point — each is a yes/no that survives any reordering of the valid path. Then, separately, you check the outcome: did the final state come out right? Invariants catch the dangerous and wasteful paths; the outcome check catches the wrong destination. Together they grade the agent the way you'd grade a junior employee — did you get it done, and did you do anything reckless on the way — rather than did you retrace my exact steps.
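The four invariants above can be sketched as one checker that returns a list of violations, where an empty list means the path was acceptable. Everything beyond the three destructive tool names is an assumption for illustration: the `ToolCall` shape, the `request_approval` tool, the secret patterns, the required-argument table standing in for a real schema validator, and the step and repeat limits.

```python
import json
import re
from collections import Counter
from dataclasses import dataclass


@dataclass
class ToolCall:
    name: str
    args: dict


DESTRUCTIVE = {"delete_account", "issue_refund", "send_email"}
APPROVAL_TOOL = "request_approval"                 # hypothetical approval step
SECRET_PATTERNS = [
    re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),          # US SSN shape
    re.compile(r"sk-[A-Za-z0-9]{16,}"),            # hypothetical API-key shape
]
REQUIRED_ARGS = {"book": {"flight_id", "seat"}}    # stand-in for a real schema
MAX_STEPS = 10                                     # assumed step budget
MAX_IDENTICAL_CALLS = 3                            # assumed loop threshold


def check_invariants(trajectory: list[ToolCall]) -> list[str]:
    """Return human-readable violations; an empty list means all invariants held."""
    violations: list[str] = []
    approved = False
    seen: Counter = Counter()
    for i, call in enumerate(trajectory):
        blob = json.dumps(call.args, sort_keys=True, default=str)
        if call.name == APPROVAL_TOOL:
            approved = True
        # 1. No destructive tool before approval.
        if call.name in DESTRUCTIVE and not approved:
            violations.append(f"step {i}: {call.name} called before approval")
        # 2. No secret or PII in the arguments.
        if any(p.search(blob) for p in SECRET_PATTERNS):
            violations.append(f"step {i}: secret or PII in arguments of {call.name}")
        # 3. Arguments carry every required field.
        missing = REQUIRED_ARGS.get(call.name, set()) - set(call.args)
        if missing:
            violations.append(f"step {i}: {call.name} missing {sorted(missing)}")
        seen[(call.name, blob)] += 1
    # 4. Converged within budget and did not loop.
    if len(trajectory) > MAX_STEPS:
        violations.append(f"took {len(trajectory)} steps, budget is {MAX_STEPS}")
    for (name, _), count in seen.items():
        if count > MAX_IDENTICAL_CALLS:
            violations.append(f"{name} repeated {count} times with identical arguments")
    return violations


reordered_but_fine = [
    ToolCall("get_seat_map", {"flight_id": "UA100"}),
    ToolCall("request_approval", {}),
    ToolCall("send_email", {"to": "traveler@example.com"}),
    ToolCall("book", {"flight_id": "UA100", "seat": "12A"}),
]
reckless = [
    ToolCall("issue_refund", {"note": "SSN 123-45-6789"}),
    ToolCall("book", {"flight_id": "UA100"}),
]
print(check_invariants(reordered_but_fine))  # []
print(len(check_invariants(reckless)))       # 3
```

Note that nothing here mentions the order of search_flights and get_seat_map. The checker has no opinion on the route, only on what must never happen along it.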
A correct agent that took a different valid path is not a bug in the agent. It's a bug in your eval. Assert what must be true on every path, then check the final state. Never assert the one path you happened to record.
Where the tools land
The good news is that the major eval stacks already speak this language, if you choose the loose end of their dials. LangSmith's agentevals ships four match modes — strict, unordered, superset, subset — and the last two are exactly the invariant-style assertions above (key tools were called; no tools beyond the expected set). DeepEval's ToolCorrectnessMetric and TaskCompletionMetric separate "right tools" from "task done." Google's ADK exposes tool_trajectory_avg_score with EXACT / IN_ORDER / ANY_ORDER matching, so you can dial off strict. And for tool-use capability in the abstract — picking and calling functions correctly across multi-turn, stateful tasks — the Berkeley Function-Calling Leaderboard is the standard reference, with its v3 state-tracking and v4 agentic categories.
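To see what the four match modes mean in practice, here is a plain-Python sketch of their semantics over tool names only. This is not the agentevals implementation and does not use its API; real libraries may also compare arguments or whole messages, and the `trajectory_matches` helper is hypothetical.

```python
from collections import Counter


def trajectory_matches(actual: list[str], reference: list[str], mode: str) -> bool:
    """Compare tool-name sequences under one of four match modes (names only)."""
    if mode == "strict":
        # Same tools, same order, same count.
        return actual == reference
    if mode == "unordered":
        # Same tools and counts, any order.
        return Counter(actual) == Counter(reference)
    if mode == "superset":
        # Every reference tool was called; extras are allowed.
        return set(actual) >= set(reference)
    if mode == "subset":
        # Nothing outside the reference set was called; omissions are allowed.
        return set(actual) <= set(reference)
    raise ValueError(f"unknown mode: {mode}")


reference = ["search_flights", "get_seat_map", "book"]
reordered = ["get_seat_map", "search_flights", "book"]
with_retry = ["search_flights", "search_flights", "get_seat_map", "book"]
with_refund = ["get_seat_map", "book", "issue_refund"]

print(trajectory_matches(reordered, reference, "strict"))      # False
print(trajectory_matches(reordered, reference, "unordered"))   # True
print(trajectory_matches(with_retry, reference, "superset"))   # True
print(trajectory_matches(with_refund, reference, "subset"))    # False
```

Read this way, the reference list stops being a golden path. Under superset it is a list of required tools, and under subset it is an allowlist.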
There's a deeper current worth naming: outcome-only grading has its own critics, who point out that a pure end-state reward gives you almost no signal about why a long run failed (the credit-assignment problem that motivates process-reward models for agents). They're right that outcome-only is thin for training. But for evaluation — deciding whether to ship — invariants plus outcome is the honest combination: it refuses to punish the agent for creativity it's allowed to have, while refusing to let it leak a secret or burn your API budget on the way to a correct-looking answer.
If you've already built a trustworthy LLM-as-judge for your outputs, point part of it at the trajectory too — but give it the invariants as its rubric, not a golden path to mimic. The same discipline that makes a RAG eval predict quality applies here: measure the property you actually care about, not the artifact that's easiest to diff.



