How to Evaluate an AI Agent's Tool Use, Not Just Its Answer

The first eval people write for an agent is almost always wrong, and it's wrong in an instructive way. You build an agent that books a flight. You record the run you consider perfect: it called search_flights, then get_seat_map, then book. You save that as the golden trajectory and grade future runs by exact match. It feels rigorous. Then a perfectly good run calls get_seat_map before search_flights because the user already named the flight, and your eval marks it failed.

That's the core mistake, and it comes from importing a habit that doesn't transfer. Evaluating an LLM means grading one output: was the answer correct, grounded, safe. Evaluating an agent means grading two things — the outcome and the trajectory, the sequence of tool calls it made along the way. Both Google's Vertex AI eval service and LangSmith split agent metrics into exactly these two categories, because the path is now part of what can be right or wrong. The flight got booked (outcome fine) but maybe the agent called a paid search API eleven times to do it (trajectory bad). You cannot see that failure in the final answer.

Why the golden trajectory is the wrong ruler

The reason exact-match fails isn't a tuning problem you can fix with a better golden path. It's structural: most tasks have many correct trajectories. Tools that don't depend on each other can run in any order. Some runs need a clarifying lookup that others can skip because the context already had the answer. A retry after a transient 500 is not a defect. The space of valid paths is large and the space of paths you thought to record is one. Grade against that one, and you spend your week explaining why your "failing" agent is actually fine.
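To make the failure concrete, here is a minimal sketch of exact-match grading applied to three runs that all book the right flight. The tool names come from the flight example above; the three recorded runs are invented for illustration.

```python
# Golden trajectory from the flight-booking example; the runs below are made up.
GOLDEN = ["search_flights", "get_seat_map", "book"]


def exact_match(actual: list[str], golden: list[str]) -> bool:
    """Strict grading: same tools, same order, nothing extra."""
    return actual == golden


# Three runs that all end with the correct booking.
valid_runs = {
    "as_recorded": ["search_flights", "get_seat_map", "book"],
    "flight_already_named": ["get_seat_map", "search_flights", "book"],
    "retry_after_500": ["search_flights", "search_flights", "get_seat_map", "book"],
}

verdicts = {name: exact_match(run, GOLDEN) for name, run in valid_runs.items()}
for name, passed in verdicts.items():
    print(f"{name}: {'pass' if passed else 'FAIL'}")
```

Only the run that happens to retrace the recording passes. The other two fail because of their order or their retry, not because of anything the user would notice.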

The benchmarks the field actually trusts have moved away from path-matching. τ-bench, Sierra's tool-agent-user benchmark, is the cleanest example: its reward function compares the final database state to an annotated goal state and deliberately ignores how the agent got there. Did the customer's order end up cancelled and refunded? Pass. The route is the agent's business. (τ-bench also contributes the metric worth stealing for reliability: pass^k, the probability that all k independent attempts at the same task succeed, because an agent that books the flight 7 times out of 8 is not one you can put in production.)
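A small sketch of how pass^k can be estimated from recorded trials. It assumes you ran each task n times and counted c successes, and it uses the combinatorial estimator C(c, k) / C(n, k), averaged over tasks. Attributing that exact formula to τ-bench is an assumption of this sketch, and the function names are hypothetical.

```python
from math import comb


def pass_hat_k(successes: int, trials: int, k: int) -> float:
    """Chance that k attempts, drawn without replacement from the recorded
    trials, all succeeded: C(successes, k) / C(trials, k)."""
    if not 0 <= successes <= trials:
        raise ValueError("successes must be between 0 and trials")
    if not 1 <= k <= trials:
        raise ValueError("k must be between 1 and trials")
    # math.comb returns 0 when k > successes, so the estimate drops to 0.0.
    return comb(successes, k) / comb(trials, k)


def mean_pass_hat_k(per_task: list[tuple[int, int]], k: int) -> float:
    """Average the per-task estimate over (successes, trials) pairs."""
    return sum(pass_hat_k(c, n, k) for c, n in per_task) / len(per_task)


# The agent that books the flight 7 times out of 8.
for k in (1, 2, 4, 8):
    print(f"pass^{k} = {pass_hat_k(7, 8, k):.3f}")
```

For the 7-out-of-8 agent the estimate falls from 0.875 at k=1 to 0.5 at k=4 and 0.0 at k=8, which is the sense in which a respectable single-shot success rate can hide an agent you cannot rely on.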

Grade invariants, then the outcome

So if not the exact path, what? The move that scales is to stop asserting which path and start asserting what must be true of any path. Invariants — properties that hold regardless of the route the agent chose:

It never called a destructive tool without confirmation. delete_account, issue_refund, send_email should never appear before the approval step. Until approval arrives, this is a subset check: assert the agent stayed within the allowed set of non-destructive tools.
It never put a secret or PII into a tool argument. Scan the arguments, not just the final text, for the API key or the customer's SSN leaking into an outbound call.
Every tool call had schema-valid arguments. Not "did it pick the right tool" but "did it call the tool correctly" — the parameter-extraction failure mode that Arize Phoenix's function-call eval checks directly.
It converged within N steps and didn't loop, thrash, or call a tool the task never needed.
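The four invariants above can be sketched as plain yes/no functions over a recorded trajectory. Everything named here is an assumption for illustration: the ToolCall record, the request_approval tool, the secret patterns, the dict-of-types stand-in for a real schema validator, and the step and repeat limits. Checking for tools the task never needed is left to a subset-style matcher.

```python
import re
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class ToolCall:
    name: str
    args: dict = field(default_factory=dict)


# Assumed names, patterns, and schemas; replace with your own.
DESTRUCTIVE = {"delete_account", "issue_refund", "send_email"}
APPROVAL_TOOL = "request_approval"
SECRET_PATTERNS = [
    re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),    # shape of a US SSN
    re.compile(r"\bsk-[A-Za-z0-9]{16,}\b"),  # assumed API-key shape
]
SCHEMAS = {
    "lookup_order": {"order_id": str},
    "request_approval": {"action": str},
    "issue_refund": {"order_id": str, "amount": float},
}


def no_destructive_before_approval(calls: list[ToolCall]) -> bool:
    approved = False
    for call in calls:
        if call.name == APPROVAL_TOOL:
            approved = True
        elif call.name in DESTRUCTIVE and not approved:
            return False
    return True


def no_secrets_in_args(calls: list[ToolCall]) -> bool:
    # Scan the arguments of every call, not the final text.
    return not any(
        pattern.search(repr(call.args))
        for call in calls
        for pattern in SECRET_PATTERNS
    )


def args_match_schema(calls: list[ToolCall]) -> bool:
    for call in calls:
        schema = SCHEMAS.get(call.name)
        if schema is None or set(call.args) != set(schema):
            return False
        if not all(isinstance(call.args[k], t) for k, t in schema.items()):
            return False
    return True


def converged(calls: list[ToolCall], max_steps: int = 8, max_repeats: int = 2) -> bool:
    # One identical retry is tolerated; more than max_repeats counts as a loop.
    if len(calls) > max_steps:
        return False
    repeats = Counter((c.name, repr(sorted(c.args.items()))) for c in calls)
    return all(n <= max_repeats for n in repeats.values())


def check_invariants(calls: list[ToolCall]) -> dict[str, bool]:
    return {
        "no_destructive_before_approval": no_destructive_before_approval(calls),
        "no_secrets_in_args": no_secrets_in_args(calls),
        "args_match_schema": args_match_schema(calls),
        "converged": converged(calls),
    }


safe_run = [
    ToolCall("lookup_order", {"order_id": "A1"}),
    ToolCall("request_approval", {"action": "issue_refund"}),
    ToolCall("issue_refund", {"order_id": "A1", "amount": 20.0}),
]
reckless_run = [ToolCall("issue_refund", {"order_id": "123-45-6789", "amount": 20.0})]

print(check_invariants(safe_run))
print(check_invariants(reckless_run))
```

The reckless run is schema-valid and short, so two checks pass, but it refunds without approval and carries an SSN-shaped string in an argument. Reporting each invariant by name keeps the failure readable.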

These are assertions, not similarity scores, and that's the point — each is a yes/no that survives any reordering of the valid path. Then, separately, you check the outcome: did the final state come out right? Invariants catch the dangerous and wasteful paths; the outcome check catches the wrong destination. Together they grade the agent the way you'd grade a junior employee — did you get it done, and did you do anything reckless on the way — rather than did you retrace my exact steps.
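Putting the two halves together might look like the sketch below: a run passes only if every invariant holds and the final state matches the annotated goal. The grade_run helper, the state keys, the flight number, and the two example invariants are all hypothetical.

```python
from typing import Callable

Trajectory = list[str]
Invariant = Callable[[Trajectory], bool]


def grade_run(
    trajectory: Trajectory,
    final_state: dict,
    goal_state: dict,
    invariants: dict[str, Invariant],
) -> dict:
    """Pass only when no invariant is violated and the outcome is right."""
    violations = [name for name, check in invariants.items() if not check(trajectory)]
    # Compare only the keys the goal annotates; extra state is ignored.
    outcome_ok = all(final_state.get(key) == value for key, value in goal_state.items())
    return {
        "passed": outcome_ok and not violations,
        "outcome_ok": outcome_ok,
        "violations": violations,
    }


# Assumed invariants and goal for the flight-booking example.
INVARIANTS = {
    "search_budget": lambda t: t.count("search_flights") <= 3,
    "books_at_most_once": lambda t: t.count("book") <= 1,
}
GOAL = {"booking_status": "confirmed", "flight": "UA 100"}
BOOKED = {"booking_status": "confirmed", "flight": "UA 100", "seat": "12A"}

reordered = grade_run(["get_seat_map", "search_flights", "book"], BOOKED, GOAL, INVARIANTS)
wasteful = grade_run(["search_flights"] * 11 + ["book"], BOOKED, GOAL, INVARIANTS)
wrong_destination = grade_run(
    ["search_flights", "get_seat_map", "book"],
    {"booking_status": "confirmed", "flight": "UA 200"},
    GOAL,
    INVARIANTS,
)

for label, result in [
    ("reordered", reordered),
    ("wasteful", wasteful),
    ("wrong_destination", wrong_destination),
]:
    print(label, result)
```

The reordered run passes, the eleven-search run fails on its trajectory despite a correct booking, and the tidy run that booked the wrong flight fails on its outcome. Keeping the two verdicts separate in the result shows which kind of failure occurred.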

A correct agent that took a different valid path is not a bug in the agent. It's a bug in your eval. Assert what must be true on every path, and assert the final state; never assert the one path you happened to record.

Where the tools land

The good news is that the major eval stacks already speak this language, if you choose the loose end of their dials. LangSmith's agentevals ships four match modes (strict, unordered, superset, subset), and the last two are exactly the invariant-style assertions above: superset checks that the key tools were called, subset that no tools beyond the expected set were. DeepEval's ToolCorrectnessMetric and TaskCompletionMetric separate "right tools" from "task done." Google's ADK exposes tool_trajectory_avg_score with EXACT / IN_ORDER / ANY_ORDER matching, so you can loosen it from strict. And for tool-use capability in the abstract (picking and calling functions correctly across multi-turn, stateful tasks) the Berkeley Function-Calling Leaderboard is the standard reference, with its v3 state-tracking and v4 agentic categories.
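To see what those dials mean, here is a hand-rolled sketch of the match modes over tool names only. It is not the API of agentevals, DeepEval, or ADK, and the exact semantics (sets for superset and subset, multisets for unordered, subsequence for in-order) are assumptions of this sketch rather than a description of any library.

```python
from collections import Counter


def trajectory_match(actual: list[str], reference: list[str], mode: str) -> bool:
    """Compare tool-name sequences under one of five assumed match modes."""
    if mode == "strict":
        return actual == reference
    if mode == "in_order":
        # Reference must appear as a subsequence of actual; extras allowed.
        remaining = iter(actual)
        return all(tool in remaining for tool in reference)
    if mode == "unordered":
        return Counter(actual) == Counter(reference)
    if mode == "superset":
        # Every expected tool was called at least once.
        return set(actual) >= set(reference)
    if mode == "subset":
        # No tool was called beyond the expected set.
        return set(actual) <= set(reference)
    raise ValueError(f"unknown mode: {mode}")


MODES = ["strict", "in_order", "unordered", "superset", "subset"]
REFERENCE = ["search_flights", "get_seat_map", "book"]

# get_weather is a hypothetical tool the task never needed.
runs = {
    "reordered": ["get_seat_map", "search_flights", "book"],
    "with_extra_tool": ["search_flights", "get_weather", "get_seat_map", "book"],
}

results = {
    name: {mode: trajectory_match(run, REFERENCE, mode) for mode in MODES}
    for name, run in runs.items()
}
for name, by_mode in results.items():
    print(name, by_mode)
```

The reordered run fails only the two order-sensitive modes, while the run with an unneeded tool passes in-order and superset but fails subset. That asymmetry is why superset suits "did the key steps happen" and subset suits safety checks.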

There's a deeper current worth naming: outcome-only grading has its own critics, who point out that a pure end-state reward gives you almost no signal about why a long run failed (the credit-assignment problem that motivates process-reward models for agents). They're right that outcome-only is thin for training. But for evaluation — deciding whether to ship — invariants plus outcome is the honest combination: it refuses to punish the agent for creativity it's allowed to have, while refusing to let it leak a secret or burn your API budget on the way to a correct-looking answer.

If you've already built a trustworthy LLM-as-judge for your outputs, point part of it at the trajectory too — but give it the invariants as its rubric, not a golden path to mimic. The same discipline that makes a RAG eval predict quality applies here: measure the property you actually care about, not the artifact that's easiest to diff.

Frequently asked

How is evaluating an AI agent different from evaluating an LLM?

An LLM eval grades a single output — was the answer correct, grounded, safe. An agent eval also grades the trajectory: the sequence of tool calls, arguments, and intermediate steps the agent took to get there. Google's Vertex AI and LangSmith both split agent metrics into final-response evaluation and trajectory evaluation for this reason.

What is trajectory evaluation?

Scoring the path an agent took, not just its final answer — which tools it called, in what order, with what arguments. Frameworks offer match modes from strict (same calls, same order) to unordered and superset/subset, plus LLM-as-judge over the whole trace.

Should I grade an agent against a golden trajectory?

Usually not, by exact match. Most tasks have several equally correct tool-call sequences, so strict matching fails correct agents that chose a different valid path. Prefer loose matchers, invariant assertions that must hold on any path, and an outcome check on the final state (the approach τ-bench takes by grading the final database state).

Match strategy	What it checks	When it's right
Strict / exact	Same tool calls, same order, same args	Rigid, single-path workflows only
In-order (superset)	Expected tools appear in order; extras allowed	Pipelines with a required spine
Any-order	Expected tools all called, order ignored	Independent subtasks, parallel tools
Subset	No tools called beyond the expected set	Safety: catching unnecessary/risky calls
Invariants + outcome	Properties hold on every path; final state correct	The default for open-ended agents


Priya Sundaram
