You wire up a supervisor, three sub-agents, a retriever, and a handful of tools. You point your eval suite at it — the same suite that scored your single agent — and it prints a number: 71% task success. The number goes up when you tune a prompt, down when you don't. It feels like evaluation. It is not. It is a smoke alarm with one light, and the light is on.
The problem isn't the score's value. It's the score's resolution. A multi-agent system doesn't fail the way a single model fails. It fails at the seams.
The failures live on the boundaries
The whole premise of a multi-agent system is that you've split one hard job across specialists that pass work between them. A router decides who handles a request. A sub-agent does a piece and hands the result back. A tool returns a payload the next agent has to interpret. Each of those transfers is a place where state can be dropped, mangled, or misrouted — and, empirically, that's where the failures cluster. A wrong retrieval quietly changes the plan. A bad tool argument corrupts a result three steps downstream. A sub-agent handoff loses the context that made the request answerable.
None of this shows up in the final answer as anything but "wrong." And plenty of it doesn't show up as wrong at all: a run can pass end-to-end while a handoff drops half its context, because the downstream agent papered over the gap with a plausible guess that happened to land. That's the worst case: a green run built on a broken seam that a slightly different input will expose in production, after your eval suite has signed off.
An end-to-end score sees the wreck but not the collision. It tells you that something broke and refuses to say which.
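The silent case described above, a passing run sitting on a lossy handoff, can be caught with a check at the boundary itself. The following is a minimal sketch, not a reference implementation: the `Handoff` record shape, the `required` key set, and the agent names are all hypothetical, and it assumes your tracing captures the context on both sides of each transfer.

```python
from dataclasses import dataclass


@dataclass
class Handoff:
    """One transfer of state between two agents (hypothetical record shape)."""
    sender: str
    receiver: str
    sent: dict       # context the sender held at the boundary
    received: dict   # context the receiver actually got


def dropped_keys(handoff: Handoff, required: set[str]) -> set[str]:
    """Return required context keys that went missing or changed in transit."""
    return {
        key for key in required
        if key in handoff.sent and handoff.received.get(key) != handoff.sent[key]
    }


def audit_run(final_answer_ok: bool, handoffs: list[Handoff], required: set[str]) -> str:
    """Combine the end-to-end verdict with a seam check on every handoff."""
    broken = [h for h in handoffs if dropped_keys(h, required)]
    if final_answer_ok and broken:
        return "pass-on-broken-seam"  # green run that deserves a second look
    return "pass" if final_answer_ok else "fail"


# Illustrative data: the answer was judged correct, but the date range never arrived.
handoff = Handoff(
    sender="supervisor",
    receiver="research_agent",
    sent={"user_id": "u1", "date_range": "2024-Q1", "question": "revenue?"},
    received={"question": "revenue?"},
)
verdict = audit_run(True, [handoff], required={"date_range", "question"})
print(verdict)  # pass-on-broken-seam
```

Exact-equality comparison is the crudest possible check; a handoff that legitimately summarizes context would need a softer comparison, which is a design choice this sketch leaves open.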
Evaluate at three levels, not one
The fix is to stop treating the system as a single black box and score it at three resolutions (Confident AI frames these cleanly):
- End-to-end. Did the system solve the task, judged against a golden answer or a rubric? This is your regression gate. It's necessary and it is not sufficient.
- Trajectory-level. Was the path sound — the plan, the reasoning steps, the tool calls, the retries, the handoffs, in order? This is what separates a correct answer from a lucky one. Two runs can both pass end-to-end; only trajectory evals tell you which one will keep passing.
- Component-level. Did this one span do its job — this retrieval, this tool argument, this sub-agent's reply — scored in isolation? This is where a failure finally gets an address.
Run only the first and you're flying blind. Run all three and a red end-to-end result comes with a trail: the plan was fine, the handoff to the research agent was fine, the retrieval it depended on returned the wrong document. Now you have a bug, not a mood.
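As a rough illustration of the three levels, here is a sketch that scores one recorded run at each resolution. The `Span` and `Run` shapes and the span names are assumptions made for the example, and each span's `ok` flag stands in for whatever component-level check you actually run (a retrieval relevance score, an argument validator, and so on).

```python
from dataclasses import dataclass


@dataclass
class Span:
    """One step in a trace; `ok` is the verdict of that step's own check."""
    name: str
    ok: bool


@dataclass
class Run:
    answer_ok: bool      # end-to-end verdict against a golden answer or rubric
    spans: list[Span]    # steps in the order they executed


def end_to_end(run: Run) -> bool:
    """Level 1: did the system solve the task?"""
    return run.answer_ok


def trajectory_ok(run: Run, expected_order: list[str]) -> bool:
    """Level 2: do the expected steps appear in the trace, in order?"""
    names = iter(span.name for span in run.spans)
    # `step in names` consumes the iterator, so this is a subsequence check.
    return all(step in names for step in expected_order)


def failing_components(run: Run) -> list[str]:
    """Level 3: names of spans whose isolated check failed."""
    return [span.name for span in run.spans if not span.ok]


# The scenario from the text: plan fine, handoff fine, retrieval wrong.
run = Run(
    answer_ok=False,
    spans=[
        Span("plan", True),
        Span("handoff:research", True),
        Span("retrieval", False),
        Span("synthesis", True),
    ],
)
report = {
    "end_to_end": end_to_end(run),
    "trajectory": trajectory_ok(run, ["plan", "handoff:research", "retrieval", "synthesis"]),
    "components": failing_components(run),
}
print(report)
# {'end_to_end': False, 'trajectory': True, 'components': ['retrieval']}
```

The red end-to-end result now arrives with an address: the path was sound, and the one failing span is the retrieval.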
The four dimensions single-agent rubrics don't have
On top of those levels, multi-agent coordination adds dimensions a solo-agent scorecard never needed:
- Orchestration correctness — did the router pick the right specialist for the request? (The failure modes here depend heavily on your topology — supervisor, swarm, or handoff each break differently.)
- Handoff accuracy — was state passed across the boundary intact, or truncated and reinterpreted?
- Failure attribution — which agent or coordination step caused the observable failure?
- Coordination cost — how many extra turns, tokens, and messages the collaboration spent getting there. A system that succeeds by looping four agents through eleven turns is a latency and cost bug wearing a passing grade.
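Two of these dimensions, orchestration correctness and coordination cost, reduce to simple aggregates once each run is labelled. The sketch below assumes a hypothetical `RunRecord` with a hand-labelled `expected_agent`, and the turn and token budgets are illustrative thresholds, not recommendations.

```python
from dataclasses import dataclass


@dataclass
class RunRecord:
    expected_agent: str   # specialist a human labelled as correct (assumed available)
    routed_agent: str     # specialist the router actually chose
    turns: int            # agent turns spent on the request
    tokens: int           # total tokens spent on the request
    passed: bool          # end-to-end verdict


def routing_accuracy(records: list[RunRecord]) -> float:
    """Orchestration correctness: share of requests sent to the right specialist."""
    return sum(r.routed_agent == r.expected_agent for r in records) / len(records)


def over_budget_passes(records: list[RunRecord], max_turns: int, max_tokens: int) -> list[RunRecord]:
    """Coordination cost: passing runs that still blew the turn or token budget."""
    return [
        r for r in records
        if r.passed and (r.turns > max_turns or r.tokens > max_tokens)
    ]


records = [
    RunRecord("billing", "billing", turns=3, tokens=1200, passed=True),
    RunRecord("research", "research", turns=11, tokens=9800, passed=True),
    RunRecord("research", "billing", turns=4, tokens=1500, passed=False),
    RunRecord("support", "support", turns=2, tokens=800, passed=True),
]
print(routing_accuracy(records))  # 0.75
costly = over_budget_passes(records, max_turns=6, max_tokens=5000)
print([(r.routed_agent, r.turns) for r in costly])  # [('research', 11)]
```

The second record is the case the list warns about: a passing grade earned in eleven turns.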
The metric that actually matters
If you take one thing away, make it this: failure attribution accuracy should be your headline metric. Not task success, but attribution. How reliably can your harness, handed a failed run, name the component that caused it?
This is a strange metric because it isn't about the model at all. It's a property of your tracing and your evals. But it's the one that governs everything downstream. If attribution is good, every regression is a fix; if it's bad, every regression is a manual bisect through a trace, and your mean-time-to-repair scales with the number of agents you added — which means the architecture you adopted to move faster is now the reason you move slower. The ICSE 2026 AGENT catalogue lists 37 metrics across Outcome, Process, Product, and Framework categories; the point of that sprawl isn't to make you compute all 37. It's that "did it work" is one metric in one category, and you've been living on it alone.
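Attribution accuracy itself is cheap to compute once you have a set of failed runs with a known root cause. A minimal sketch, assuming you have hand-labelled the culprit for each failed run and that your harness emits one predicted culprit (or nothing) per run; the run IDs and component names are made up for illustration:

```python
from typing import Optional


def attribution_accuracy(
    labelled: dict[str, str],
    predicted: dict[str, Optional[str]],
) -> float:
    """Share of failed runs where the harness named the labelled culprit.

    A missing or None prediction counts as a miss.
    """
    if not labelled:
        raise ValueError("need at least one labelled failed run")
    hits = sum(predicted.get(run_id) == culprit for run_id, culprit in labelled.items())
    return hits / len(labelled)


# Root causes assigned by a human reviewing each failed trace.
labelled = {
    "run-1": "retriever",
    "run-2": "router",
    "run-3": "tool:search",
    "run-4": "handoff:research",
}
# What the harness reported for the same runs.
predicted = {
    "run-1": "retriever",
    "run-2": "retriever",    # blamed the wrong component
    "run-3": "tool:search",
    "run-4": None,           # could not attribute the failure
}
accuracy = attribution_accuracy(labelled, predicted)
print(accuracy)  # 0.5
```

The labelled set is the expensive part; it only needs refreshing when the topology changes or new failure modes start appearing.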
Split the work across agents and the outcome stops being a diagnosis. Maxim's failure-pattern work makes the same case from the reliability side: the emergent failures — individually reasonable agents producing a collectively wrong result — are invisible to any held-out set and only surface under online evaluation on real traffic. Keep the end-to-end gate in CI. But measure the seams, and measure whether your harness can find them. The system you built is a set of boundaries. Evaluate the boundaries.



