You wire up a supervisor, three sub-agents, a retriever, and a handful of tools. You point your eval suite at it — the same suite that scored your single agent — and it prints a number: 71% task success. The number goes up when you tune a prompt, down when you don't. It feels like evaluation. It is not. It is a smoke alarm with one light, and the light is on.

The problem isn't the score's value. It's the score's resolution. A multi-agent system doesn't fail the way a single model fails. It fails at the seams.

The failures live on the boundaries#

The whole premise of a multi-agent system is that you've split one hard job across specialists that pass work between them. A router decides who handles a request. A sub-agent does a piece and hands the result back. A tool returns a payload the next agent has to interpret. Each of those transfers is a place where state can be dropped, mangled, or misrouted — and, empirically, that's where the failures cluster. A wrong retrieval quietly changes the plan. A bad tool argument corrupts a result three steps downstream. A sub-agent handoff loses the context that made the request answerable.

An end-to-end score sees the wreck but not the collision. It tells you that something broke and refuses to say which.

None of this shows up in the final answer as anything but "wrong." And plenty of it doesn't show up as wrong at all: a run can pass end-to-end while a handoff drops half its context, because the downstream agent papered over the gap with a plausible guess that happened to land. That's the worst case — a green run built on a broken seam that a slightly different input will expose in production, after your eval suite has signed off.

Evaluate at three levels, not one#

The fix is to stop treating the system as a single black box and score it at three resolutions (Confident AI frames these cleanly):

Run only the first and you're flying blind. Run all three and a red end-to-end result comes with a trail: the plan was fine, the handoff to the research agent was fine, the retrieval it depended on returned the wrong document. Now you have a bug, not a mood.

The four dimensions single-agent rubrics don't have#

On top of those levels, multi-agent coordination adds dimensions a solo-agent scorecard never needed:

The metric that actually matters#

If you take one thing: make failure attribution accuracy your headline metric. Not task success — attribution. How reliably can your harness, handed a failed run, name the component that caused it?

This is a strange metric because it isn't about the model at all. It's a property of your tracing and your evals. But it's the one that governs everything downstream. If attribution is good, every regression is a fix; if it's bad, every regression is a manual bisect through a trace, and your mean-time-to-repair scales with the number of agents you added — which means the architecture you adopted to move faster is now the reason you move slower. The ICSE 2026 AGENT catalogue lists 37 metrics across Outcome, Process, Product, and Framework categories; the point of that sprawl isn't to make you compute all 37. It's that "did it work" is one metric in one category, and you've been living on it alone.

Split the work across agents and the outcome stops being a diagnosis. Maxim's failure-pattern work makes the same case from the reliability side: the emergent failures — individually reasonable agents producing a collectively wrong result — are invisible to any held-out set and only surface under online evaluation on real traffic. Keep the end-to-end gate in CI. But measure the seams, and measure whether your harness can find them. The system you built is a set of boundaries. Evaluate the boundaries.