Multi-Agent vs Single-Agent: When More Agents Actually Help

If you read the two most-quoted essays on how to structure an AI agent back to back, you come away thinking the field cannot agree on anything. Cognition, the team behind Devin, published a piece called "Don't Build Multi-Agents". Anthropic published "How we built our multi-agent research system" and reported that it decisively beat the single-agent baseline. One title is an instruction not to do the thing the other title is a case study in doing.

The instinct is to pick a side. Resist it. Both teams measured carefully, and both are reporting the truth about their workload. The reconciliation is more useful than either essay alone, and it comes down to a single question you can ask before you write a line of orchestration code.

The case against: incoherence is the default failure

Cognition's argument is not "multi-agent is slow" or "multi-agent is expensive." It is that multi-agent systems are fragile in a specific, predictable way. They name two principles. First, share context — and not just the last message, but full agent traces. Second, actions carry implicit decisions, and conflicting decisions carry bad results.

The example they use is building a clone of Flappy Bird with parallel subagents. One subagent renders a background in the visual style of Super Mario; another builds a bird that looks nothing like the world it is dropped into. Neither did anything wrong on its own terms. They simply never saw each other's work, so each made a thousand small unstated choices — art direction, physics, naming — that the other contradicted. You cannot reconcile the pieces at the end because the conflicts are baked into decisions nobody wrote down.

That is the core of it. When subtasks are tightly coupled — when every choice constrains every other choice — splitting them across agents that don't share full context manufactures disagreement. Cognition's recommendation: keep it single-threaded and linear, so context stays continuous and decisions compound instead of colliding.
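A toy sketch can make the failure concrete. It is not Cognition's code: the `Trace` structure, the step names, and the `art_style` values are all hypothetical, and a real step would call a model instead of reading a dict. It only shows that when each step sees the full trace, the first decision wins, while isolated steps each keep their own conflicting choice.

```python
from dataclasses import dataclass, field


@dataclass
class Trace:
    """Full record of decisions and steps so far (hypothetical structure)."""
    decisions: dict = field(default_factory=dict)
    log: list = field(default_factory=list)


def run_step(name, wants, trace):
    """Stand-in for one agent step; a real one would call a model with the trace.

    `wants` holds the choices this step would make if it saw nothing else.
    Because the step reads the trace first, earlier decisions take precedence.
    """
    for key, preferred in wants.items():
        chosen = trace.decisions.setdefault(key, preferred)
        trace.log.append(f"{name}: {key}={chosen}")


def run_linear(steps):
    """Single-threaded: every step shares one continuous trace."""
    trace = Trace()
    for name, wants in steps:
        run_step(name, wants, trace)
    return trace


def run_isolated(steps):
    """Each step gets a fresh trace, like subagents that never see each other."""
    return [run_linear([step]).decisions for step in steps]


# Illustrative inputs only; the style names are made up.
steps = [
    ("background", {"art_style": "pixel-retro"}),
    ("bird", {"art_style": "flat-vector", "gravity": "9.8"}),
]
linear = run_linear(steps)
isolated = run_isolated(steps)
```

In the linear run the bird step adopts the background's `art_style`; in the isolated run the two steps end up with different values for the same key, and nothing in either result records that a conflict exists.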

The case for: parallel reading beats serial reading

Now Anthropic's post. They built an orchestrator-worker system: a lead agent (Claude Opus) that decomposes a query and spins up three to five subagents (Claude Sonnet) to chase independent threads in parallel, each with its own context window. On their internal research eval, this configuration outperformed a single-agent Opus by 90.2%.

The illustrative task is the mirror image of Flappy Bird: find every board member of the companies in the S&P 500's information-technology sector. The single agent grinds through it sequentially and fails. The multi-agent system splits the list, searches in parallel, and reconverges. Critically, the subtasks here are not coupled. One subagent's findings about Company A do not constrain another's findings about Company B. The lead agent just aggregates.

Token usage alone explained about 80% of the performance variance on their browsing evaluations.

That line is the whole game. The gains came from doing more searching, faster, across more context than a single window holds — not from agents negotiating with each other.
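The orchestrator-worker shape described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not Anthropic's implementation: `worker`, `decompose`, and `orchestrate` are hypothetical names, the worker returns placeholder strings instead of searching, and threads stand in for separate context windows.

```python
from concurrent.futures import ThreadPoolExecutor


def worker(company):
    """Hypothetical subagent task: a real one would search and call a model.

    Here it only returns a placeholder finding for the company.
    """
    return {company: [f"<board member of {company}>"]}


def decompose(companies, n_workers):
    """Lead-agent step: split the list into independent, roughly equal chunks."""
    return [companies[i::n_workers] for i in range(n_workers)]


def run_chunk(chunk):
    """One subagent works through its chunk without seeing the others."""
    found = {}
    for company in chunk:
        found.update(worker(company))
    return found


def orchestrate(companies, n_workers=3):
    """Fan out to workers in parallel, then aggregate their partial results."""
    chunks = decompose(companies, n_workers)
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        partials = list(pool.map(run_chunk, chunks))
    merged = {}
    for partial in partials:
        # A plain merge is safe only because no two chunks share a company.
        merged.update(partial)
    return merged


# Placeholder company names, not real data.
companies = ["CompanyA", "CompanyB", "CompanyC", "CompanyD", "CompanyE"]
results = orchestrate(companies)
```

The aggregation step is a plain dictionary merge. That only works because no worker's output constrains another's, which is the low-coupling property the example depends on.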

The variable that decides it: coupling

Stack the two examples and the rule writes itself.

Flappy Bird is write-heavy and stateful. You are building one coherent artifact. Every decision depends on the others. Coupling is high. Split it and you get incoherence.
S&P 500 board members is read-heavy and parallelizable. You are gathering and aggregating. Subresults are independent. Coupling is low. Split it and you get speed.

Multi-agent is not better or worse than single-agent. It is a parallelization strategy, and parallelization pays only when the work decomposes cleanly. The deciding variable is how tightly your subtasks are coupled — which is exactly why coding, the canonical write-heavy task, is where Cognition drew its line, and research, the canonical read-heavy task, is where Anthropic drew its win.

Multi-agent wins when subtasks barely talk to each other. It loses the moment they have to.

The second gate: tokens

Even with low coupling, there is a toll. Anthropic reports that their multi-agent system burns roughly 15x the tokens of an ordinary chat interaction, against roughly 4x for a single agent. That is not a rounding error; it is the entire economics of the decision. A 90% quality lift that costs 15x a chat's tokens is a bargain for high-value research and a disaster for a routine summarization job. So the test is two gates, not one: the subtasks must be loosely coupled, and the task value must clear the token premium. Fail either and you should be running a single agent.
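The two gates can be written down as a small check. This is a rough sketch, not a calibrated cost model: the `Task` fields, the dollar figures, and the idea of comparing an estimated task value directly against token spend are all assumptions made for illustration. Only the 15x multiplier comes from the figure quoted above.

```python
from dataclasses import dataclass

# Multi-agent token use relative to an ordinary chat, per the figure quoted above.
MULTI_AGENT_TOKEN_MULTIPLIER = 15


@dataclass
class Task:
    name: str
    loosely_coupled: bool   # gate 1: do subresults stay independent?
    value_usd: float        # hypothetical estimate of what a good answer is worth
    chat_cost_usd: float    # hypothetical token cost of one ordinary chat on this task


def choose_architecture(task):
    """Return "multi-agent" only if the task passes both gates."""
    if not task.loosely_coupled:
        return "single-agent"  # gate 1 failed: splitting would create conflicts
    multi_cost = task.chat_cost_usd * MULTI_AGENT_TOKEN_MULTIPLIER
    if task.value_usd <= multi_cost:
        return "single-agent"  # gate 2 failed: value does not clear the premium
    return "multi-agent"


# Made-up numbers chosen only to exercise each branch.
research = Task("market research", loosely_coupled=True, value_usd=200.0, chat_cost_usd=0.50)
summary = Task("routine summary", loosely_coupled=True, value_usd=0.10, chat_cost_usd=0.02)
feature = Task("coding feature", loosely_coupled=False, value_usd=500.0, chat_cost_usd=0.50)

choices = [choose_architecture(t) for t in (research, summary, feature)]
```

With these numbers the research task clears both gates, the summary fails on cost, and the coding task fails on coupling no matter how valuable it is.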

It is worth noting that Cognition itself later published "Multi-Agents: What's Actually Working", describing patterns where multiple agents contribute intelligence but writes stay single-threaded. That is not a reversal. It is the same rule from the other side: read in parallel, write in sequence.

What this means for your stack

If you are reaching for a framework — LangGraph, CrewAI, whatever the month's favorite is — the architecture diagram is not the decision. The decision is made before you open the framework, by classifying your task. Read-heavy and decomposable: an orchestrator fanning out to workers earns its keep. Write-heavy and stateful: one continuous agent, and pour your effort into context engineering and agent memory so that single thread holds everything it needs.

The two famous essays were never really arguing. They were describing two different jobs and using the same word for both. Ask whether your task reads or writes, count what the tokens cost, and the answer stops being a debate.

Frequently asked

When should I use a multi-agent system instead of a single agent?

When the task is read-heavy and decomposes into loosely coupled subtasks (independent search or exploration whose results just get aggregated) and its value clears the roughly 15x token premium that multi-agent incurs. For write-heavy, stateful work where every decision depends on the others, a single continuous agent is safer.

Why do Cognition and Anthropic seem to disagree about multi-agent systems?

They measured different workloads. Cognition's case (coding) is write-heavy and tightly coupled, where splitting work across agents that don't share full context produces incoherence. Anthropic's case (research) is read-heavy and parallelizable, where parallel subagents outperformed a single agent. The deciding variable is task coupling, not the architecture itself.

How much more expensive are multi-agent systems?

Anthropic reported their multi-agent research system used roughly 15x the tokens of an ordinary chat interaction. That premium is the core economics of the decision: a large quality gain can justify it for high-value tasks, but it makes multi-agent a poor fit for routine, low-value work.

Dimension	Single-agent	Multi-agent
Shape	One continuous context thread	Orchestrator fanning out to workers
Wins on	Write-heavy, stateful, tightly coupled tasks	Read-heavy, parallelizable, loosely coupled tasks
Failure mode	Slow on wide parallel search	Incoherence from conflicting unstated decisions
Token cost	Baseline	~15x a single chat (Anthropic)
Canonical example	Building one coherent artifact (Flappy Bird)	Gathering S&P 500 board members in parallel
Decide by	Default unless the task clearly decomposes	Subtasks loosely coupled AND value clears the token premium

Dex Mareno
