If you read the two most-quoted essays on how to structure an AI agent back to back, you come away thinking the field cannot agree on anything. Cognition — the team behind Devin — published a piece called "Don't Build Multi-Agents". Anthropic published "How we built our multi-agent research system" and reported it crushing the single-agent baseline. One title is an instruction not to do the thing the other title is a case study in doing.
The instinct is to pick a side. Resist it. Both teams measured carefully, and both are reporting the truth about their workload. The reconciliation is more useful than either essay alone, and it comes down to a single question you can ask before you write a line of orchestration code.
The case against: incoherence is the default failure
Cognition's argument is not "multi-agent is slow" or "multi-agent is expensive." It is that multi-agent systems are fragile in a specific, predictable way. They name two principles. First, share context — and not just the last message, but full agent traces. Second, actions carry implicit decisions, and conflicting decisions carry bad results.
The example they use is building a clone of Flappy Bird with parallel subagents. One subagent renders a background in the visual style of Super Mario; another builds a bird that looks nothing like the world it is dropped into. Neither did anything wrong on its own terms. They simply never saw each other's work, so each made a thousand small unstated choices — art direction, physics, naming — that the other contradicted. You cannot reconcile the pieces at the end because the conflicts are baked into decisions nobody wrote down.
That is the core of it. When subtasks are tightly coupled — when every choice constrains every other choice — splitting them across agents that don't share full context manufactures disagreement. Cognition's recommendation: keep it single-threaded and linear, so context stays continuous and decisions compound instead of colliding.
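One way to picture the difference is a toy sketch, not Cognition's implementation. The `Trace` class and the two build steps below are hypothetical names invented for illustration. Each step proposes an art style, and the only thing that changes between the two runs is whether the steps share one running record of decisions.

```python
from dataclasses import dataclass, field


@dataclass
class Trace:
    """Running record of decisions made so far (hypothetical structure)."""
    decisions: dict[str, str] = field(default_factory=dict)

    def decide(self, key: str, proposed: str) -> str:
        # The first step to touch a key sets it; later steps inherit that choice.
        return self.decisions.setdefault(key, proposed)


def build_background(trace: Trace) -> str:
    style = trace.decide("art_style", "pixel-art, Mario-like")
    return f"background in {style}"


def build_bird(trace: Trace) -> str:
    # Proposes its own style, but defers to whatever the trace already holds.
    style = trace.decide("art_style", "flat vector")
    return f"bird in {style}"


def run_linear() -> list[str]:
    trace = Trace()  # one trace, passed through every step in order
    return [step(trace) for step in (build_background, build_bird)]


def run_isolated() -> list[str]:
    # Each step gets a fresh trace, so no step sees another's decisions.
    return [step(Trace()) for step in (build_background, build_bird)]


if __name__ == "__main__":
    print(run_linear())    # both pieces share the first style chosen
    print(run_isolated())  # the bird falls back to its own proposal
```

In the linear run the bird inherits the background's style because the decision was written down where the next step could read it. In the isolated run nothing is wrong with either step, yet the outputs disagree.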
The case for: parallel reading beats serial reading
Now Anthropic's post. They built an orchestrator-worker system: a lead agent (Claude Opus) that decomposes a query and spins up three to five subagents (Claude Sonnet) to chase independent threads in parallel, each with its own context window. On their internal research eval, this configuration outperformed a single-agent Opus by 90.2%.
The illustrative task is the mirror image of Flappy Bird: find every board member of the companies in the S&P 500's information-technology sector. The single agent grinds through it sequentially and fails. The multi-agent system splits the list, searches in parallel, and reconverges. Critically, the subtasks here are not coupled. One subagent's findings about Company A do not constrain another's findings about Company B. The lead agent just aggregates.
In Anthropic's analysis, token usage alone explained about 80% of the performance variance on the BrowseComp browsing evaluation.
That line is the whole game. The gains came from doing more searching, faster, across more context than a single window holds — not from agents negotiating with each other.
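The orchestrator-worker shape described above can be sketched in a few lines of `asyncio`. This is a minimal illustration under stated assumptions, not Anthropic's code: `research_company` is a hypothetical stand-in that fabricates placeholder strings where a real subagent would call a model and search tools, and the round-robin `split` is just one simple way to divide the list.

```python
import asyncio


async def research_company(company: str) -> dict[str, list[str]]:
    """Hypothetical stand-in for one lookup; returns placeholder names."""
    await asyncio.sleep(0.01)  # simulates I/O-bound searching
    return {company: [f"{company} director {i}" for i in (1, 2)]}


def split(items: list[str], n_workers: int) -> list[list[str]]:
    """Deal items round-robin into at most n_workers non-empty batches."""
    batches = [items[i::n_workers] for i in range(n_workers)]
    return [b for b in batches if b]


async def worker(batch: list[str]) -> dict[str, list[str]]:
    # Each worker keeps its own local results, mirroring a separate context.
    found: dict[str, list[str]] = {}
    for company in batch:
        found.update(await research_company(company))
    return found


async def lead_agent(companies: list[str], n_workers: int = 3) -> dict[str, list[str]]:
    # Fan out: one worker per batch, all running concurrently.
    partials = await asyncio.gather(
        *(worker(batch) for batch in split(companies, n_workers))
    )
    # Aggregate: a plain merge is enough because batches share no keys.
    merged: dict[str, list[str]] = {}
    for partial in partials:
        merged.update(partial)
    return merged


if __name__ == "__main__":
    print(asyncio.run(lead_agent(["A", "B", "C", "D"])))
```

The merge step is trivial here, and that is the point: it only works as a plain dictionary update because no worker's result depends on another's.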
The variable that decides it: coupling
Stack the two examples and the rule writes itself.
- Flappy Bird is write-heavy and stateful. You are building one coherent artifact. Every decision depends on the others. Coupling is high. Split it and you get incoherence.
- S&P 500 board members is read-heavy and parallelizable. You are gathering and aggregating. Subresults are independent. Coupling is low. Split it and you get speed.
Multi-agent is not better or worse than single-agent. It is a parallelization strategy, and parallelization pays only when the work decomposes cleanly. The deciding variable is how tightly your subtasks are coupled — which is exactly why coding, the canonical write-heavy task, is where Cognition drew its line, and research, the canonical read-heavy task, is where Anthropic drew its win.
Multi-agent wins when subtasks barely talk to each other. It loses the moment they have to.
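Coupling can be made roughly measurable. The sketch below is one crude proxy and an assumption of this example, not a metric from either essay: list the subtasks, list which pairs depend on each other, and take the fraction of pairs that are linked.

```python
from itertools import combinations


def coupling(subtasks: list[str], depends_on: set[tuple[str, str]]) -> float:
    """Fraction of subtask pairs linked by a dependency (0.0 = independent)."""
    pairs = list(combinations(subtasks, 2))
    if not pairs:
        return 0.0  # zero or one subtask: nothing to conflict with
    linked = sum(
        1 for a, b in pairs if (a, b) in depends_on or (b, a) in depends_on
    )
    return linked / len(pairs)


# Read-heavy research: no company lookup depends on another.
research = coupling(["company_a", "company_b", "company_c"], set())

# Write-heavy game build: every piece constrains every other piece.
game = coupling(
    ["background", "bird", "physics"],
    {("bird", "background"), ("physics", "bird"), ("physics", "background")},
)

if __name__ == "__main__":
    print(research, game)  # 0.0 for research, 1.0 for the game
```

Where to draw the threshold between "low" and "high" is a judgment call for your workload. The two examples in this article sit at the extremes, which is why they make the rule easy to see.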
The second gate: tokens
Even with low coupling, there is a toll. Anthropic reports that its multi-agent system burns roughly 15x the tokens of an ordinary chat, against roughly 4x for a single agent. That is not a rounding error; it is the entire economics of the decision. A 90% quality lift at 15x the token cost of a chat is a bargain for high-value research and a disaster for a routine summarization job. So the test is two gates, not one: the subtasks must be loosely coupled, and the task value must clear the token premium. Fail either and you should be running a single agent.
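The two gates reduce to a few lines of arithmetic. The sketch below is illustrative only: the 15x multiplier is the figure quoted above, while the `Task` fields, the coupling threshold of 0.2, and the token price are hypothetical values chosen for the example.

```python
from dataclasses import dataclass


@dataclass
class Task:
    coupling: float   # 0.0 (independent subtasks) to 1.0 (fully coupled)
    value_usd: float  # what a good answer is worth to you (your estimate)
    chat_tokens: int  # tokens an ordinary chat on this task would use


def multi_agent_cost_usd(
    task: Task, usd_per_million_tokens: float, multiplier: float = 15.0
) -> float:
    """Estimated spend if the task burns `multiplier` times the chat tokens."""
    return task.chat_tokens * multiplier * usd_per_million_tokens / 1_000_000


def choose_architecture(
    task: Task, usd_per_million_tokens: float, max_coupling: float = 0.2
) -> str:
    # Gate 1: tightly coupled work stays on one continuous thread.
    if task.coupling > max_coupling:
        return "single-agent"
    # Gate 2: the task must be worth more than the token premium.
    if multi_agent_cost_usd(task, usd_per_million_tokens) >= task.value_usd:
        return "single-agent"
    return "multi-agent"


if __name__ == "__main__":
    price = 10.0  # hypothetical USD per million tokens
    research = Task(coupling=0.0, value_usd=50.0, chat_tokens=20_000)
    summary = Task(coupling=0.0, value_usd=0.05, chat_tokens=5_000)
    game = Task(coupling=0.9, value_usd=500.0, chat_tokens=20_000)
    for task in (research, summary, game):
        print(choose_architecture(task, price))
```

With these made-up numbers the research task costs about 3 USD against 50 USD of value and passes both gates. The summary fails on cost, and the game fails on coupling no matter how valuable it is.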
Cognition itself later published "Multi-Agents: What's Actually Working", describing patterns where multiple agents contribute intelligence but writes stay single-threaded. That is not a reversal. It is the same rule from the other side: read in parallel, write in sequence.
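"Read in parallel, write in sequence" can be sketched as follows. This is an assumed shape, not a pattern taken from Cognition's post: the two reviewer functions are hypothetical placeholders that return canned advice, and a thread pool stands in for parallel agents.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable

Reviewer = Callable[[str], str]


def review_naming(draft: str) -> str:
    """Hypothetical read-only reviewer; a real one would call a model."""
    return "rename tmp to frame_count"


def review_style(draft: str) -> str:
    """Hypothetical read-only reviewer; a real one would call a model."""
    return "match the background palette"


def gather_advice(draft: str, reviewers: list[Reviewer]) -> list[str]:
    # Reads fan out: reviewers see the same draft and never modify it.
    with ThreadPoolExecutor() as pool:
        # Executor.map yields results in input order, so advice is stable.
        return list(pool.map(lambda reviewer: reviewer(draft), reviewers))


def write(draft: str, advice: list[str]) -> str:
    # Writes stay on one thread: a single writer applies each note in turn.
    for note in advice:
        draft += f"\n# applied: {note}"
    return draft


if __name__ == "__main__":
    draft = "def update(): tmp += 1"
    print(write(draft, gather_advice(draft, [review_naming, review_style])))
```

The reviewers can disagree with each other without harm, because only the single writer decides what actually lands in the artifact.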
What this means for your stack
If you are reaching for a framework — LangGraph, CrewAI, whatever the month's favorite is — the architecture diagram is not the decision. The decision is made before you open the framework, by classifying your task. Read-heavy and decomposable: an orchestrator fanning out to workers earns its keep. Write-heavy and stateful: one continuous agent, and pour your effort into context engineering and agent memory so that single thread holds everything it needs.
The two famous essays were never really arguing. They were describing two different jobs and using the same word for both. Ask whether your task reads or writes, count what the tokens cost, and the answer stops being a debate.



