In 2023, Yilun Du and co-authors published a genuinely appealing idea: instead of trusting one language model's first answer, spin up several instances, let them propose answers and argue over each other's reasoning for a few rounds, and take the consensus. On math and factual-reasoning benchmarks it worked — accuracy went up, hallucinations went down. The result was intuitive enough that "multi-agent debate" (MAD) became a default reach for anyone trying to squeeze more correctness out of a model.
Three years of follow-up work has clarified something the original framing obscured. Debate does beat the baseline it was measured against. It's just the wrong baseline.
The comparison that flatters debate#
Almost every debate demo compares MAD to a single greedy chain-of-thought answer: one model, one pass, whatever it says first. Against that, of course debate wins — you've replaced one sample with a dozen and added rounds of revision. But debate isn't free. Routing one query to N agents over R rounds spends three to five times the tokens of a single pass (iMAD). So the honest question is not "debate vs one cheap answer." It's: given a fixed compute budget, do those extra tokens do more as debate — or as something simpler?
The simplest something is self-consistency: sample the same model N times independently and take the majority answer. Same token bill, no orchestration, no message-passing. And on that head-to-head, the argument for debate mostly evaporates.
What holding compute constant reveals#
The iMAD survey pins MAD's gain over chain-of-thought at roughly 1.5% to 5.3% — real, but modest for a 3-5x cost. More pointed, a 2026 study with the deadpan title The Cost of Consensus finds that within the 7-8B model class, isolated self-correction — a model checking its own work, no debate partners — offers a better cost-accuracy tradeoff than unguided homogeneous debate. And across Qwen3, DeepSeek-R1-Distill, and Gemini 2.5, single agents match or exceed multi-agent setups once you actually control for compute. A recurring finding in this literature is that many reported multi-agent "wins" are better explained by unaccounted-for extra computation than by any benefit of coordination itself.
Most of debate's headline gain isn't the agents cooperating. It's the tokens. Hold the token budget flat and the cooperation frequently nets to zero — or worse.
The failure mode solo sampling can't have#
Worse, because debate has a downside self-consistency structurally cannot. When independent samples vote, a wrong sample is just outvoted; it has no way to reach into another sample and change it. In a debate, it does. Multiple studies document agents that had the correct answer revising to an incorrect one under peer pressure — a confident neighbor disagrees, and the right answer folds. This happens even when the stronger models outnumber the weaker ones. Long debates add a second pathology, problem drift: over successive rounds the conversation wanders off the question it was supposed to answer.
Both failures come from the same design choice — letting the samples influence each other. That influence is sold as the feature. It's also the bug. It sits downstream of an older, uncomfortable result: LLMs largely cannot reliably self-correct reasoning without an external signal telling them they're wrong. Debate supplies social pressure, not ground truth, and a model that can't tell right from wrong on its own can't reliably tell it from a peer's confident assertion either.
Where debate still earns its bill#
This isn't a case for never running more than one agent — it's a case against the specific pattern of identical agents told to reach consensus. Debate keeps its value in a narrower band, and the band has a shape: heterogeneous roles plus a grounding signal. A solver paired with an agent whose entire job is to refute — not to agree — on a task with a checkable answer (code that runs, math that verifies, a claim you can ground against a retrieved source) is no longer averaging toward the mean. It's adversarial verification, and adversarial verification does add signal precisely because the critic isn't trying to converge.
So the practical rule inverts the default. Reach first for self-consistency or best-of-N with a verifier — it captures most of the accuracy for none of the peer-pressure risk and none of the orchestration. Reach for debate only when you can give it two things the homogeneous version lacks: role diversity, so the agents aren't just nodding, and a way to check the answer, so consensus has to survive contact with reality instead of just outvoting it.
The question worth asking before you wire up five arguing agents isn't "will they do better than one?" It's "will they do better than one, sampled five times, that never had to listen to the other four?" For most tasks, the answer is no.



