You can make a model meaningfully smarter without touching its weights: sample it many times for the same question and keep the best answer. This is the simplest form of test-time scaling, and two techniques dominate it. They look almost identical from a distance — both run the model N times and return one result — and they are constantly conflated. But they differ in exactly one place, and that one difference decides which problems each can solve.
The difference is the selection rule. Self-consistency picks the answer the samples agree on. Best-of-N picks the answer an external judge scores highest. Everything else — the cost, the task fit, the way performance scales with N, the way each one fails — follows from that.
Self-consistency: let the samples vote
Self-consistency, introduced by Wang et al. in 2022, is deceptively plain. Prompt the model with chain-of-thought, but instead of greedily decoding one reasoning path, sample many — the paper uses around 40. Each path may reason differently and reach a different final answer. Then throw the reasoning away and take a majority vote over the final answers. The paper's framing is that it "marginalizes out" the reasoning paths: many roads, and you trust the destination most of them reach.
The gains were large enough to make this standard practice. On PaLM-540B, self-consistency lifted GSM8K accuracy from 56.5% to 74.4%. The paper's headline improvements over plain chain-of-thought, in absolute percentage points, were +17.9 on GSM8K, +11.0 on SVAMP, and +12.2 on AQuA, with bigger models benefiting more.
The catch is structural, not incidental: you can only vote on answers you can compare. Self-consistency needs a discrete, extractable final answer — a number, a multiple-choice letter, a short string — so that "the same answer" is a well-defined thing to count. Ask a model to write a function or draft a paragraph and there is no majority answer; every sample is textually unique, and the vote collapses. Self-consistency is a tool for tasks with a checkable answer shape, and outside that shape it simply doesn't apply.
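The vote itself is only a few lines. The sketch below is a minimal illustration, not the paper's code: it assumes the final answer is the last number in each completion (the `extract_final_answer` helper and the sample strings are made up for the example), discards the reasoning, and counts. The last lines show why the vote collapses on free-form text, where every sample is unique.

```python
import re
from collections import Counter
from typing import List, Optional


def extract_final_answer(completion: str) -> Optional[str]:
    """Assumption for this sketch: the final answer is the last number in the text."""
    numbers = re.findall(r"-?\d+(?:\.\d+)?", completion)
    return numbers[-1] if numbers else None


def self_consistency(completions: List[str]) -> Optional[str]:
    """Majority vote over extracted final answers; the reasoning text is ignored."""
    answers = [a for a in map(extract_final_answer, completions) if a is not None]
    if not answers:
        return None
    winner, _count = Counter(answers).most_common(1)[0]
    return winner


# Hypothetical sampled reasoning paths for one arithmetic question.
samples = [
    "6 boxes of 7 is 42, minus 4 eaten leaves 38. The answer is 38",
    "7 * 6 = 42 and 42 - 4 = 38. The answer is 38",
    "6 + 7 = 13, then 13 - 4 = 9. The answer is 9",
]
print(self_consistency(samples))

# Free-form outputs: no two samples match, so the largest vote count is 1.
prose = ["A short poem about rain.", "Rain, described in verse.", "Verses on a storm."]
print(max(Counter(prose).values()))
```

In practice the extraction step is the fragile part: the vote is only as reliable as the rule that decides what counts as "the same answer".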
Self-consistency asks the samples to agree with each other. Best-of-N asks something outside the samples to judge them. That single difference is the whole taxonomy.
Best-of-N: let a judge score
Best-of-N removes the requirement that answers be comparable by bringing in an external scorer. Generate N candidates, run each through a judge, keep the highest-scoring one. The judge can be a trained verifier, a reward model, or — best of all when you have it — an automatic checker like a unit-test suite or a proof checker.
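As a rough sketch, the whole procedure is an argmax over candidates under whatever judge you supply. The `toy_judge` below is a deliberately silly stand-in invented for this example; a real system would plug in a trained verifier, a reward model, or an automatic checker in its place.

```python
from typing import Callable, Sequence


def best_of_n(candidates: Sequence[str], judge: Callable[[str], float]) -> str:
    """Return the candidate the judge scores highest (the first one wins ties)."""
    if not candidates:
        raise ValueError("need at least one candidate")
    return max(candidates, key=judge)


def toy_judge(text: str) -> float:
    """Hypothetical scorer: rewards an explanation, lightly penalizes length."""
    bonus = 10.0 if "because" in text else 0.0
    return bonus - len(text) / 100


# Hypothetical candidates; in a real pipeline these are N samples from the model.
candidates = [
    "It works.",
    "It works because the cache is warm.",
    "It works because the cache is warm, and the cache is warm because it was filled earlier.",
]
print(best_of_n(candidates, toy_judge))
```

Note that nothing in `best_of_n` compares candidates to each other; all of the selection behavior lives in the judge.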
The foundational result is Cobbe et al.'s 2021 GSM8K paper, which trained a verifier to rank sampled solutions. Their striking finding: verification scales better with data than simply fine-tuning the generator, to the point where a 6B-parameter verifier selecting among samples slightly outperformed a fine-tuned 175B model — roughly the benefit of a 30× increase in model size, bought instead with test-time sampling and a small judge.
Because best-of-N never compares samples to each other, it works on exactly the open-ended outputs self-consistency can't touch. Brown et al.'s 2024 "Large Language Monkeys" study makes this vivid in the verifiable regime: on SWE-bench Lite, where a candidate patch can be checked by actually running the tests, DeepSeek-Coder-V2-Instruct went from 15.9% of issues solved with one sample to 56% with 250 samples. When you can check an answer automatically, repeated sampling plus selection turns raw coverage directly into solve rate.
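With an automatic checker the judge reduces to pass or fail, and selection becomes "return the first candidate that passes". The sketch below assumes candidates are plain Python callables and the tests are input/expected pairs; real patches are text applied to a repository, so treat this as a toy version of the idea rather than how SWE-bench evaluation works.

```python
from typing import Callable, Optional, Sequence, Tuple

Candidate = Callable[[int], int]
TestCase = Tuple[int, int]  # (input, expected output)


def passes_tests(fn: Candidate, tests: Sequence[TestCase]) -> bool:
    """A candidate passes only if every test matches; a crash counts as a failure."""
    try:
        return all(fn(x) == expected for x, expected in tests)
    except Exception:
        return False


def select_verified(candidates: Sequence[Candidate],
                    tests: Sequence[TestCase]) -> Optional[Candidate]:
    """Return the first candidate that passes all tests, or None if none does."""
    for fn in candidates:
        if passes_tests(fn, tests):
            return fn
    return None


# Hypothetical task: implement squaring. Three wrong samples, one right one.
tests = [(0, 0), (3, 9), (-4, 16)]
candidates = [
    lambda x: x * 2,      # wrong on 3
    lambda x: x + x * x,  # wrong on 3
    lambda x: 1 // x,     # raises ZeroDivisionError on 0
    lambda x: x * x,      # correct
]
chosen = select_verified(candidates, tests)
print(chosen(5) if chosen else "no candidate passed")
```

Under this selection rule the problem is solved exactly when at least one sample is correct, which is why solve rate tracks coverage.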
How they scale — and how they break
This is where the two diverge most sharply, and where the practical decision lives.
Brown et al. measured coverage, the fraction of problems solved by at least one of N samples, and found it climbs roughly log-linearly with N across four orders of magnitude. The model can very often produce a correct answer; the bottleneck is picking it out. And there the two selection rules part ways. With an automatic verifier, more samples keep helping, because a correct sample, once found, can be recognized. With majority vote and no verifier, selection saturates after a few hundred samples: the vote proportions stabilize, so the winner stops changing even as coverage keeps rising, and a correct answer that appears in only 1% of samples loses to any wrong answer that appears more often, no matter how many samples you draw. Their reward-model selection experiments showed the same plateau, which is the key warning: best-of-N is only as good as its scorer.
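A back-of-the-envelope model makes the split concrete. The sketch below is a simplification, not the fit used in the paper: it assumes independent samples with a fixed per-sample success rate on one problem, and an idealized answer pool with exact proportions (1% correct, 60% one wrong answer, the rest another wrong answer). The labels and proportions are invented for illustration.

```python
from collections import Counter


def coverage(p_correct: float, n: int) -> float:
    """Chance that at least one of n independent samples is correct."""
    return 1.0 - (1.0 - p_correct) ** n


def vote_winner(n: int) -> str:
    """Majority vote over an idealized pool with fixed answer proportions."""
    n_right = n // 100        # the correct answer shows up 1% of the time
    n_wrong_a = n * 60 // 100 # one popular wrong answer
    n_wrong_b = n - n_right - n_wrong_a
    pool = ["right"] * n_right + ["wrong_a"] * n_wrong_a + ["wrong_b"] * n_wrong_b
    return Counter(pool).most_common(1)[0][0]


for n in (100, 1000, 10000):
    # Coverage keeps rising toward 1 while the vote winner never changes.
    print(n, round(coverage(0.01, n), 3), vote_winner(n))
```

A verifier would cash in the rising coverage; the vote cannot, because adding samples leaves the proportions where they were.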
That cuts the other way too. A scorer you learn — a reward model — is a proxy for what you actually want, and optimizing hard against a proxy is the textbook setup for Goodhart's law. Gao et al. (2022) measured exactly this: as you crank up best-of-N, the gap between the proxy reward and the true objective widens predictably, and larger N makes the over-optimization worse. The model finds the samples that game the reward model rather than the samples that are good. A verifier you can trust absolutely — unit tests, a formal checker — has no such failure mode, which is why best-of-N shines brightest in code and math and is riskiest with a learned judge.
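A toy pool shows the mechanism. Everything below is hand-built for illustration, including the scores; it is not data from Gao et al. Each candidate carries a true quality (normally unobservable) and a proxy reward from a hypothetical learned judge, and best-of-N selects on the proxy alone.

```python
from typing import List, NamedTuple


class Candidate(NamedTuple):
    text: str
    true_quality: float  # what we actually want; normally not observable
    proxy_reward: float  # what the hypothetical learned judge reports


# Illustrative numbers only, listed in sampling order.
POOL: List[Candidate] = [
    Candidate("solid answer", 0.70, 0.68),
    Candidate("weak answer", 0.40, 0.45),
    Candidate("good answer", 0.80, 0.75),
    Candidate("confident-sounding but wrong", 0.20, 0.95),
]


def best_of_n(pool: List[Candidate], n: int) -> Candidate:
    """Select among the first n samples using only the proxy reward."""
    return max(pool[:n], key=lambda c: c.proxy_reward)


for n in (1, 2, 3, 4):
    pick = best_of_n(POOL, n)
    # The selected proxy reward never decreases with n; true quality can drop.
    print(n, pick.text, pick.proxy_reward, pick.true_quality)
```

The selected proxy reward can only go up as N grows, because the maximum is taken over a larger set. Nothing forces true quality to follow, and the more samples you draw, the more chances the pool has to contain one that exploits the judge.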
Snell et al. (2024) tie the strands together: the optimal way to spend a test-time compute budget depends on the problem's difficulty, and a "compute-optimal" allocation across sampling and verification improved test-time compute efficiency by more than 4× over a best-of-N baseline. Sampling more is not automatically better; spending the samples well is the discipline.
Choosing
- Discrete, extractable answer (arithmetic, multiple-choice, classification): self-consistency. It's the cheapest reliable boost and needs no extra model — just a vote.
- Open-ended output (code, prose, structured generation): best-of-N. There's no majority answer to count, so you need a scorer. If you have an automatic checker (tests, a compiler, a proof checker), this is where repeated sampling pays off most.
- Open-ended output, only a learned reward model available: best-of-N still works, but keep N modest and watch for reward hacking — a higher reward score is not the same as a better answer, and the gap grows with N.
- Either way, remember the cost: both are ~N× inference. Self-consistency adds a trivial tally; best-of-N adds a verifier pass per sample. The budget is real, and the compute you spend on extra samples could instead go to letting the model think longer on a single pass.
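The checklist above can be written down as a small helper, taking the bullets in order. This is only a sketch of the article's rules of thumb: the function names and return strings are invented here, the final fallback covers a case the list does not address, and the cost function uses arbitrary units.

```python
def choose_strategy(discrete_answer: bool,
                    automatic_checker: bool,
                    reward_model: bool) -> str:
    """Apply the checklist in order; the first matching rule wins."""
    if discrete_answer:
        return "self-consistency"
    if automatic_checker:
        return "best-of-N with the automatic checker"
    if reward_model:
        return "best-of-N with the reward model, modest N"
    # Assumption: the checklist is silent on open-ended output with no judge.
    return "no selection rule available"


def total_cost(n: int, generate_cost: float, judge_cost: float = 0.0) -> float:
    """Roughly N generations, plus one judge pass per sample for best-of-N."""
    return n * (generate_cost + judge_cost)


print(choose_strategy(discrete_answer=False, automatic_checker=True, reward_model=True))
print(total_cost(40, generate_cost=1.0))                   # self-consistency
print(total_cost(40, generate_cost=1.0, judge_cost=0.25))  # best-of-N
```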
The clean way to hold all of this: self-consistency is best-of-N where the "verifier" is agreement among the samples themselves — free, but only meaningful when answers are comparable. The moment they aren't, you have to bring a real judge to the table, and then your whole system inherits that judge's blind spots. Sampling N times is the easy part. Deciding who gets to pick the winner is the engineering.
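That equivalence can be made literal. In the sketch below (names are illustrative), self-consistency is implemented by calling a generic best-of-N with a judge that scores each answer by how many samples agree with it.

```python
from collections import Counter
from typing import Callable, List


def best_of_n(candidates: List[str], judge: Callable[[str], float]) -> str:
    """Generic selection: keep the candidate the judge scores highest."""
    return max(candidates, key=judge)


def self_consistency(answers: List[str]) -> str:
    """Best-of-N where the judge is agreement among the samples themselves."""
    counts = Counter(answers)
    return best_of_n(answers, judge=lambda a: counts[a])


# Hypothetical extracted final answers from five samples.
answers = ["38", "9", "38", "41", "38"]
print(self_consistency(answers))
```

Swap the agreement judge for any external scorer and the same `best_of_n` call covers the open-ended case.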



