The Confidence Interval Ate the Leaderboard

Here is a number that should bother anyone who picks models for a living. On GPQA Diamond — the graduate-level science benchmark that frontier labs quote as evidence of reasoning — the reported gap between the top two models this spring was about three tenths of a percentage point. The test has 198 questions. Three tenths of a point is less than one question.

The leaderboard prints that gap as a ranking. One model is "first," the other "second." But the ranking is describing the outcome of a coin landing on its edge, and calling the edge a winner.

The arithmetic the leaderboards skip

A benchmark score is a sample statistic. You ask a model some fixed number of questions, count how many it gets right, and divide. Like any proportion estimated from a finite sample, it carries a margin of error, and that margin depends almost entirely on one thing: how many questions there are.

The standard error of a proportion is sqrt(p·(1−p)/n), where p is the score as a fraction and n is the number of questions. Run the numbers on the two benchmarks everyone quotes:

GPQA Diamond. 198 questions. A model scoring around 94% has a standard error of sqrt(0.94·0.06/198) ≈ 1.7 points. The 95% confidence interval is roughly ±3.3 points.
SWE-bench Verified. 500 problems. A model around 81% has a standard error of sqrt(0.81·0.19/500) ≈ 1.8 points. The 95% interval is about ±3.4 points.
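Those two calculations fit in a few lines if you want to check them or plug in another benchmark. This is a minimal sketch using the normal approximation; the function name `score_interval` is made up for illustration, and 1.96 is the usual multiplier for a 95% interval. The approximation gets rough for scores close to 100%, so treat the widths as ballpark figures.

```python
import math

def score_interval(p, n, z=1.96):
    """Return (standard error, 95% half-width), both in percentage points,
    for a score p (as a fraction) measured on n questions.
    Uses the normal approximation to the binomial."""
    se = math.sqrt(p * (1 - p) / n)
    return 100 * se, 100 * z * se

# Scores and question counts as quoted in the text above.
for name, p, n in [("GPQA Diamond", 0.94, 198), ("SWE-bench Verified", 0.81, 500)]:
    se, half_width = score_interval(p, n)
    print(f"{name}: SE = {se:.1f} points, 95% CI = +/-{half_width:.1f} points")
```

The only input that moves the interval much is n: to halve the width, you need four times as many questions.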

Now lay the reported scores on top. On GPQA Diamond, the leaders cluster around 94.3–94.6%. On SWE-bench Verified, the top coding models sit at 80.8–80.9%. The gaps between rank one and rank two are 0.3 and 0.1 points. The uncertainty on each individual score is more than three points. The signal is at least an order of magnitude smaller than the noise it's swimming in.
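One way to feel how small that signal is: simulate two models that are exactly equally good and see how often the scoreboard separates them anyway. The sketch below is a toy under stated assumptions, not a model of any real evaluation. It treats every question as an independent coin flip with the same 94% success rate and scores the two models independently (real models share questions, so their errors are correlated). The function name, trial count, and seed are arbitrary choices.

```python
import random

def simulate_gaps(p=0.94, n=198, trials=10_000, seed=0):
    """Score two models with the SAME true accuracy p on n questions,
    many times over, and return the absolute gaps in percentage points."""
    rng = random.Random(seed)
    gaps = []
    for _ in range(trials):
        a = sum(rng.random() < p for _ in range(n))  # correct answers, model A
        b = sum(rng.random() < p for _ in range(n))  # correct answers, model B
        gaps.append(100 * abs(a - b) / n)
    return gaps

gaps = simulate_gaps()
share = sum(g >= 0.3 for g in gaps) / len(gaps)
print(f"Identical models separated by at least 0.3 points: {share:.0%} of trials")
```

On a 198-question test, the only way to land under a 0.3-point gap is an exact tie in correct answers, so under these assumptions the share should come out near 90%. A leaderboard built this way would crown a "winner" between two identical models most of the time.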

A 0.1-point lead on a 500-problem test is half a problem. We are publishing standings decided by half a problem and reading them as if they were measurements.

To call a difference real, you don't compare it to one model's interval. You test the difference of two proportions, whose standard error is larger still: about sqrt(2) times a single one, if the two scores are treated as independent samples. For these benchmarks that puts the threshold for a statistically distinguishable result somewhere near a five-point gap (on GPQA Diamond, 1.96 × sqrt(2) × 1.7 ≈ 4.7 points). Below that, two models are tied, and no amount of decimal places in the press release changes it. Nearly every frontier-vs-frontier comparison published in 2026 is below that line.
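Here is that test as a short sketch. It assumes the two scores are independent samples and uses an unpooled two-sided z-test; since both models actually answer the same questions, a paired test on per-question results would be more exact, but leaderboards rarely publish per-question data. The function names are illustrative, and the scores are the ones quoted earlier.

```python
import math

def min_detectable_gap(p, n, z=1.96):
    """Smallest gap, in points, between two independent scores near p
    (each on n questions) that clears a two-sided 95% z-test."""
    se_diff = math.sqrt(2 * p * (1 - p) / n)
    return 100 * z * se_diff

def z_score(p1, p2, n):
    """z statistic for the difference of two independent proportions."""
    se_diff = math.sqrt(p1 * (1 - p1) / n + p2 * (1 - p2) / n)
    return (p1 - p2) / se_diff

print(f"GPQA Diamond threshold: {min_detectable_gap(0.94, 198):.1f} points")
print(f"SWE-bench Verified threshold: {min_detectable_gap(0.81, 500):.1f} points")
print(f"z for 94.6% vs 94.3% on 198 questions: {z_score(0.946, 0.943, 198):.2f}")
print(f"z for 80.9% vs 80.8% on 500 problems: {z_score(0.809, 0.808, 500):.2f}")
```

A z value needs to reach about 1.96 to count as significant at the 95% level. Both reported gaps come out far below that.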

It's worse than sampling error

Sampling error is the optimistic version of the problem, because it assumes the only thing varying is which questions you happened to pick. Two other sources of variance sit underneath it, and both are usually larger.

The first is run-to-run nondeterminism. Ask the same model the same 198 questions twice and you will not get the same score, because sampling temperature, tool-call timing, and harness retries all move a few answers each pass. Labs that report best-of-several runs are quietly widening the gap between the number on the slide and the number you'd reproduce.
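To see why best-of-several reporting inflates the headline number, here is a toy simulation. Every number in it is hypothetical: it assumes a 198-question test on which 181 answers are always right, 10 flip like a fair coin from run to run, and the rest are always wrong. No real model or harness is being described.

```python
import random
import statistics

def simulate_runs(stable_right=181, unstable=10, total=198, runs=5, seed=0):
    """Simulate repeated runs of one model. All counts are hypothetical:
    `stable_right` questions are always answered correctly, `unstable`
    ones flip like a fair coin on each run, and the rest are always wrong."""
    assert stable_right + unstable <= total
    rng = random.Random(seed)
    scores = []
    for _ in range(runs):
        flipped_right = sum(rng.random() < 0.5 for _ in range(unstable))
        scores.append(100 * (stable_right + flipped_right) / total)
    return scores

scores = simulate_runs()
print(f"mean of runs: {statistics.mean(scores):.1f}")
print(f"best of runs: {max(scores):.1f}")
print(f"spread (sample std dev): {statistics.stdev(scores):.1f} points")
```

The best run can never be below the mean, and the more runs you take the best of, the further above it you should expect to land. Reporting the mean and the spread is the honest version of the same experiment.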

The second is harness variance, which Epoch AI has documented carefully: the same model on the same benchmark can swing several points depending on the scaffolding — the prompt template, how partial credit is graded, whether the agent gets to retry a failed patch. On SWE-bench in particular, the evaluation harness can be worth more points than the model upgrade it's supposedly measuring. When the measurement apparatus has a bigger effect than the thing being measured, the ranking is an artifact of the apparatus.

What this is not saying

It is not saying the models are all the same. It is saying these tests can no longer tell them apart. That is the natural endpoint of a benchmark: a test designed so PhDs score around 65% is genuinely informative when models score 50% and genuinely useless when they all score 94%, because the remaining 6% is where the broken questions, the ambiguous answer keys, and the contamination live. Saturation doesn't mean the work is done. It means the ruler has run out of marks.

The benchmarks that still discriminate are the ones with headroom: Humanity's Last Exam, hard agentic task suites, anything contamination-resistant where scores are low enough that a real gap clears the margin of error. Those are worth reading. A two-point move at the top of a saturated test is not — it's the same coin, landing on its edge again.

So when you choose a model this quarter, stop sorting by rank. The top five are, to the precision the tests can actually support, tied. Sort by the things that are measured without a confidence interval swallowing them whole: dollars per million tokens, latency at your p95, context window, whether it fails your own evals on your own data. Those numbers are real. The 0.3-point lead is a rounding error wearing a medal.

Priya Sundaram
