The number everyone quotes is 95.0. That is the score Claude Fable 5 posted on SWE-bench Verified, the most-cited coding benchmark in the industry, before the model was suspended on June 12 under export-control rules. Claude Mythos Preview sits at 93.9. GPT-5.5 reports 82.6. Read the leaderboard top to bottom and you watch a curve do the thing curves do near a ceiling: it slows, it crowds, it stops meaning much. We are inside the last few points of a benchmark that was supposed to be hard.
The story is not that the models got worse. They did not. The story is what happens to that 95 the moment you stop the test from being memorizable.
What "Verified" verifies
SWE-bench Verified is a set of real GitHub issues paired with the test that confirms a fix. A model writes a patch; if the test goes green, it scores. Clean, automatable, and exactly the shape of thing that leaks.
It leaks two ways. First, contamination: these are public repositories with public histories, and a model trained on the open internet has plausibly seen the fix. The SWE-Bench+ audit found that 32.67 percent of successful patches involved solution leakage — the answer was sitting in the issue text or its comments (SWE-Bench+, arXiv 2410.06992). Second, weak graders: a further 31.08 percent of passing patches cleared tests too thin to catch a wrong answer. Filter out the leaked solutions and the thin tests and the same study watched one agent's resolution rate collapse from 12.47 percent to 3.97 percent — fewer than a third of the wins survived contact with a stricter ruler. A separate study using differential patch testing reached the same verdict by another route: a large share of "plausible" patches that pass the suite still do not match what the developer actually fixed (arXiv 2503.15223).
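The "fewer than a third" figure follows directly from the two resolution rates quoted above. A minimal sketch of the arithmetic, with variable names chosen here purely for illustration:

```python
# Resolution rates quoted from the SWE-Bench+ audit (percent of issues resolved).
rate_before_filtering = 12.47
rate_after_filtering = 3.97

# Share of the original wins that remain once leaked solutions
# and weakly tested patches are filtered out.
surviving_share = rate_after_filtering / rate_before_filtering

print(f"surviving share of wins: {surviving_share:.1%}")   # 31.8%
print(f"fewer than a third: {surviving_share < 1 / 3}")    # True
```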
There is also the matter of who is grading whom. On the llm-stats Verified leaderboard, only one of the top hundred entries is independently verified; the other ninety-nine are vendor self-reports. A 95 from the lab that sells the model is a marketing asset wearing a lab coat.
The same models, a harder room
Scale built SWE-bench Pro specifically to resist the gaming — held-out tasks, stronger tests, an environment you cannot pattern-match your way through. Same frontier models, different room:
| Benchmark | Leader / score | What it rewards |
|---|---|---|
| SWE-bench Verified | ~95.0% (Fable 5, self-report) | A patch that turns one test green |
| Terminal-Bench 2.1 | 88.0% (Fable 5) | Multi-step CLI tasks in a live shell |
| SWE-bench Pro (SEAL standardized) | 59.1% (GPT-5.4 xHigh) | The same work, contamination-resistant |
| SWE-bench Pro (private commercial set) | 47.1% (Opus 4.6) | Code it has provably never seen |
The drop from 95 to the high 50s is not noise. It is the contamination and the scaffold inflation, measured. Note the spread within SWE-bench Pro too: Anthropic's own scaffold reports Opus 4.8 at 69.2, the leader on Scale's standardized harness sits at 59.1, and the leader on Scale's private set drops to 47.1. Three numbers, all called "best," spanning twenty-two points depending on which model ran, who held the stopwatch, and whether the model had seen the room before.
A 95 measures how well a model fits a test it may have already read. The 47 measures something closer to work.
The part that is not about coding
Here is where the easy version of this story — benchmarks are contaminated, real scores are lower — gets more interesting, and where the floor actually is.
OpenAI's GDPval was built to measure economically valuable knowledge work: 1,320 tasks across 44 occupations, graded by occupational experts. The headline is that GPT-5.2 Thinking scored 70.9 percent (GDPval, OpenAI; arXiv 2510.04374). That sounds like the asymptote arriving for white-collar work too.
It is not, and the reason is the whole point. That 70.9 is a win-or-tie rate in a blind pairwise comparison. A grader sees one expert deliverable and one model deliverable, unlabeled, and picks the better one. The model wins or ties in about 71 percent of those comparisons. That is a measure of single-artifact quality at parity: one slide deck, one memo, one drawing, judged against one human's version of the same. It says nothing about whether the model could be handed the actual job (the ambiguous brief, the missing file, the stakeholder who changes the spec on Thursday) and bring back something finished.
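To make the metric concrete, here is a minimal sketch of how a win-or-tie rate falls out of a pile of pairwise judgments. The function name, the label strings, and the toy data are all assumptions made for illustration; none of it is GDPval's actual grading code or results.

```python
from collections import Counter

def win_or_tie_rate(judgments):
    """Fraction of pairwise comparisons in which the model wins or ties.

    `judgments` holds one label per comparison: "win", "tie", or "loss",
    always from the model's point of view (hypothetical label scheme).
    """
    counts = Counter(judgments)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no judgments to score")
    return (counts["win"] + counts["tie"]) / total

# Toy data invented for this example: 5 wins, 2 ties, 3 losses.
judgments = ["win", "loss", "tie", "win", "loss",
             "win", "tie", "loss", "win", "win"]
rate = win_or_tie_rate(judgments)
print(f"win-or-tie rate: {rate:.0%}")  # 70%
```

Each label scores one artifact against one human counterpart. Nothing in the calculation knows whether those artifacts were steps in the same job.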
For that, look at METR, which measures the length of task a model can do autonomously at a given reliability. Under Time Horizon 1.1, Claude Opus 4.5's 50-percent horizon is about 4 hours 49 minutes — with a 95 percent confidence interval running from 1h49m to 20h25m (METR). A confidence interval that spans an order of magnitude is the statistical signature of "we cannot reliably say." And 50 percent is a coin flip. The reliability you would actually accept from a colleague — 80, 90 percent — lands on much shorter tasks. The floor is not low quality. It is low reliability over length.
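Why higher reliability means shorter tasks can be sketched with a toy model. The code below assumes success probability follows a logistic curve in the logarithm of task length, anchored at the 4h49m point estimate quoted above. The function name and, above all, the slope value are hypothetical placeholders, not METR figures, so the printed minutes illustrate the shape of the relationship and nothing more.

```python
import math

def horizon_at(reliability, h50_minutes, slope_per_doubling):
    """Task length (minutes) at which modeled success equals `reliability`.

    Assumed model: P(success) = sigmoid(slope * (log2(h50) - log2(t))),
    so P is exactly 0.5 at t = h50 and falls as tasks get longer.
    """
    if not 0.0 < reliability < 1.0:
        raise ValueError("reliability must be strictly between 0 and 1")
    if slope_per_doubling <= 0:
        raise ValueError("slope must be positive")
    logit = math.log(reliability / (1.0 - reliability))
    return h50_minutes * 2.0 ** (-logit / slope_per_doubling)

H50_MINUTES = 4 * 60 + 49  # the 50-percent horizon quoted in the text
ASSUMED_SLOPE = 0.6        # hypothetical steepness, NOT a METR estimate

for p in (0.5, 0.8, 0.9):
    minutes = horizon_at(p, H50_MINUTES, ASSUMED_SLOPE)
    print(f"{p:.0%} reliability -> about {minutes:.0f} minutes")
```

For any positive slope the 80 and 90 percent horizons come out shorter than the 50 percent one; how much shorter depends entirely on the assumed slope.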
So the two trends are not in tension. They are the same fact seen twice. A benchmark is a graded patch: one artifact, one rubric, one green check. A job is a chain of those, each depending on the last, with no rubric and no green check until the end. Chain nine steps that each succeed 90 percent of the time and, if the failures are independent, the whole chain succeeds less than 40 percent of the time (0.9^9 ≈ 0.387). That multiplication is the gap between the asymptote and the floor, and no amount of leaderboard creep closes it, because the leaderboard never measured the chain.
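The compounding is easy to check. A minimal sketch, assuming every step succeeds independently with the same probability (a simplification, since real steps are correlated and uneven), with `chain_success` as an illustrative name:

```python
def chain_success(step_reliability, steps):
    """Probability that every step in a chain succeeds.

    Assumes steps are independent and equally reliable.
    """
    return step_reliability ** steps

for steps in (1, 3, 5, 9):
    print(f"{steps} step(s) at 90%: {chain_success(0.9, steps):.1%}")
# 1 step(s) at 90%: 90.0%
# 3 step(s) at 90%: 72.9%
# 5 step(s) at 90%: 59.0%
# 9 step(s) at 90%: 38.7%
```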
The honest read of June 2026 is this. The models are genuinely excellent at producing one good artifact on demand. They are still, by the most contamination-resistant measures we have, finishing fewer than three in five unseen instances of the simplest such task, a single patch, and fewer than half on code they have provably never seen. The 95 is real. It is just answering a question almost nobody is actually asking.



