For two years, the way you bragged about a coding model was a single number: its score on SWE-bench Verified, the 500 human-checked GitHub issues that became the industry's standard exam. The frontier labs crossed 70% on it and kept climbing. Then Scale AI released a successor, ran the same models, and the number fell off a cliff. On the public set of SWE-bench Pro, GPT-5 scores 23.3% and Claude Opus 4.1 scores 23.1% — against 70%+ on the test they'd been acing.
The obvious reading is that Pro is just harder. That's true, but it's the least interesting thing about it. The drop is better understood as a measurement: it's roughly the size of the part of the old score that was never measuring coding ability in the first place.
What the old number was actually counting
A SWE-bench Verified score folds together at least three things you'd want to keep separate, and reports them as one.
The first is memorization. The benchmark is built from public GitHub repositories, and the "gold" patch that solves each issue is sitting right there in the commit history. Any model trained on GitHub after the dataset shipped has plausibly read the answers. This isn't a hypothetical: the 2025 paper The SWE-Bench Illusion found that models could name the exact buggy file paths and functions from the issue text alone, with no access to the repository — instance-level verbatim recall ranging from about 11.7% on the weakest model to 31.6% on the strongest, and rising monotonically across the Claude generations. A separate December 2025 study put the same models on the real benchmark and on a matched set of fresh, non-benchmark repositories; they were several times better at finding the edited files on SWE-bench. That asymmetry is the fingerprint of memory, not reasoning.
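The asymmetry argument can be made concrete with a small sketch. This is not the procedure from either paper; it only illustrates the comparison, and the function names, file paths, and toy predictions are all hypothetical. The idea is to ask a model for the edited file given only the issue text, score exact-match accuracy separately on benchmark instances and on fresh repositories, and look at the ratio.

```python
from typing import Sequence


def localization_accuracy(predicted: Sequence[str], gold: Sequence[str]) -> float:
    """Fraction of instances whose predicted file path exactly matches the gold path."""
    if len(predicted) != len(gold):
        raise ValueError("predicted and gold must have the same length")
    if not gold:
        raise ValueError("need at least one instance")
    hits = sum(p == g for p, g in zip(predicted, gold))
    return hits / len(gold)


def asymmetry_ratio(benchmark_acc: float, fresh_acc: float) -> float:
    """How many times better the model localizes on the benchmark than on fresh repos."""
    if fresh_acc <= 0:
        raise ValueError("fresh-repo accuracy must be positive to form a ratio")
    return benchmark_acc / fresh_acc


# Hypothetical toy data: paths a model guessed from issue text alone (no repo access).
benchmark_gold = ["pkg/a.py", "pkg/b.py", "pkg/c.py", "pkg/d.py"]
benchmark_pred = ["pkg/a.py", "pkg/b.py", "pkg/c.py", "pkg/x.py"]
fresh_gold = ["lib/m.py", "lib/n.py", "lib/o.py", "lib/p.py"]
fresh_pred = ["lib/m.py", "lib/y.py", "lib/z.py", "lib/w.py"]

benchmark_acc = localization_accuracy(benchmark_pred, benchmark_gold)
fresh_acc = localization_accuracy(fresh_pred, fresh_gold)
ratio = asymmetry_ratio(benchmark_acc, fresh_acc)
print(f"benchmark={benchmark_acc:.2f} fresh={fresh_acc:.2f} ratio={ratio:.1f}x")
```

A ratio near 1 is what you'd expect if the model were reasoning from the issue text. A ratio well above 1 on matched repositories is the pattern the studies describe. Matching the repositories fairly is the hard part, and this sketch does not model it.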
The second is broken tests. When OpenAI stopped reporting SWE-bench Verified in February 2026, it audited the hardest problems its own model kept failing across many runs. Of that audited subset, more than 59% turned out to have flawed test cases — checks so narrow or so wide that they rejected correct fixes or accepted wrong ones. Some of those "failures" were the model being right and the grader being broken.
The third is scaffolding. The same model, wrapped in a more aggressive agent harness that retries, explores files, and runs the tests in a loop, can pick up double-digit points over its score with minimal scaffolding. At that point you are partly benchmarking the rig, not the model.
Three different leaks, one number, no way to tell them apart. A 72% on SWE-bench Verified in 2026 is not a lie, exactly. It's just uninterpretable.
You can't memorize a test you were never allowed to see. That sentence is the entire design of SWE-bench Pro.
The fix is a license, not a puzzle
Here is the genuinely non-obvious move. The standard instinct for a saturated benchmark is to write harder problems. Scale did make the tasks bigger — SWE-bench Pro spans Python, Go, JavaScript and TypeScript (Verified is Python-only), and its tasks are multi-file, with reference patches averaging around 107 lines across four files. But difficulty isn't what defends a benchmark against contamination. Difficulty just buys time until the harder problems leak too.
What actually defends it is un-memorizability, and the instrument Scale reached for is a copyright license. The public and held-out sets of SWE-bench Pro are drawn only from repositories under strong copyleft terms — GPL and AGPL. The reasoning is adversarial: copyleft is a legal deterrent against that code being folded into a commercial training corpus, because training on it arguably propagates the license. The benchmark weaponizes the one thing a lab's lawyers won't let it ignore. On top of that, Pro keeps 858 held-out tasks whose solutions are never published and 276 tasks from private startup codebases that Scale runs on the model's behalf and never releases. The set you'd most want to train on is the one you can never see.
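As a rough illustration of the license screen described above, here is a minimal sketch that keeps only repositories whose declared license falls in the GPL or AGPL families. It is not Scale's actual pipeline. The `Repo` type, the repository names, and the choice of which SPDX identifiers to include are assumptions made for the example.

```python
from dataclasses import dataclass

# A subset of SPDX identifiers for the GPL and AGPL families (strong copyleft).
STRONG_COPYLEFT = frozenset({
    "GPL-2.0-only", "GPL-2.0-or-later",
    "GPL-3.0-only", "GPL-3.0-or-later",
    "AGPL-3.0-only", "AGPL-3.0-or-later",
})


@dataclass(frozen=True)
class Repo:
    name: str          # hypothetical repository name
    spdx_license: str  # SPDX license identifier taken from the repo's metadata


def eligible_repos(repos):
    """Keep only repositories under a GPL or AGPL license."""
    return [r for r in repos if r.spdx_license in STRONG_COPYLEFT]


# Hypothetical candidates; none of these are real benchmark repositories.
candidates = [
    Repo("example/alpha", "MIT"),
    Repo("example/beta", "GPL-3.0-only"),
    Repo("example/gamma", "Apache-2.0"),
    Repo("example/delta", "AGPL-3.0-or-later"),
]
selected = [r.name for r in eligible_repos(candidates)]
print(selected)
```

The filter is trivial on purpose: the defense lives in the legal terms attached to the code, not in anything clever about selection.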
This reframes what a benchmark is. Its value was never its difficulty; it was its half-life — how long until the answers are in everyone's training data. Measured that way, the most important property of an eval isn't the cleverness of its problems. It's whether you can keep them secret. And the most effective secrecy mechanism anyone has found so far is not cryptography or a harder puzzle. It's the GNU General Public License.
The arms race just moved up a floor
Don't mistake this for a solved problem. It isn't; it's a relocated one. SWE-bench Pro scores have already split into two kinds. There is the standardized leaderboard, where Scale runs every model through identical scaffolding to isolate capability, and where the leader as of June 18, 2026 is GPT-5.4 (xHigh) at 59.1%. And there are the vendor-reported numbers, run on each lab's own harness, which sit higher: Claude Opus 4.8 at 69.2%. Same benchmark, ten-point gap, different rigs, and different models too, so the gap can't be pinned on the harness alone. But the pattern is the point. The harness-inflation problem the old benchmark suffered from didn't die; it climbed one level up, from the questions to the apparatus.
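One way to keep the two kinds of score from being mixed up is to store each number with its context and refuse to compare across harnesses. The sketch below does that with the two figures quoted above. The `ScoreReport` type, its field names, and the harness labels are hypothetical, and the vendor number's date is left empty because the text does not give one.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass(frozen=True)
class ScoreReport:
    model: str
    score_pct: float
    benchmark: str
    harness: str                  # who ran it, and on what scaffolding
    as_of: Optional[date] = None  # None when the source gives no date

    def comparable_with(self, other: "ScoreReport") -> bool:
        """Scores are directly comparable only on the same benchmark and harness."""
        return self.benchmark == other.benchmark and self.harness == other.harness


# Figures from the text; the harness labels are illustrative, not official names.
standardized = ScoreReport("GPT-5.4 (xHigh)", 59.1, "SWE-bench Pro",
                           "Scale standardized scaffolding", date(2026, 6, 18))
vendor = ScoreReport("Claude Opus 4.8", 69.2, "SWE-bench Pro",
                     "vendor harness")

gap = round(vendor.score_pct - standardized.score_pct, 1)
print(f"gap={gap} points, comparable={standardized.comparable_with(vendor)}")
```

The subtraction still runs, which is the problem: nothing in a bare number stops you from computing a gap that mixes model differences with harness differences.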
Which is the practical lesson for anyone choosing a model. A coding-agent score is not a scalar; it's a number attached to a tuple of (benchmark, harness, date), and a number quoted without all three is closer to marketing than to evidence. Prefer evals with a held-out or continuously refreshed component; decontaminated pipelines like SWE-rebench exist precisely because a static public test has a shelf life. Read leaderboard entries as confidence intervals rather than ranks, and lean on online over offline evaluation where you can, because production is the one environment nobody gets to train on in advance.
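To make the confidence-interval reading concrete, here is a minimal sketch that puts a 95% Wilson score interval around a pass rate. It assumes each task is an independent pass/fail trial, which real benchmark tasks only approximate, and it captures sampling noise only, not harness or contamination effects. The example uses 360 of 500 tasks, which is a 72% score on a 500-task benchmark the size of SWE-bench Verified.

```python
import math
from typing import Tuple


def wilson_interval(successes: int, n: int, z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for a binomial proportion (z=1.96 gives about 95%)."""
    if n <= 0 or not 0 <= successes <= n:
        raise ValueError("need n > 0 and 0 <= successes <= n")
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half_width = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half_width, center + half_width


# 72% on a 500-task benchmark, treated as 360 independent passes out of 500.
low, high = wilson_interval(360, 500)
print(f"72.0% on 500 tasks: 95% interval {low:.1%} to {high:.1%}")
```

On a few hundred tasks the interval spans several points, so two models a point or two apart on the same harness are not meaningfully ranked by that difference alone.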
SWE-bench Verified didn't stop being useful because the models got too good for it. It stopped being useful because it became impossible to tell what its number meant. SWE-bench Pro's answer — keep the test where the training data can't reach — is the right one. It's also temporary. The only benchmark that stays honest is the one you haven't published yet.



