The old way to lie with a launch benchmark was to pick the one public test your model happened to win and leave the rest in a footnote. That was a solvable problem. You could go to the neutral leaderboard, find the same benchmark, and watch the number shrink. There was a denominator. In 2026, the denominator is quietly disappearing, and that is the part worth paying attention to when you read self-reported LLM benchmarks and wonder whether any of them replicate.
Look at what shipped in June. MiniMax M3 arrived on the first of the month with SWE-Bench Pro at 59.0%, Terminal-Bench 2.1 at 66.0%, and MCP-Atlas at 74.2% — every one of them run on MiniMax's own infrastructure, with the company's own agent scaffolding, and reported at a moment when the open weights and the technical report were still "about ten days out" on Hugging Face. The parameter count was undisclosed at launch. So the headline was: an open-weight model beats GPT-5.5 on coding, except you could not download it, could not see how big it was, and could not run the harness that produced the score.
Eleven days later, Kimi K2.7-Code did something structurally similar and more complete. Moonshot reported +21.8% on Kimi Code Bench v2, +11.0% on Program Bench, and +31.5% on MLS Bench Lite over the prior model. Read that list again. Kimi Code Bench, Program Bench, MLS Bench — those are all Moonshot's benchmarks. As of the release, there was no independent SWE-bench Verified result, no Terminal-Bench result, no LiveCodeBench. The model may well be excellent; that is not the issue. The issue is that Moonshot is both the examiner and the only party who has seen the exam.
The failure mode moved
Here is the non-obvious part. The community spent years learning to catch cherry-picking, and got good at it. That skill is now aimed at the wrong target. Cherry-picking assumes a shared, public test the reader can also reach. What the June launches show is a migration away from shared tests entirely — toward proprietary suites where the vendor owns the tasks, owns the grader, and owns the harness. You cannot cherry-pick-check a number that exists on exactly one leaderboard in the world, and that leaderboard is the vendor's slide deck.
A cherry-picked public benchmark can be re-run against you. A private benchmark with no external denominator cannot be re-run at all — there is nothing to replicate, only something to believe.
This matters more in agentic coding than anywhere else, because the harness is not a detail — it is most of the score. Independent audits this year put the swing from scaffolding alone at 10 to 20 percentage points on identical weights. The cleanest illustration is a single model measured two ways: Claude Opus 4.5 posts 80.9% on SWE-bench Verified and 45.9% on the harder, contamination-resistant SWE-bench Pro — a 35-point collapse without touching a weight. If you missed why those two numbers diverge so hard, the split between the Verified and Pro task sets is the whole story. When a vendor reports a proprietary number and does not disclose the scaffold, you are not reading a model result. You are reading a model-plus-unknown-plus-vendor-infra result, rounded to one decimal for the press.
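The arithmetic in that paragraph can be written down directly. The following is a minimal sketch, not anyone's official method: `point_drop` and `plausible_range` are hypothetical helper names, and the second one rests on an assumption of ours, namely that a vendor's undisclosed harness sits at the favorable end of the 10 to 20 point scaffold swing, so the whole swing is subtracted from the reported score.

```python
def point_drop(score_a: float, score_b: float) -> float:
    """Percentage-point gap between two scores posted by the same weights."""
    return round(score_a - score_b, 1)


def plausible_range(
    reported: float, swing_min: float = 10.0, swing_max: float = 20.0
) -> tuple[float, float]:
    """Range a reported score could fall to under a less favorable scaffold.

    Assumption (illustrative only): the reported harness is the favorable
    end of the swing, so the score can only move down, by swing_min to
    swing_max percentage points. Results are clamped at zero.
    """
    low = max(0.0, round(reported - swing_max, 1))
    high = max(0.0, round(reported - swing_min, 1))
    return low, high


# Same model, two task sets: SWE-bench Verified vs. SWE-bench Pro.
print(point_drop(80.9, 45.9))

# A 59.0% self-reported score with an undisclosed scaffold.
print(plausible_range(59.0))
```

The point of the second function is not the specific bounds but the shape of the answer: with the scaffold undisclosed, a single reported number is better read as an interval roughly ten points wide.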
The five-point read
None of this requires you to distrust a number. It requires you to know what kind of number it is. Before you repost a launch chart, run it through this:
- Is it on a neutral third-party leaderboard? Or only the vendor's own suite? Artificial Analysis runs all nine of its Intelligence Index evaluations itself; the public SWE-bench Verified split, LMArena, and Terminal-Bench are reproducible by outsiders. Kimi Code Bench and MCP-Atlas are not. If the only source is the launch post, the number is a claim, not a result.
- Is the eval harness disclosed? Prompt, scaffold, retries, best-of-N, validation loops. Ninety-nine of the hundred entries on the SWE-bench leaderboard are self-reported; the harness is where the inflation hides, and 10 to 20 points of it can hide in there quietly.
- Contamination — was the benchmark public before the training cutoff? A test that predates the weights is a test the model may have read. This is exactly why CAISI, evaluating DeepSeek V4 Pro, leaned on held-out sets like its internal PortBench and the ARC-AGI-2 semi-private split, and still put the model roughly eight months behind the frontier.
- Apples-to-apples? Same pass@k, same tool access, same context budget. A pass@1 for you against an implied best-of-many for them is not a comparison; it is a category error.
- Who ran it, on whose infrastructure? Vendor infra plus vendor scaffold plus vendor grader is three degrees of home advantage. Independent evaluators — Artificial Analysis, NIST's CAISI — exist precisely to remove them.
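The five questions above can be sketched as a small record you fill in per launch number. This is an illustrative encoding only: `BenchmarkClaim`, `open_questions`, and `verdict` are hypothetical names, the field values are judgment calls you make from the launch post, and the example entry is invented rather than taken from any real model.

```python
from dataclasses import dataclass


@dataclass
class BenchmarkClaim:
    """One launch-chart number and what is known about how it was produced."""
    name: str
    on_neutral_leaderboard: bool     # run or reproduced by a third party
    harness_disclosed: bool          # prompt, scaffold, retries, best-of-N
    public_before_cutoff: bool       # True means contamination is possible
    same_settings_as_baseline: bool  # same pass@k, tools, context budget
    independent_runner: bool         # not vendor infra + scaffold + grader


def open_questions(claim: BenchmarkClaim) -> list[str]:
    """Return the checklist items this claim fails, in checklist order."""
    checks = [
        (claim.on_neutral_leaderboard, "only on the vendor's own suite"),
        (claim.harness_disclosed, "eval harness not disclosed"),
        (not claim.public_before_cutoff, "benchmark public before training cutoff"),
        (claim.same_settings_as_baseline, "pass@k, tools, or context budget differ"),
        (claim.independent_runner, "run on vendor infrastructure"),
    ]
    return [reason for passed, reason in checks if not passed]


def verdict(claim: BenchmarkClaim) -> str:
    """A number with no open questions is a result; anything else is a claim."""
    return "result" if not open_questions(claim) else "claim (hold pending replication)"


# Hypothetical launch number, for illustration only.
example = BenchmarkClaim(
    name="ExampleBench (hypothetical)",
    on_neutral_leaderboard=False,
    harness_disclosed=False,
    public_before_cutoff=False,
    same_settings_as_baseline=True,
    independent_runner=False,
)

for reason in open_questions(example):
    print("-", reason)
print(verdict(example))
```

Treating any single failed check as enough to downgrade a number from "result" to "claim" is a deliberately strict design choice; a softer version could weight the checks, but the weights would themselves be unaudited.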
Self-reported does not mean false. It means unaudited. The correct posture toward an unaudited number is not disbelief; it is a hold, pending replication.
The uncomfortable truth is that the neutral denominators still work when you use them. Artificial Analysis had Claude Fable 5 at 60 on its index, Opus 4.8 at 56, GPT-5.5 at 55 — one methodology, every model run the same way. The wider erosion of what a leaderboard even certifies is a longer argument, one I've made about how uncertainty swallowed the rankings and about reading this specific crop of open-weight launches. For now, the shorter version fits on an index card: when the vendor scored its own exam, the score is not the finding. The absence of anyone else's score is.



