Here is the question every engineering lead is actually asking, phrased honestly: will this coding agent close my tickets without breaking things, and what will it cost me? And here is the question the public leaderboards answer: which frontier model resolves the most GitHub issues in twelve open-source Python repos? Those are not the same question. For most of 2025 we pretended they were, and the pretense collapsed this year.

A leaderboard score tells you how an agent does on someone else's code. It says almost nothing about how it will do on yours — and the higher the score, the less it says.

Why the public benchmarks mislead

Three things went wrong at once, and they compound.

Saturation. SWE-bench Verified — the human-validated 500-task subset that became the default coding scoreboard — is topped out. As of late June 2026 the leading entries cluster in the high 80s and 90s, with Claude Opus 4.8 at 88.6%. When the frontier is bunched inside a few points, the ranking is measuring noise, harness luck, and eval quirks, not a difference you will feel.

Contamination. This is the fatal one. SWE-bench's tasks are drawn from real, public GitHub issues and their merged pull requests. Public data is training data. OpenAI introduced the Verified set, and its own analysis later retired it: on 23 February 2026 it stopped reporting against SWE-bench Verified entirely. It cited three problems: training-data contamination across every frontier model, defective tests (it found roughly 59% of failed test cases were themselves flawed), and saturation. A model can score high because it has seen the fix, not because it can find one. Your private codebase offers no such memory to lean on.

Wrong shape. Even uncontaminated, these benchmarks test a one-shot patch against a curated issue with a ready-made test. That is not your workflow. Your workflow is an underspecified ticket, a codebase with local conventions the agent has never seen, existing tests that must keep passing, and a human who has to read and approve the diff. The benchmark measures the easy 20% of the job.

The gap is not subtle. The same model generation that scores near 88% on Verified scores around 23% on the harder SWE-bench Pro. If switching benchmarks can erase roughly three-quarters of a model's apparent competence, no public number is a promise about your repo.

What the benchmarks actually measure — and the hole they leave

Read them for what they are, not what the marketing implies.

Each is useful as a coarse floor filter. None of them is your codebase. That is the hole, and only you can fill it.

The recipe: a private, held-out eval from your own repo

The evaluation that predicts your outcome is one you build. It is more work than reading a chart, and it is the only work that counts.

1. Harvest tasks from your own recently-closed issues and PRs. Take issues that were closed by a merged fix in the last few months. The issue text is the prompt. The merged PR is the reference solution you hide. Because these come from your repo and your recent history, the agent cannot have trained on the resolution — this is your contamination defense, the same instinct behind Pro's held-out split, applied to the only repo you care about. If you have never built a labeled eval before, the mechanics carry over directly from how to build an LLM eval dataset.
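A minimal sketch of the harvesting step, assuming you have already exported closed issues and their fixing PRs into plain records. The `ClosedIssue` and `EvalTask` fields, the `harvest_tasks` helper, and the 90-day window are all illustrative assumptions, not a prescribed schema.

```python
from dataclasses import dataclass
from datetime import date, timedelta


@dataclass(frozen=True)
class ClosedIssue:
    """Hypothetical export of one closed issue and the PR that fixed it."""
    issue_id: int
    issue_text: str
    base_commit: str               # repo state just before the fix
    fix_commit: str                # merged PR, never shown to the agent
    merged_on: date
    added_tests: tuple[str, ...]   # tests that shipped with the fix


@dataclass(frozen=True)
class EvalTask:
    task_id: str
    prompt: str                    # the issue text, verbatim
    base_commit: str
    hidden_fix_commit: str
    hidden_tests: tuple[str, ...]


def harvest_tasks(issues, today, max_age_days=90):
    """Keep recent fixes that shipped with tests and turn each into a task."""
    cutoff = today - timedelta(days=max_age_days)
    tasks = []
    for issue in issues:
        if issue.merged_on < cutoff:
            continue  # assumption: older fixes are likelier to be in training data
        if not issue.added_tests:
            continue  # no shipped test means no automatic grader
        tasks.append(EvalTask(
            task_id=f"issue-{issue.issue_id}",
            prompt=issue.issue_text,
            base_commit=issue.base_commit,
            hidden_fix_commit=issue.fix_commit,
            hidden_tests=issue.added_tests,
        ))
    return tasks
```

Dropping issues whose fix shipped without tests shrinks the set, but it keeps every remaining task automatically gradable.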

2. Define the task and the oracle. Give the agent the repo state before the fix and the issue description. The oracle — the automatic grader — is your own test suite: the tests that shipped with the real PR, plus the existing suite that must stay green. Hidden tests as ground truth is exactly how the public benchmarks verify; the difference is that here the tests are yours, so passing them means the agent did your job.
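One way to express that oracle, as a sketch: compare a test run on the pre-fix repo with a run after the agent's patch. The `is_resolved` name and the `{test_name: passed}` result format are assumptions for illustration; how you collect those results from your test runner is up to you.

```python
def is_resolved(before, after, fix_tests):
    """Grade one attempt from two test runs.

    before:    {test_name: passed} on the repo state before the fix
    after:     {test_name: passed} after applying the agent's patch
    fix_tests: names of the tests that shipped with the real PR
    """
    fix_tests = set(fix_tests)
    # Every test that shipped with the real fix must now pass.
    fixes_pass = all(after.get(name, False) for name in fix_tests)
    # Every other test that was green before must still be green.
    # A test missing from `after` counts as failed, so deleting a test
    # is treated as a regression rather than a pass.
    regressions = sorted(
        name for name, passed in before.items()
        if passed and name not in fix_tests and not after.get(name, False)
    )
    return {"resolved": fixes_pass and not regressions, "regressions": regressions}
```

Returning the list of regressed tests alongside the verdict makes a failed attempt easier to triage than a bare boolean.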

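3. Pick metrics that map to money and risk, not vanity. Four are enough: pass@1 on your tasks, regression rate, cost per solved task, and human review time per diff.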
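As a sketch, the four numbers this piece says to report together (pass@1, regression rate, cost per solved task, review time) can be computed from one record per attempt. The `Attempt` fields and the `summarize` helper are hypothetical, and the median is one reasonable choice for review time, not the only one.

```python
from dataclasses import dataclass
from statistics import median


@dataclass(frozen=True)
class Attempt:
    """Hypothetical record of a single first-try run on one task."""
    task_id: str
    resolved: bool         # fix tests pass and nothing regressed
    regressed: bool        # broke at least one previously green test
    cost_usd: float        # total spend for the attempt
    review_minutes: float  # human time to read and judge the diff


def summarize(attempts):
    """Aggregate one attempt per task into the four headline metrics."""
    n = len(attempts)
    if n == 0:
        raise ValueError("no attempts to summarize")
    solved = sum(a.resolved for a in attempts)
    total_cost = sum(a.cost_usd for a in attempts)
    return {
        "pass_at_1": solved / n,
        "regression_rate": sum(a.regressed for a in attempts) / n,
        # Failed attempts still cost money, so divide ALL spend by solved tasks.
        "cost_per_solved_task": total_cost / solved if solved else float("inf"),
        "median_review_minutes": median(a.review_minutes for a in attempts),
    }
```

Charging failed attempts against the solved count is deliberate: an agent that is cheap per run but rarely succeeds should not look cheap.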

4. Evaluate the (harness + model) pair — always. This is the least intuitive and most load-bearing rule. The scaffold around a model — prompt construction, tool set, output parsing, retries, context management — moves scores far more than the model does at the frontier. The same weights run through different frameworks span roughly 42% to 78% on public coding benchmarks, while swapping among the best frontier models moves under a point (Particula). A model score with no harness attached is not a measurement. The reasoning generalizes to any agent — it is the same lesson as evaluating an AI agent's tool use, where the scaffold, not the model, decides whether the right tool gets called.
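A small sketch of what keying results by the pair looks like in practice. The function names and the `(harness, model, resolved)` tuple format are assumptions; the harness and model labels in any real run would be your own.

```python
from collections import defaultdict


def resolve_rate_by_pair(results):
    """results: iterable of (harness, model, resolved) tuples, one per task run."""
    tally = defaultdict(lambda: [0, 0])  # (harness, model) -> [solved, total]
    for harness, model, resolved in results:
        entry = tally[(harness, model)]
        entry[0] += int(resolved)
        entry[1] += 1
    return {pair: solved / total for pair, (solved, total) in tally.items()}


def spread(rates, axis):
    """Largest gap in resolve rate when one axis varies and the other is fixed.

    axis=0 varies the harness for each model; axis=1 varies the model
    for each harness.
    """
    groups = defaultdict(list)
    for pair, rate in rates.items():
        groups[pair[1 - axis]].append(rate)
    return max(max(v) - min(v) for v in groups.values())
```

Comparing `spread(rates, axis=0)` with `spread(rates, axis=1)` on your own tasks shows whether the harness or the model is moving your numbers more.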

5. Watch for flakiness and nondeterminism. Run each task more than once. If a "pass" flips to "fail" across identical runs, your oracle has flaky tests or the agent is nondeterministic — either way your resolve rate has an error bar, and you should report it as one rather than pretend the point estimate is truth.
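A rough sketch of both halves of this step: flag tasks whose outcome flips across identical runs, and report the resolve rate with an interval. The helper names are hypothetical, and the normal-approximation interval is an assumption that is crude for small task sets; a bootstrap over tasks would be a reasonable alternative.

```python
from math import sqrt
from statistics import mean, stdev


def flaky_tasks(runs):
    """runs: {task_id: [bool, ...]} outcomes of identical repeated runs."""
    return sorted(t for t, outcomes in runs.items() if len(set(outcomes)) > 1)


def resolve_rate_with_interval(runs, z=1.96):
    """Mean per-task pass rate plus a rough interval across tasks.

    Returns (rate, low, high). With fewer than two tasks there is no
    spread to estimate, so the interval collapses to the point estimate.
    """
    per_task = [sum(outcomes) / len(outcomes) for outcomes in runs.values()]
    rate = mean(per_task)
    if len(per_task) < 2:
        return rate, rate, rate
    half_width = z * stdev(per_task) / sqrt(len(per_task))
    return rate, max(0.0, rate - half_width), min(1.0, rate + half_width)
```

A task that shows up in `flaky_tasks` is worth inspecting by hand before you trust either its passes or its failures.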

6. Keep the set fresh. Your private eval is contamination-resistant only until its answers stop being secret. The moment you paste failing cases into prompts, or a task's fix ages into the next training snapshot, it decays. Rotate in newly-closed issues each quarter and retire the stale ones. An eval is a perishable good.
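The rotation can be sketched as a small function run on whatever schedule you choose. Everything here is an assumption for illustration: the `rotate` helper, the `{task_id: merge date}` format, the 180-day age limit, and the idea of tracking leaked task ids by hand.

```python
from datetime import date, timedelta


def rotate(tasks, newly_closed, today, max_age_days=180, leaked_ids=frozenset()):
    """Retire stale or leaked tasks and admit fresh ones.

    tasks / newly_closed: {task_id: date the fix was merged}
    leaked_ids: tasks whose answers have appeared in prompts or elsewhere
    Returns (fresh_set, retired_ids).
    """
    cutoff = today - timedelta(days=max_age_days)

    def is_fresh(task_id, merged_on):
        return merged_on >= cutoff and task_id not in leaked_ids

    kept = {t: d for t, d in tasks.items() if is_fresh(t, d)}
    retired = sorted(set(tasks) - set(kept))
    added = {t: d for t, d in newly_closed.items() if is_fresh(t, d)}
    return {**kept, **added}, retired
```

Keeping the retired ids around is useful: scores on retired tasks should not be compared directly with scores on the refreshed set.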

What "good" looks like — and the pitfalls

Good is boring and specific: a stable pass@1 on your tasks, a regression rate near zero, a cost-per-solved-task you would sign off on, and a review time that is a fraction of the manual fix. Report all four together. A single headline number is how you got misled in the first place.

The pitfalls are predictable, so name them before they bite:

The frontier models are close enough now that the choice is rarely "which model" — it is which model in which harness, on your code, at what cost. For the current field and how the leaders actually stack up, the sibling piece on GPT-5.5 vs Claude Opus 4.8 vs Gemini for coding is the map; a private eval is the territory. The leaderboards were never going to answer your question. They were answering theirs.