Two benchmarks now anchor every "is this coding agent any good" argument, and teams keep reading them as a difficulty ladder: SWE-bench for the easy stuff, Terminal-Bench for the hard stuff. That's the wrong axis. They're not the same test at two difficulties. They measure two genuinely different skills, and the gap between them is the most useful thing either one tells you.

What each one actually puts in front of the agent

SWE-bench hands the agent a real GitHub issue and a real repository, and asks for a patch. The widely-cited Verified subset is 500 human-validated issues, and the grading is execution-based: the repo's own fail-to-pass tests either go green or they don't. Notice the two things the agent is given for free. First, a pre-specified success oracle — a known set of failing tests that defines exactly what "done" means. Second, a stationary, healthy environment: the repo compiles, the rest of the suite passes, and the agent's job is to edit code without knocking any of that over. It is bounded synthesis against a fixed target the agent never has to discover.
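The grading rule described above can be sketched in a few lines. This is a minimal illustration, not the benchmark's actual harness: the function name `is_resolved` and the shape of the `results` mapping are assumptions made for the example. It only captures the two conditions the paragraph names, that the previously failing tests now pass and that the rest of the suite was not knocked over.

```python
from typing import Dict, Iterable


def is_resolved(
    results: Dict[str, bool],
    fail_to_pass: Iterable[str],
    pass_to_pass: Iterable[str],
) -> bool:
    """Hypothetical sketch of an execution-based verdict.

    results maps a test id to True if that test passed after the patch.
    A test missing from results is treated as a failure.
    """
    # The oracle: every test that failed before the patch must pass now.
    fixed = all(results.get(test_id, False) for test_id in fail_to_pass)
    # The healthy environment: tests that already passed must still pass.
    unbroken = all(results.get(test_id, False) for test_id in pass_to_pass)
    return fixed and unbroken
```

Note that both test lists arrive as inputs. The agent never has to work out what they should contain, which is the "free gift" the paragraph is pointing at.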

Terminal-Bench (from Stanford researchers and the Laude Institute, with a 2.0/2.1 set of 89 hand-crafted, human-verified tasks, accepted at ICLR 2026) inverts both of those gifts. Each task drops the agent into a clean Docker container with a goal like "compile this, train that, configure this server, recover this corrupted dataset." The benchmark still grades pass@1 by running a hidden pytest suite at the end, and the agent must turn all of it green. But the agent isn't editing a stationary repo. It is operating a live machine it actively mutates (installing dependencies, launching processes, writing files), and the check arrives only after it's done.

The skill SWE-bench never tests

Strip both down and the difference is about what the agent has to supply itself.

On SWE-bench, the success criterion is provided. The agent can orient around a fixed, externally-defined target and edit toward it. On Terminal-Bench, the agent has to establish its own intermediate success criteria — did the build actually finish, is the server actually up, did that install actually take — and, crucially, detect and dig out of failures it caused: a hung process, an out-of-memory kill, a half-written config, a dependency that broke three steps back. There's no failing test blinking to tell it where it stands. It has to perceive the state of the world, decide whether it's winning, and recover when it isn't.
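To make "intermediate success criteria" concrete, here is a minimal sketch of the kind of checks an operator might run on itself between steps. Everything here is an assumption for illustration: the helper names `command_succeeded` and `port_is_open` are hypothetical, and nothing in Terminal-Bench or its reference agent is claimed to work this way. The sketch covers two of the questions from the paragraph: did that command actually finish cleanly, and is the server actually up.

```python
import socket
import subprocess
from typing import List


def command_succeeded(argv: List[str], timeout_s: float = 60.0) -> bool:
    """Run a command and report whether it exited cleanly.

    A hang (timeout) or a missing executable counts as failure, so a
    stuck process is not mistaken for one that is still making progress.
    """
    try:
        completed = subprocess.run(argv, capture_output=True, timeout=timeout_s)
    except (subprocess.TimeoutExpired, OSError):
        return False
    return completed.returncode == 0


def port_is_open(host: str, port: int, timeout_s: float = 1.0) -> bool:
    """Report whether something accepts TCP connections on host:port."""
    try:
        with socket.create_connection((host, port), timeout=timeout_s):
            return True
    except OSError:
        return False
```

An agent loop could call checks like these after each mutating step and only proceed when they return True. The design point is that the agent writes these checks itself; no one hands it the list.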

SWE-bench measures whether an agent can hit a target someone else painted. Terminal-Bench measures whether it can tell, on its own, that it's bleeding — and stop.

Those are different competencies, and a model can be strong at one and weak at the other. A great patch-synthesizer that has internalized millions of diffs against known tests is not automatically a great autonomous operator that has to define "working" for itself and survive its own mistakes. That's the whole reason to run both: the divergence between an agent's two scores is a direct read on how much of its apparent skill was leaning on a pre-specified oracle.

The environment becomes part of the measurement

Here's the second-order effect, and it's the part that should change how you read the leaderboard. The moment a benchmark scores an agent on operating an environment, the environment stops being a neutral backdrop and becomes part of the instrument. You are no longer measuring only the model.

It shows up immediately in how unstable the numbers are. The same model posts very different Terminal-Bench scores depending on the harness — the scaffold wrapping it. Terminal-Bench ships a deliberately minimal reference agent, Terminus, that gives the model one tool (a terminal) precisely so models can be compared without scaffold advantage. Run a frontier model as a polished, named CLI agent and it can land in the low-to-mid 80s; run that same model through the neutral Terminus harness and it can fall into the 70s. Same weights, very different score — because a chunk of what you were measuring was the agent software, not the model.

And it goes below the harness, all the way to the hardware. Anthropic's engineering team documented that on Terminal-Bench 2.0, the spread between tightly resource-capped and uncapped runs was about 6 percentage points, wider than the gap between the top models on the board, with infrastructure failures like OOM kills and pod crashes silently eating a few percent of tasks outright. Let that land: the benchmark's own noise floor can exceed the differences it's being used to adjudicate. A leaderboard cell that reads 84.0 vs 82.5 may be reporting the test rig, not the contestants.
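The arithmetic behind that last comparison is simple enough to write down. The sketch below is a deliberate simplification: it treats the roughly 6-point infrastructure spread quoted above as a single noise threshold, which is an assumption for illustration rather than a statistical test, and the function name `gap_exceeds_noise` is hypothetical.

```python
def gap_exceeds_noise(score_a: float, score_b: float, noise_floor_pp: float) -> bool:
    """Return True only if two scores differ by more than the noise floor.

    All values are in percentage points. The noise floor is whatever
    spread the same model shows when only the rig changes.
    """
    return abs(score_a - score_b) > noise_floor_pp


# The leaderboard cell from the text: a 1.5-point gap against a ~6-point spread.
print(gap_exceeds_noise(84.0, 82.5, noise_floor_pp=6.0))  # False
```

A proper comparison would use repeated runs and confidence intervals, but even this crude check is enough to show that a 1.5-point lead sits well inside the spread.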

How to actually read it

None of this makes Terminal-Bench a bad benchmark. It makes it an honest one about a hard thing — and the instability is the signal, not a flaw to wish away. As of the June 2026 board, the top of Terminal-Bench 2.1 is a tight pack of frontier agents trading the lead in the high 80s, with independent reruns landing several points lower under different harnesses. The useful way to consume that is not to crown the top row.

If you're choosing a coding agent, treat the two benchmarks as the two halves of the job they actually are: SWE-bench Verified tells you about patch competence against a known target, and like the other agent benchmarks it's saturating and contamination-prone, so read it with suspicion. Terminal-Bench tells you about operating a live system end-to-end — but only if you pin the harness and the resource budget, because otherwise you're benchmarking the rig as much as the model. And whichever number you quote, quote the harness and the date alongside it. On a benchmark where the environment is part of the measurement, a score without its conditions isn't a result. It's a screenshot.
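One way to enforce that habit is to make the conditions part of the record itself. The following is a minimal sketch under stated assumptions: the class `ReportedScore`, its field names, and every value in the example are made up for illustration and are not real results.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass(frozen=True)
class ReportedScore:
    """Hypothetical record that keeps a score together with its conditions."""

    benchmark: str
    model: str
    score_pct: float
    harness: Optional[str] = None          # the scaffold wrapping the model
    run_date: Optional[date] = None        # when the run happened
    resource_limits: Optional[str] = None  # e.g. CPU and memory caps

    def is_quotable(self) -> bool:
        """A score is quotable only when all of its conditions are recorded."""
        conditions = (self.harness, self.run_date, self.resource_limits)
        return all(value is not None for value in conditions)


# Placeholder values only, not real leaderboard data.
bare = ReportedScore("Terminal-Bench 2.0", "example-model", 80.0)
pinned = ReportedScore(
    "Terminal-Bench 2.0",
    "example-model",
    80.0,
    harness="Terminus",
    run_date=date(2026, 6, 1),
    resource_limits="hypothetical: 4 CPU, 8 GiB",
)
print(bare.is_quotable(), pinned.is_quotable())  # False True
```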