For a year the frontier of AI coding was a thing that sat next to you. Autocomplete, then chat, then an "agent mode" in your editor that you watched work, hand on the approve button. The interesting shift of 2025 was that the agent got up and left the room. You give it a task, it goes off to its own machine, and it comes back — minutes or an hour later — with a pull request. That is a background coding agent, and between mid-2025 and 2026 every major player shipped one.
The names you're choosing between are Cognition's Devin, OpenAI's Codex, Cursor's background agents, Google's Jules, and GitHub's Copilot coding agent — with Anthropic's Claude Code on the web and the open-source OpenHands rounding out the field. The temptation is to rank them by coding skill. Resist it. The more useful observation is how little they actually differ where it counts.
They have all converged on the same machine
Read the architecture pages side by side and they blur together. Each agent boots a fresh, isolated cloud VM, clones your repo at a chosen branch, installs dependencies, edits files, runs the test suite, iterates on failures, and opens a pull request. Codex runs the loop in an OpenAI-managed container and reads your AGENTS.md for the lint and test commands; Devin calls its VM a "Devbox"; GitHub's agent runs on Actions infrastructure; Claude Code on the web routes git through a proxy so "your token never enters the container." Different logos, same loop.
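The shared loop is simple enough to sketch in a few lines. The Python below is a minimal illustration under stated assumptions, not any vendor's implementation: `run_background_task`, `propose_edit`, `run_tests`, and `max_attempts` are hypothetical names, and the sandbox, clone, and install steps are reduced to log entries.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple


@dataclass
class TaskResult:
    opened_pr: bool
    attempts: int
    log: List[str] = field(default_factory=list)


def run_background_task(
    propose_edit: Callable[[str], None],       # hypothetical: model edits files, given test feedback
    run_tests: Callable[[], Tuple[bool, str]],  # hypothetical: returns (passed, failure summary)
    max_attempts: int = 3,                      # hypothetical retry budget
) -> TaskResult:
    # Setup steps are only logged here; a real harness would do the work.
    log = ["boot sandbox", "clone repo", "install dependencies"]
    feedback = ""
    for attempt in range(1, max_attempts + 1):
        propose_edit(feedback)
        log.append(f"edit attempt {attempt}")
        passed, feedback = run_tests()
        log.append("tests passed" if passed else "tests failed")
        if passed:
            # The hand-off is a pull request, never a direct merge.
            log.append("open pull request")
            return TaskResult(True, attempt, log)
    log.append("report failure")
    return TaskResult(False, max_attempts, log)
```

Everything that distinguishes the products sits outside the `for` loop: where the sandbox runs, what network it can reach, and how the result is handed back.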
The genuine differences live at the edges of that loop. Network posture: Codex gives the setup phase internet access and then runs the agent offline by default; GitHub's restricts egress and scopes repo permissions tightly. Where the VM lives: Cursor will now run agents self-hosted in your own VPC, which is a real enterprise edge nobody else quite matches. And how you invoke it, which turns out to be the most honest expression of each product's worldview.
The coding ability is the swappable frontier model. The product is the harness around it — where it runs, what it can touch, and how it hands the work back.
Invocation is the personality
Devin and Cursor live in Slack: you @-mention the agent in a thread, it reads the conversation, and a PR appears. Codex lives in ChatGPT and a Rust-based open-source CLI. Jules lives on the web and wires into GitHub, with a CLI and API added later. GitHub's agent has the most opinionated entry point of all: you assign it a GitHub Issue, exactly as you'd assign a junior engineer, and it inherits the Issue, the branch protections, the Actions checks, and the audit log automatically.
All of them run multiple tasks in parallel. This is the "manage a fleet of agents" pattern that the whole category is converging on, and the one Devin productized most directly with MultiDevin, a manager agent that fans a backlog out to as many as ten workers and merges the results. The role being described isn't "programmer" anymore. It's a supervisor of programmers.
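The fan-out shape itself is ordinary concurrency. As a rough sketch only (this is not how MultiDevin or any other product is built), the Python below spreads a backlog across a capped pool of workers and collects the results. `fan_out` and `worker` are hypothetical names, and the default cap of ten simply mirrors the figure quoted above.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, List


def fan_out(
    backlog: List[str],
    worker: Callable[[str], str],  # hypothetical: runs one task, returns a PR reference
    max_workers: int = 10,         # assumption: cap taken from the "as many as ten" figure
) -> Dict[str, str]:
    # Run up to max_workers tasks at once; map() preserves backlog order.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        results = list(pool.map(worker, backlog))
    # "Merging" here only means collecting one result per task.
    return dict(zip(backlog, results))
```

The code that dispatches is trivial. The supervision it implies, deciding which of the returned results to trust, is the part that does not parallelize so easily.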
Why the SWE-bench number won't decide this for you
Here is the part that should change how you shop. You will reach for SWE-bench Verified to break the tie, and you will mostly find that the number doesn't exist for the thing you're buying. Cursor's background agents, GitHub's coding agent, and the current Jules publish no official Verified score for the agent itself, because the agent is a harness and the score belongs to whatever frontier model you plug into it. The figures in circulation are a minefield: codex-1's headline 72.1% is a model score (pass@1, with 23 unrunnable problems excluded; a third-party retest landed nearer 69%), Devin 2.0's 45.8% is self-reported, and GitHub's widely cited 56% is for its in-IDE agent mode, a different product. Even the honest numbers are hard to compare, because some are pass@1 and others are best-of-N.
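Two bits of arithmetic show how much these footnotes can move a headline. The sketch below is illustrative only. It assumes a 500-problem benchmark (the size of SWE-bench Verified, which the text above does not state), and for best-of-N it assumes independent attempts plus a perfect selector that always picks a passing attempt when one exists. Both function names are hypothetical.

```python
def rescore_with_exclusions(reported_rate: float, total: int, excluded: int) -> float:
    """Re-express a score earned on (total - excluded) problems as a share of all
    problems, counting every excluded problem as a failure."""
    solved = reported_rate * (total - excluded)
    return solved / total


def best_of_n(pass_at_1: float, n: int) -> float:
    """Chance that at least one of n independent attempts passes.
    Assumes a perfect selector, which real harnesses do not have."""
    return 1 - (1 - pass_at_1) ** n
```

With the figures quoted above, `rescore_with_exclusions(0.721, 500, 23)` comes out near 0.688, and a model with a 50% pass@1 would score `best_of_n(0.5, 4)`, or 0.9375, under an idealized best-of-4. This is arithmetic only and says nothing about how any particular retest was run.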
So the leaderboard tells you about the model, which you can often change, and almost nothing about the harness, which you can't. If you want to understand what these agents are actually good and bad at, the benchmark literacy that matters is knowing which surface a score came from.
The bottleneck moved, and that's the real decision
The deepest reason to ignore the coding-skill arms race is that coding skill stopped being the binding constraint. Andrej Karpathy's framing is the load-bearing one: an agent can generate an enormous change in seconds, but a human still has to verify it, so the limiting factor becomes the speed of the generation–verification loop. Point five parallel agents at your backlog and you don't have a coding problem anymore. You have a review problem — a queue of plausible PRs arriving faster than anyone can trust them.
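The queue builds up in a way that is easy to put numbers on. The sketch below is a deliberately crude model with made-up rates: it assumes agents and reviewers both work at a steady pace, and `review_backlog` and its parameters are hypothetical names chosen for illustration.

```python
def review_backlog(
    hours: float,
    agents: int,
    prs_per_agent_hour: float,  # hypothetical steady output per agent
    reviews_per_hour: float,    # hypothetical steady team review capacity
) -> float:
    """Unreviewed PRs left after `hours`, assuming constant rates."""
    produced = agents * prs_per_agent_hour * hours
    # Reviewers cannot review more PRs than were produced.
    reviewed = min(produced, reviews_per_hour * hours)
    return produced - reviewed
```

Under these invented rates, five agents producing one PR an hour each against a team that can review two an hour leave `review_backlog(8, 5, 1.0, 2.0)` at 24 unreviewed PRs by the end of an eight-hour day. Adding agents raises the first number and does nothing for the second.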
That reframes the whole comparison. The best background agent for you is the one whose output your team can actually verify and keep on a leash: small reviewable diffs, legible test evidence, and governance you already trust. It's why GitHub's "it's just an Issue and a PR with branch protections" pitch is stronger than any benchmark number suggests, why Codex's habit of surfacing terminal logs and citations for each step matters, and why nothing reputable auto-merges to main.
The cautionary tale stays relevant: Answer.AI's month with Devin in early 2025 saw it complete just three of around twenty real tasks. The agents are far better now, but the lesson holds — the demo is cheap and the verification is expensive. Choose the agent that makes the expensive part faster. The one that writes the most confident wrong code fastest is not a bargain.



