An agent tops SWE-bench at 88% and the number travels. It lands in the launch post, the comparison table, the procurement deck. What does not travel with it is the price — whether that 88% cost fifty cents a task or fifty dollars. On the leaderboard those two agents occupy the same row. In production they are not the same product; they are not even the same order of magnitude.

This is the quiet defect in almost every agent leaderboard, and it is worth saying plainly because it sounds like a footnote and is actually the whole game: accuracy is not a free axis. It is bought.

Count the reliable ways to move an agent up a benchmark. Sample the model several times and keep the best answer — best-of-n and self-consistency trade tokens for points. Turn up the reasoning budget so the model thinks longer. Add a second agent to critique the first. Retry on failure. Every one of these spends compute to raise the score. So a board that ranks on accuracy alone is not ranking intelligence. It is ranking willingness to spend.
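A little arithmetic makes the trade concrete. The sketch below assumes every attempt succeeds independently with the same probability, and that a perfect selector keeps a correct attempt whenever one exists. Both are simplifying assumptions, so the result is an upper bound. The helper `best_of_n` and the sample numbers are hypothetical and not drawn from any benchmark.

```python
def best_of_n(p_single: float, cost_single: float, n: int) -> tuple[float, float]:
    """Upper-bound success rate and total cost for n independent attempts.

    Assumes attempts are independent and an oracle keeps any correct one.
    """
    if n < 1:
        raise ValueError("n must be at least 1")
    success = 1.0 - (1.0 - p_single) ** n
    return success, cost_single * n


if __name__ == "__main__":
    # Hypothetical agent: 50% single-shot success at $0.50 per attempt.
    for n in (1, 2, 4, 8):
        accuracy, cost = best_of_n(0.5, 0.50, n)
        print(f"n={n}: accuracy <= {accuracy:.3f}, cost = ${cost:.2f}")
```

Under these assumptions the score climbs with every extra sample while the bill grows linearly, which is the pattern an accuracy-only column rewards.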

Princeton's AI Agents That Matter made this concrete two years ago, and it has only become more relevant. The authors showed that embarrassingly simple baselines (just calling the underlying model a few more times) offer Pareto improvements over elaborate agent architectures on HumanEval, matching or beating them while costing far less. Their conclusion is the one most leaderboards still haven't absorbed: useful agent evaluation must control for cost. That holds even if you don't care about cost and only want to find genuinely better designs, because otherwise you cannot tell a real advance from someone spending more money.

An accuracy-only leaderboard doesn't measure how good your agent is. It measures how much you were willing to pay to look good.

The board reshuffles when you draw the second axis

What happens if you actually plot it? The Holistic Agent Leaderboard (HAL) did exactly that — 21,730 agent rollouts across nine models and nine benchmarks, spanning coding, web navigation, science, and customer service, for about $40,000 in compute. Every run logged accuracy and cost, in dollars and tokens, and the results were charted as accuracy-versus-cost Pareto frontiers.

The headline finding is that the most expensive models are rarely on the frontier. DeepSeek R1 reached it on 0 of 9 benchmarks; a high-reasoning Claude configuration landed there on just 1 of 9, while the unglamorous Gemini 2.0 Flash sat on the frontier in 7 of 9. HAL found a 9× cost difference accompanying a mere two-percentage-point accuracy gap — you could pay nine times as much to be two points better, or read it the other way and call the cheaper agent a steal. And the genuinely uncomfortable result: cranking up reasoning effort reduced accuracy in the majority of runs. Spending more did not even reliably buy the thing it was supposed to buy.

That is the part an accuracy column physically cannot show you: an agent can be strictly dominated — more expensive and less accurate than an alternative — and still print a respectable number that looks fine in isolation.
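Strict dominance is easy to check once both axes are recorded. The following is a minimal sketch, not HAL's actual tooling: the `Run` type, the function names, and every number in `RUNS` are made up for illustration.

```python
from typing import NamedTuple


class Run(NamedTuple):
    name: str
    accuracy: float  # fraction of tasks passed, 0..1
    cost: float      # mean dollars per task


def dominates(a: Run, b: Run) -> bool:
    """True if a is at least as good as b on both axes and better on one."""
    no_worse = a.accuracy >= b.accuracy and a.cost <= b.cost
    strictly_better = a.accuracy > b.accuracy or a.cost < b.cost
    return no_worse and strictly_better


def pareto_frontier(runs: list[Run]) -> list[Run]:
    """Keep the runs that no other run dominates, sorted by cost."""
    frontier = [r for r in runs if not any(dominates(o, r) for o in runs)]
    return sorted(frontier, key=lambda r: r.cost)


# Made-up numbers for illustration only.
RUNS = [
    Run("agent-a", 0.70, 0.40),
    Run("agent-b", 0.72, 3.60),
    Run("agent-c", 0.68, 5.00),  # costlier and less accurate than a and b
    Run("agent-d", 0.55, 0.10),
]

if __name__ == "__main__":
    for run in pareto_frontier(RUNS):
        print(run)
```

In this toy data, agent-c still posts 68%, a number that looks respectable alone, yet it never reaches the frontier because agent-a beats it on both axes.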

Why the benchmarks let it hide

The mechanism is dull, which is why it persists. Most agent benchmarks score binary pass/fail and record nothing about price. An 88% on SWE-bench reached at $50 of inference per task is written down identically to an 88% reached at $0.50. The benchmark is, by construction, blind to a hundredfold difference in operating cost.

The CLEAR framework, introduced in Beyond Accuracy, scores agents on Cost, Latency, Efficacy, Assurance, and Reliability, and its authors put numbers on the blindness: across twelve major benchmarks, cost goes essentially unmeasured despite 50× variation between approaches that land at similar accuracy. In their evaluation across 300 enterprise tasks, the agents with the highest raw accuracy cost 4.4 to 10.8× more than the Pareto-efficient alternatives that did nearly as well. Their proposed fix, cost-normalized accuracy (CNA), is just accuracy measured against dollars per task, the axis the leaderboard left off. The blunt field version teams already use is even simpler: divide pass-rate by dollars-per-task and rank on that.
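The blunt field version fits in a few lines. This sketch implements only the plain ratio of pass-rate to dollars-per-task; it is not the CLEAR paper's exact CNA definition. The function names are hypothetical, and the two agents reuse the 88% at $50 versus $0.50 example from earlier.

```python
def pass_rate_per_dollar(pass_rate: float, dollars_per_task: float) -> float:
    """Blunt value metric: pass-rate divided by dollars-per-task."""
    if dollars_per_task <= 0:
        raise ValueError("dollars_per_task must be positive")
    return pass_rate / dollars_per_task


def rank_by_value(agents: dict[str, tuple[float, float]]) -> list[tuple[str, float]]:
    """Sort agents from best to worst pass-rate per dollar."""
    scored = [
        (name, pass_rate_per_dollar(rate, cost))
        for name, (rate, cost) in agents.items()
    ]
    return sorted(scored, key=lambda item: item[1], reverse=True)


# name -> (pass rate, dollars per task)
AGENTS = {
    "agent-at-50-dollars": (0.88, 50.00),
    "agent-at-50-cents": (0.88, 0.50),
}

if __name__ == "__main__":
    for name, score in rank_by_value(AGENTS):
        print(f"{name}: {score:.4f} passes per dollar")
```

One caveat on the design: a bare ratio can flatter a very cheap agent that rarely succeeds, so it probably works best alongside an accuracy floor or a budget cap rather than as the only number.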

Optimize the metric, ship the agent

Here is the trap that makes this more than an accounting quibble. The metric you optimize against is the agent you eventually ship. Tune for uncapped accuracy and the optimization will happily hand you the $50-per-task agent, because the benchmark told it that money is free. You will have trained your own process to burn cash, then deployed the result.

The fix is not another number on the scoreboard. It is a constraint. Stop asking "which agent is the most accurate?" and ask "which agent is the most accurate under $X per task?" Fix the budget first — the one your unit economics actually allow — and only then read the board. A cost cap turns a marketing leaderboard back into an engineering decision, because now the ranking answers a question you will have to live with.
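As a selection rule, "most accurate under $X per task" might look like the sketch below. The function `best_under_budget` and the entries in `CANDIDATES` are hypothetical. Breaking ties toward the cheaper agent is a design choice made here, not something the argument requires.

```python
from typing import Optional


def best_under_budget(
    agents: dict[str, tuple[float, float]], budget: float
) -> Optional[str]:
    """Return the most accurate agent whose cost per task fits the budget."""
    affordable = {name: v for name, v in agents.items() if v[1] <= budget}
    if not affordable:
        return None
    # Highest accuracy wins; ties go to the cheaper agent.
    return max(
        affordable,
        key=lambda name: (affordable[name][0], -affordable[name][1]),
    )


# name -> (accuracy, dollars per task); made-up numbers.
CANDIDATES = {
    "agent-a": (0.70, 0.40),
    "agent-b": (0.72, 3.60),
    "agent-c": (0.88, 50.00),
}

if __name__ == "__main__":
    for budget in (0.05, 1.00, 5.00, 100.00):
        print(f"budget ${budget:.2f}: {best_under_budget(CANDIDATES, budget)}")
```

The top of the unconstrained list (agent-c) only wins once the budget reaches its price; at every smaller budget the answer changes, and at a small enough one there is no answer at all.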

Doing it yourself is nearly free, which removes the last excuse. Your eval harness already logs token counts; multiply by provider prices to get dollars per run, store that next to each pass/fail, and plot the two. Then choose from the frontier inside your budget rather than from the top of a list. Mind one inflation trap on the way: pass@k quietly spends k attempts to report a single success, padding both the accuracy you celebrate and the bill you don't — which is its own argument for measuring cost in the first place.
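A minimal sketch of that bookkeeping follows, assuming the harness records input and output token counts per attempt. The model name, the prices in `PRICES`, and the `Attempt` record are placeholders; substitute your provider's real rates and your own log schema.

```python
from dataclasses import dataclass

# Hypothetical prices in dollars per million tokens.
PRICES = {"example-model": {"input": 1.00, "output": 4.00}}


@dataclass
class Attempt:
    model: str
    input_tokens: int
    output_tokens: int
    passed: bool


def attempt_cost(attempt: Attempt) -> float:
    """Dollars for one attempt: token counts times per-token prices."""
    price = PRICES[attempt.model]
    total = (
        attempt.input_tokens * price["input"]
        + attempt.output_tokens * price["output"]
    )
    return total / 1_000_000


def summarize(tasks: list[list[Attempt]]) -> tuple[float, float]:
    """Return (pass rate, mean dollars per task).

    A task counts as passed if any attempt passed, but it is billed for
    every attempt, so a pass@k score pays for all k tries.
    """
    if not tasks:
        raise ValueError("no tasks to summarize")
    passed = sum(any(a.passed for a in attempts) for attempts in tasks)
    total_cost = sum(attempt_cost(a) for attempts in tasks for a in attempts)
    return passed / len(tasks), total_cost / len(tasks)
```

Each (pass rate, dollars per task) pair is one point on the accuracy-versus-cost plot. Because failed attempts are billed too, retries and pass@k show up in the cost axis instead of hiding behind the score.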

None of this makes accuracy unimportant. It makes accuracy incomplete. A leaderboard rank with no price attached isn't an answer; it's the first half of a question — number one at what cost? — and the half they left off is usually the half that decides whether you can afford to run it at all.