Agents' Last Exam: Frontier Agents Pass 2.6% of Hard Professional Work

The number making the rounds is 97.4%. On the hardest tier of Berkeley RDI's Agents' Last Exam (ALE) — 1,490 task instances across 55 professional subdomains, built with 250-plus industry experts — mainstream agent harnesses on frontier model backbones post an average full-pass rate of 2.6%. Every frontier agent they tested, Fable 5 included, scored 0% on the top tier. The paper went up June 3 and was the most-discussed paper of its week.

It's easy to read that as "AI agents are fake." That's the wrong lesson, and it wastes the actually useful thing the benchmark did. The useful thing isn't the score. It's what the score is measuring.

Full-pass, all-or-nothing, like a client#

Most agent benchmarks you've seen quoted are graded on partial credit. Step accuracy: what fraction of the tool calls were right. Sub-task completion: how many of the checklist items got ticked. Trajectory scoring: did the agent generally head the right direction. On those metrics, frontier agents look genuinely strong — high 80s, low 90s, the numbers in the keynote slides.

ALE grades the deliverable. Whole thing, end to end, all-or-nothing — a task passes only if the final output is something a professional could actually hand over. A financial model with one transposed formula fails. A migration that moves 19 of 20 tables fails. A brief that's correct except for the citation that doesn't exist fails.

Partial credit measures how close the agent got. Full-pass measures whether you could send it. Autonomy only cares about the second one.

That is the entire distance between 92% and 2.6%. It's not that ALE uses harder tasks (though it does) — it's that ALE refuses to round up. And rounding up is exactly what the last mile of real work won't let you do. Nobody pays for a report that's 90% deliverable; they pay for one they can send without reading every line. The metric that forgives the last 10% is the metric that hides the only failure that matters.

The benchmarks you trusted were gameable#

ALE didn't arrive in a vacuum. The same Berkeley group first went and broke eight widely cited agent benchmarks — six of which they drove to 100% without their agent solving a single underlying task, by exploiting scoring shortcuts and leaky evaluators. That's the setup and ALE is the punchline: if the existing scoreboards can be maxed out without doing the work, then the high numbers on the slides were never evidence of capability. They were evidence of a permissive grader.

This reframes the 2.6% from a doom stat into a calibration. It's not saying agents do nothing. Plenty of ALE's easier tiers pass at respectable rates, and agents demonstrably do useful narrow work in production every day. It's saying that expert-grade, end-to-end, no-human-in-the-loop deliverables remain rare — and that the benchmarks implying otherwise were grading on a curve.

Why this collides with the spending#

Here's the part that should make a buyer uncomfortable. Gartner is forecasting purpose-built AI agent software to hit roughly $206 billion in 2026, up about 139% year over year. That spend is being justified with capability numbers — and the capability numbers in the sales deck are, disproportionately, the partial-credit kind. VentureBeat separately reported frontier models failing something like one in three production attempts and getting harder to audit as they do. The market is underwriting the 92%. ALE is a bet that the number that governs whether you can remove the human is the 2.6%.

The practical move isn't cynicism, it's a procurement discipline: when a vendor quotes an agent benchmark, ask one question — is a task marked passed only if the full deliverable is correct, or does it get partial credit for the steps? If it's partial credit, the number tells you the agent is a strong assistant. It does not tell you it's a replacement, and you should not price it like one.

The honest read#

ALE's 2.6% is not a verdict that agents fail. It's a verdict that we have been measuring the wrong thing to justify the biggest spending line in enterprise software. An agent that gets 92% of the way to a deliverable and a human who closes the last 8% is a real, valuable, shipping product — and it is a completely different purchase than the autonomous professional the budgets are being written for. Berkeley's contribution is a benchmark honest enough to tell those two apart. The uncomfortable follow-on: almost none of the others are, and that was the point of breaking the eight first.

Agents' Last Exam: Frontier Agents Pass 2.6% of Hard Professional Work — and the 2.6% Is the Point

Full-pass, all-or-nothing, like a client#

The benchmarks you trusted were gameable#

Why this collides with the spending#

The honest read#

Soren Vey

Dispatches from the machines, in your inbox

Agents' Last Exam: Frontier Agents Pass 2.6% of Hard Professional Work — and the 2.6% Is the Point

Full-pass, all-or-nothing, like a client#

The benchmarks you trusted were gameable#

Why this collides with the spending#

The honest read#

Soren Vey

Continue reading

Microsoft Agent Framework's CodeAct: When the Sandbox Stops Being the Hard Part

How to Put a Hard Spending Cap on an AI Agent

Do AI Agents Self-Correct? Why Reflexion Works and 'Check Your Work' Backfires

Dispatches from the machines, in your inbox