The Coin-Flip Horizon

There is a number that moves through the discourse every few months, and it is always rising. Frontier models can now complete tasks that take a human an hour. Now two hours. Now most of a workday. It is cited as a milestone, an extrapolation line, sometimes a warning. It is a real number, measured carefully by serious people. And almost no one repeating it mentions the part that matters: it is measured at a 50% success rate.

Fifty percent. A coin flip. The headline horizon is the length of task the model gets right half the time.

The advertised number is the model's best day. Production runs on its average day, and the two are not close.

What the curve actually says

The cleanest measurements come from METR, which times humans on a battery of software tasks and then asks how long a task can be before a model's success rate falls below a threshold. Plotted over the last six years, the 50% horizon doubles roughly every seven months — the exponential that launched a thousand forecasts.

But METR measured a second curve, and this is the one to keep. Hold the model to an 80% success rate instead of 50%, and the task length it can handle drops by a factor of four to six. For one frontier model, the published figures are stark: about 59 minutes of work at 50% reliability, about 15 minutes at 80%. Same model, same week. The only thing that changed was how often you required it to be right.

The two curves climb at nearly the same rate — the 80% horizon doubles about every 204 days, the 50% horizon about every 207 — which sounds like good news until you sit with it. Parallel lines don't converge. The ratio between "can sometimes do" and "can reliably do" is not closing. It is a roughly constant multiple, riding up the chart in lockstep. METR's own framing: what a model can do at 50% today, it will do at 80% in about fourteen months. Reliability is not a property the next checkpoint grants you. It is a tax of roughly two doubling-times, paid on every capability, forever.
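The "two doubling-times" figure can be checked on the back of an envelope. The sketch below is illustrative only: the function names are invented here, and it assumes the seven-month doubling time and the 59-minute and 15-minute horizons quoted above.

```python
import math

def lag_in_doublings(horizon_50: float, horizon_80: float) -> float:
    """Number of doublings separating the 50% horizon from the 80% horizon."""
    return math.log2(horizon_50 / horizon_80)

def lag_in_months(horizon_50: float, horizon_80: float,
                  doubling_months: float = 7.0) -> float:
    """Calendar time for the 80% horizon to reach today's 50% horizon,
    assuming both curves keep doubling at the same (assumed) rate."""
    return lag_in_doublings(horizon_50, horizon_80) * doubling_months

if __name__ == "__main__":
    # Figures quoted in the text: 59 minutes at 50%, 15 minutes at 80%.
    print(f"doublings: {lag_in_doublings(59, 15):.2f}")
    print(f"months:    {lag_in_months(59, 15):.1f}")
```

A gap of four times is exactly two doublings, or fourteen months at this rate; a gap of six times is closer to two and a half doublings, or about eighteen months.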

Why length is the enemy

There is a mechanism under this, and it is unglamorous. A recent analysis models agent success as something like radioactive decay: a roughly constant hazard of failure per unit of task length, so the probability of finishing intact falls exponentially as the task gets longer. Tasks don't get hard at the end. They accumulate exposure. Every additional step is another chance for the agent to misread state, hallucinate a file path, or confidently take the wrong branch, and on a long enough chain, something will.
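The decay picture fits in a few lines. This is a minimal sketch of a constant-hazard model, not the cited analysis's own code; the function names are hypothetical, and the only input is a 50% horizon such as the 59 minutes quoted earlier.

```python
import math

def success_probability(task_minutes: float, horizon_50: float) -> float:
    """Constant hazard: the chance of finishing intact halves with every
    additional `horizon_50` minutes of task length."""
    return 0.5 ** (task_minutes / horizon_50)

def horizon_at(reliability: float, horizon_50: float) -> float:
    """Longest task (in minutes) that still succeeds with probability
    `reliability`, under the same constant-hazard assumption."""
    return horizon_50 * math.log(reliability) / math.log(0.5)

if __name__ == "__main__":
    h50 = 59.0  # minutes, the 50% figure quoted in the text
    for p in (0.5, 0.8, 0.9, 0.99):
        print(f"{p:.0%} reliable up to {horizon_at(p, h50):.1f} min")
```

Under this idealized model the 80% horizon for a 59-minute model comes out near 19 minutes, somewhat more generous than the measured figure of about 15, so treat it as an approximation rather than a fit. The shape is the point: each extra increment of required reliability costs a disproportionate amount of task length.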

You can do the arithmetic on a napkin. Suppose an agent is 85% reliable on a single step — a number most teams would kill for. Over a ten-step workflow, end-to-end success is 0.85¹⁰ ≈ 0.20. Chain three agents that are each 70% reliable and you get 0.70³ ≈ 0.34. The per-step numbers look like a working product. The product of the per-step numbers looks like a coin you'd be a fool to flip. This is why the demo dazzles and the deployment disappoints: the demo is one step, and one step is where models are strongest.
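The napkin version, for anyone who wants to plug in their own numbers. It is a sketch that assumes steps fail independently, which real workflows only approximate; the helper names are made up for illustration.

```python
import math

def chain_success(step_reliabilities: list[float]) -> float:
    """End-to-end success when every step must succeed and steps fail
    independently (an assumption, not a measured property)."""
    return math.prod(step_reliabilities)

def required_step_reliability(target: float, steps: int) -> float:
    """Per-step reliability needed for `steps` equal steps to reach
    `target` end-to-end success."""
    return target ** (1 / steps)

if __name__ == "__main__":
    print(f"ten steps at 85%:    {chain_success([0.85] * 10):.2f}")
    print(f"three agents at 70%: {chain_success([0.70] * 3):.2f}")
    print(f"per-step rate for 80% over ten steps: "
          f"{required_step_reliability(0.80, 10):.3f}")
```

Run in reverse, the same arithmetic says a ten-step workflow needs each step to succeed about 98% of the time just to finish four runs out of five.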

What this does to the roadmap

None of this says the systems don't work. They work — at the reliability that the work tolerates. The useful question is never "how long a task can the model do?" It is "how long a task can it do at the reliability my process can absorb?" — and the honest answer is usually a small fraction of the number in the headline.

It also reframes what "progress" buys you. At a seven-month doubling time, a year of frontier gains roughly triples the 50% horizon. It does not close the gap to 80%; it drags the whole staircase up while keeping the steps the same height. So if your application needs real reliability, and most things that touch money, code in production, or a customer do, the model's capability curve is not the line you should be watching. You should be watching the 80% curve at a minimum, four to six times lower and about fourteen months behind, climbing at the same speed and no faster.

The industry quotes the 50% number because it is the flattering one, the one that makes the line go up fastest. Fair enough; everyone reads off their best day. Just remember, when you see the next "AI can now do an N-hour task," to silently append the clause the slide left off: half the time. The other half is where the work is.

Priya Sundaram
