A team I talked to last quarter shipped an agent that, by every per-call metric, was healthy. No request ever timed out. No call errored. And yet a fraction of runs took fourteen minutes to return an answer a user expected in thirty seconds — long enough that the user had closed the tab, long enough that a second, retried run was already underway against the same half-finished work. When they pulled the traces, nothing was broken. The agent had simply taken eleven steps, each a perfectly reasonable call comfortably under the SDK's timeout, and the steps had added up to a number no one had ever set a limit on.

This is the gap. You set a timeout on the model call because the SDK made it easy, and you assumed you had bounded the agent. You hadn't. You bounded one link in a chain whose length the model decides at runtime.

A timeout bounds a call; an agent is a loop

The default you inherit is per request. The OpenAI Python SDK times out a single call after ten minutes unless you say otherwise; the Anthropic SDK does the same, and its TypeScript client will, for a large non-streaming max_tokens, compute the timeout dynamically and stretch it toward a full hour. These are sane numbers for one request. They are nonsense as a bound on an agent, because an agent is not one request. It is a loop that plans, calls a tool, reads the result, calls the model again, and decides — itself — how many more times to go around.

Six steps that finish in two minutes apiece all pass a ten-minute check, and the run still takes twelve minutes. The per-call timeout cannot see the loop. And the loop is exactly where the latency lives: an agent's wall-clock time is a serial chain of round-trips, not the speed of any single token. Worse, the one control you did add can remove the ceiling altogether. A naive retry resets the clock, because every retry gets a fresh full timeout, so a couple of stuck steps that each retry twice can run, in practice, with no meaningful ceiling at all.
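The arithmetic is worth writing down once. This is a rough sketch: the function names are hypothetical, and the eleven-step, two-retry scenario is an illustrative assumption, not a figure from any SDK.

```python
def passes_per_call_check(step_seconds: float, per_call_timeout: float) -> bool:
    """The only question a per-call timeout ever asks, one step at a time."""
    return step_seconds < per_call_timeout


def run_seconds(steps: int, step_seconds: float) -> float:
    """Wall-clock time of a serial loop: the steps simply add up."""
    return steps * step_seconds


def worst_case_seconds(steps: int, per_call_timeout: float, retries_per_step: int) -> float:
    """Upper bound when every attempt runs to its timeout and each retry restarts the clock."""
    attempts_per_step = 1 + retries_per_step
    return steps * attempts_per_step * per_call_timeout


TEN_MINUTES = 600.0

# Six two-minute steps: every call passes its check, the run takes twelve minutes.
assert passes_per_call_check(120.0, TEN_MINUTES)
print(run_seconds(6, 120.0) / 60)  # 12.0

# Eleven steps, two retries each, a ten-minute timeout per attempt:
# the only ceiling the per-call setting actually implies, in hours.
print(worst_case_seconds(11, TEN_MINUTES, 2) / 3600)  # 5.5
```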

A deadline is not a timeout

Distributed systems solved this a decade ago, and the fix has a different name on purpose. Not a timeout — a deadline.

The distinction is exact, and gRPC states it plainly: a timeout is the maximum duration one call may take, while a deadline is "a point in time which the call should not go past." What matters is what happens when a deadline propagates. gRPC converts it to a per-hop timeout "from which the already elapsed time is already deducted." So a call handed a 30-second deadline that spends 7 seconds before fanning out to its own dependencies hands them a 23-second deadline, and if that hop spends another 4 seconds, the one after it gets 19. The budget shrinks as it's spent. Nobody downstream can promise time the run no longer has.
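The shrinking budget is easy to see in a few lines. A minimal sketch, assuming a hypothetical `Deadline` helper (not a gRPC or standard-library type) and an injectable clock so the example runs without sleeping:

```python
import time
from typing import Callable


class Deadline:
    """An absolute point on a monotonic clock. Hypothetical helper for illustration."""

    def __init__(self, budget_seconds: float, clock: Callable[[], float] = time.monotonic):
        self._clock = clock
        # Fixed once, at the top of the run. Everything downstream reads from it.
        self.expires_at = clock() + budget_seconds

    def remaining(self) -> float:
        """Seconds left before the deadline, never negative."""
        return max(0.0, self.expires_at - self._clock())

    def expired(self) -> bool:
        return self.remaining() == 0.0


# A fake clock stands in for real elapsed time.
now = [0.0]
deadline = Deadline(30.0, clock=lambda: now[0])

now[0] += 7.0                  # this hop spends 7 seconds
print(deadline.remaining())    # 23.0, the timeout handed to the next hop

now[0] += 4.0                  # the next hop spends 4 more
print(deadline.remaining())    # 19.0
```

Passing `deadline.remaining()` as the per-call timeout at each hop is what turns a fixed duration into a budget that is deducted as it is spent.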

Every serious concurrency runtime gives you this primitive. In Go, a context created with WithDeadline produces derived contexts "canceled no later than the parent," and cancelling one cancels everything beneath it. In Python, asyncio.timeout() wraps a scope of work — not a single call — and cancels the whole task inside it when the clock runs out. In JavaScript, AbortSignal.timeout() makes a run-long deadline and AbortSignal.any() composes it with each attempt's own signal, so a single expiry tears down whatever is in flight.

A timeout asks "is this call taking too long?" A deadline asks "does the run still have time for this call at all?" Only the second question has an answer an agent can act on.

The payoff is also a cost ceiling. The Google SRE book's chapter on cascading failures is blunt about why: when three stacked layers each retry three times, which is four attempts apiece, one user action becomes 4³, or sixty-four, attempts on the database at the bottom of the stack. A retry that fires without checking the remaining deadline is precisely that multiplier, which is why backpressure for agents is upstream flow control, not politer backoff. Check the budget before the retry: if what's left can't fit another attempt, don't start one.
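Gating the retry on the budget might look like the following. This is a sketch under assumptions: `call_within_deadline`, `BudgetExhausted`, and the `est_attempt_seconds` estimate are hypothetical names, and the attempt is expected to accept the remaining time as its own timeout.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


class BudgetExhausted(Exception):
    """Raised when the run has no time left for another attempt."""


def call_within_deadline(
    attempt: Callable[[float], T],
    deadline_at: float,
    est_attempt_seconds: float,
    max_attempts: int = 3,
    clock: Callable[[], float] = time.monotonic,
) -> T:
    """Retry on TimeoutError, but only while the run's deadline can fit another try."""
    last_error: Optional[BaseException] = None
    for _ in range(max_attempts):
        remaining = deadline_at - clock()
        # Check the budget before starting, not after failing.
        if remaining < est_attempt_seconds:
            raise BudgetExhausted(
                f"{remaining:.1f}s left, need about {est_attempt_seconds:.1f}s"
            ) from last_error
        try:
            # The attempt gets what is left of the run, never a fresh full timeout.
            return attempt(remaining)
        except TimeoutError as exc:
            last_error = exc
    raise BudgetExhausted("attempts exhausted") from last_error
```

With a 25-second deadline and attempts that each burn 10 seconds, the third attempt is never started: only 5 seconds remain, and the function says so instead of spending them.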

Cancelling is neither free nor clean

Here is the part the AbortController tutorials skip. Enforcing a deadline means cancelling work, and cancellation has two bills attached.

The first is literal. Killing the request does not refund the tokens. According to OpenRouter's documentation, aborting helps only when you're streaming to a provider that supports cancellation (OpenAI and Anthropic both do), and even then you pay for every token generated before the abort. For a non-streaming request, or a provider that doesn't support cancellation, the model finishes the entire response server-side and bills it in full, your dead connection notwithstanding. That is the argument for streaming by default and cancelling at the first opportunity: the abort caps the spend, it never reverses it.

The second bill is worse because it's silent. A deadline that fires in the middle of a tool call can leave that tool half-applied — the charge captured but the order not recorded, the file written but the index not updated. Cancellation is not a clean stop; it's a stop at an arbitrary point you didn't choose. Which means a deadline system is only as safe as the steps it interrupts. Side-effecting tools need to be idempotent or compensatable — a stable key so the half-done write can be retried or rolled back — for exactly the same reason a retry that resent an email needed one. A timeout you can't clean up after is just a different way to corrupt state.
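A stable key is what makes the half-done write recoverable. A minimal sketch with an in-memory store: `OrderStore`, `place_order`, and the key format are hypothetical, and a real system would need the payment provider and the database to honor the same key.

```python
import hashlib


def idempotency_key(run_id: str, step: int, tool: str) -> str:
    """Derive the same key for the same logical step, however often it is retried."""
    return hashlib.sha256(f"{run_id}:{step}:{tool}".encode()).hexdigest()


class OrderStore:
    """In-memory stand-in for a payment provider plus an orders table."""

    def __init__(self) -> None:
        self.charges: dict[str, int] = {}
        self.orders: dict[str, str] = {}

    def capture_charge(self, key: str, amount_cents: int) -> None:
        # A repeat with the same key is a no-op, not a second charge.
        self.charges.setdefault(key, amount_cents)

    def record_order(self, key: str, item: str) -> None:
        self.orders.setdefault(key, item)


def place_order(store: OrderStore, key: str, amount_cents: int, item: str,
                interrupt_after_charge: bool = False) -> None:
    """A two-write tool. The flag simulates a deadline firing between the writes."""
    store.capture_charge(key, amount_cents)
    if interrupt_after_charge:
        raise TimeoutError("deadline fired mid-tool")
    store.record_order(key, item)


store = OrderStore()
key = idempotency_key("run-42", 3, "place_order")

try:
    place_order(store, key, 1999, "widget", interrupt_after_charge=True)
except TimeoutError:
    pass  # charge captured, order not recorded: the half-applied state

place_order(store, key, 1999, "widget")  # the retry completes the work
print(len(store.charges), len(store.orders))  # 1 1
```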

What to actually build

One deadline per run, computed once at the top from the wall-clock budget you're willing to spend. Thread it through every step as an absolute time, not a duration, so each call subtracts the elapsed time itself — AbortSignal.timeout plus AbortSignal.any in JS, a deadline-bearing context in Go, an asyncio.timeout scope in Python. Before each model call and each retry, ask whether the remaining budget can cover it; if not, stop now and return what you have. Stream, so a cancellation actually stops the meter. Pair the whole thing with a step counter for the loops a clock can't catch and a cost breaker for the spend a clock won't.
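Put together, the loop might be sketched as below. Everything here is an assumption for illustration: `run_agent`, `RunResult`, the `step` callback returning `(output, cost, done)`, and the per-step time estimate. Streaming and mid-call cancellation are left out to keep the control flow visible.

```python
import time
from dataclasses import dataclass, field
from typing import Callable, Tuple


@dataclass
class RunResult:
    outputs: list = field(default_factory=list)
    stopped_by: str = "done"  # "done", "deadline", "cost", or "max_steps"


def run_agent(
    step: Callable[[float], Tuple[str, float, bool]],
    budget_seconds: float,
    max_steps: int,
    max_cost: float,
    est_step_seconds: float,
    clock: Callable[[], float] = time.monotonic,
) -> RunResult:
    """Run the loop under one absolute deadline, a step counter, and a cost breaker."""
    deadline_at = clock() + budget_seconds  # computed once, at the top
    result = RunResult()
    spent = 0.0
    for _ in range(max_steps):
        remaining = deadline_at - clock()
        if remaining < est_step_seconds:
            result.stopped_by = "deadline"  # stop now and return what we have
            return result
        if spent >= max_cost:
            result.stopped_by = "cost"
            return result
        # The step receives the time left in the run as its own timeout.
        output, cost, done = step(remaining)
        result.outputs.append(output)
        spent += cost
        if done:
            return result
    result.stopped_by = "max_steps"
    return result
```

The caller always gets a `RunResult` back, so "ran out of time" is a value it can branch on, with the partial outputs attached, and not a hang or an exception from somewhere deeper in the stack.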

The deliverable isn't a smaller timeout= value. It's a run that, when it runs out of time, knows it has run out of time — and degrades on a budget you chose, instead of hanging until something further down the stack gives up for it.