Every tutorial shows you the happy path: send a prompt, get a completion, parse it. Then you put an agent in production, it makes thousands of calls a day across a dozen tool-steps, and you discover the part nobody demoed — the calls that don't come back. Rate limits. Overloaded providers. A stream that hangs after the first token. The reliability layer between your agent and the model API is not optional plumbing; it's the difference between an agent that runs overnight and one that stalls at 2 a.m. on a 529.

The good news is that the whole problem decomposes cleanly. Almost every failure you'll see falls into one of two buckets, and the entire discipline is about not confusing them.

Retryable vs terminal: the only taxonomy that matters

A failure is either worth retrying or it isn't, and the HTTP status code usually tells you which. Retry the transient ones: 429 (rate limit), 500/502/503 (server errors), Anthropic's 529 (the service itself is overloaded), and client-side timeouts. Never retry the terminal ones: 400 and 422 (your request is malformed), 401/403 (bad key or permission), 404. Retrying a 400 is pure waste — the request is just as wrong the second time, and you've spent an attempt and a slice of your rate limit to learn nothing.
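As a rough sketch, the taxonomy above fits in a lookup table. The names here (Action, classify) are illustrative, not from any SDK, and treating unrecognized status codes as terminal is an assumption of this sketch rather than a rule from the providers.

```python
from enum import Enum
from typing import Optional


class Action(Enum):
    RETRY = "retry"
    STOP = "stop"


# Transient failures listed in the text: rate limit, server errors, overload.
RETRYABLE_STATUSES = frozenset({429, 500, 502, 503, 529})
# Terminal failures listed in the text: malformed request, auth, not found.
TERMINAL_STATUSES = frozenset({400, 401, 403, 404, 422})


def classify(status: Optional[int], timed_out: bool = False) -> Action:
    """Decide whether a failed call is worth another attempt."""
    if timed_out:
        # Client-side timeouts carry no status code but are retryable.
        return Action.RETRY
    if status in RETRYABLE_STATUSES:
        return Action.RETRY
    if status in TERMINAL_STATUSES:
        return Action.STOP
    # Assumption of this sketch: anything unrecognized fails fast.
    return Action.STOP
```

The point of keeping it this small is that the decision lives in one place, so a retry loop never has to guess.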

The subtle one is the 429-versus-529 pair, because they look almost identical and mean opposite things. A 429 is your fault: you hit your account's requests-per-minute or tokens-per-minute quota, and it arrives with a Retry-After header (or, on Anthropic, anthropic-ratelimit-*-reset timestamps) telling you exactly how long to wait. A 529 is the provider's problem: their capacity, not your quota. The 429 says "slow down"; the 529 says "try again, maybe somewhere else." Handle them the same way at your peril.
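One way to keep the two apart in code is to let the 429 branch read the server's own wait hint and send the 529 branch down a different path entirely. The sketch below handles only the standard Retry-After header (seconds or an HTTP date). The function names, the "wait"/"reroute"/"stop" labels, and the one-second default are assumptions made for illustration.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime
from typing import Mapping, Optional, Tuple


def retry_after_seconds(headers: Mapping[str, str],
                        now: Optional[datetime] = None) -> Optional[float]:
    """Return the wait the server asked for, or None if there is no usable hint."""
    value = None
    for name, raw in headers.items():
        if name.lower() == "retry-after":  # header names are case-insensitive
            value = raw.strip()
            break
    if value is None:
        return None
    try:
        return max(0.0, float(value))  # delay-seconds form
    except ValueError:
        pass
    try:
        when = parsedate_to_datetime(value)  # HTTP-date form
    except (TypeError, ValueError):
        return None
    if when.tzinfo is None:
        when = when.replace(tzinfo=timezone.utc)
    now = now or datetime.now(timezone.utc)
    return max(0.0, (when - now).total_seconds())


def next_step(status: int, headers: Mapping[str, str],
              default_wait: float = 1.0) -> Tuple[str, Optional[float]]:
    """Map a status code to the action the text describes."""
    if status == 429:
        wait = retry_after_seconds(headers)
        return ("wait", default_wait if wait is None else wait)
    if status == 529:
        return ("reroute", None)
    if status == 400:
        return ("stop", None)
    return ("unhandled", None)
```

For example, next_step(429, {"Retry-After": "7"}) yields ("wait", 7.0), while next_step(529, {}) yields ("reroute", None).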

A 429 means wait. A 529 means reroute. A 400 means stop. Confusing any two of those is how a retry loop becomes a denial-of-service attack on yourself.

Backoff needs jitter, and retries need idempotency

When you do retry, two details separate a robust client from one that makes things worse.

The first is jitter. Exponential backoff — double the wait after each failure — is standard, but backoff alone has a failure mode AWS documented years ago: when a service blips and a thousand clients all back off by the same schedule, they all wake up and retry at the same instant, a synchronized "thundering herd" that re-triggers the exact overload they were backing off from. Adding randomized jitter to each delay decorrelates the clients and smooths the retry load to a near-constant rate. This is not a micro-optimization; it's the difference between a recovery and a sustained outage.
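A minimal sketch of the idea, using the "full jitter" variant in which each delay is drawn uniformly between zero and the capped exponential ceiling. The function name, the 0.5-second base, and the 30-second cap are illustrative choices, not values from any SDK.

```python
import random
from typing import Optional


def backoff_delay(attempt: int, base: float = 0.5, cap: float = 30.0,
                  rng: Optional[random.Random] = None) -> float:
    """Full-jitter delay for a zero-based attempt number.

    The ceiling doubles with each attempt up to `cap`; the actual delay is
    uniform in [0, ceiling], so clients that failed together wake up apart.
    """
    if rng is None:
        rng = random.Random()
    ceiling = min(cap, base * (2 ** attempt))
    return rng.uniform(0.0, ceiling)
```

Passing a seeded random.Random makes the schedule reproducible in tests; in production each client should keep its own unseeded generator so the delays stay uncorrelated.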

The second is idempotency. Picture a request that times out on your end after the server already processed it. Your retry logic fires again — and now the model ran twice. For a chat completion that's a wasted dollar; for an agent whose "completion" was a tool call that charged a card or sent an email, it's a duplicated side effect. The fix is an idempotency key the server uses to dedupe. You mostly get this for free: both the OpenAI and Anthropic Python SDKs already generate a per-request key (Idempotency-Key / a stainless-python-retry-{uuid} value) and default to two automatic retries with backoff — but the moment you write your own retry loop around the raw HTTP API, that guarantee is yours to re-create.
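If you do write that loop yourself, the essential detail is that the key is generated once per logical request and reused on every attempt. The sketch below assumes a hypothetical send callable that takes headers and performs the HTTP call, and a hypothetical TransientError raised for retryable failures. Deduplication only happens if the server on the other end actually honors the header.

```python
import uuid
from typing import Callable, Dict


class TransientError(Exception):
    """Hypothetical marker for a retryable failure (timeout, 429, 5xx, 529)."""


def call_with_idempotency(send: Callable[[Dict[str, str]], str],
                          max_attempts: int = 3) -> str:
    """Retry `send`, presenting the same idempotency key on every attempt."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    # Generated once, outside the loop: a fresh key per attempt would defeat
    # the purpose, because the server could not tell a retry from a new request.
    headers = {"Idempotency-Key": str(uuid.uuid4())}
    last_error: Exception = TransientError("no attempt made")
    for _ in range(max_attempts):
        try:
            return send(dict(headers))
        except TransientError as exc:
            last_error = exc
            # A real loop would sleep here with jittered backoff.
    raise last_error
```

The same principle applies one level down: if a tool call charges a card or sends an email, the tool's own downstream API needs a stable key too.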

Fallback chains: where a 200 lies to you

The natural next move for availability is a fallback chain: if the primary model errors, route to a backup. Gateways like LiteLLM and Portkey make this a config line — an ordered list of models, with fallbacks firing after the retries are exhausted.

Here's the trap, and it's the most important idea in this piece. Those gateways trigger fallbacks on the HTTP status code. A 200 counts as success. So when your frontier model is down and traffic spills to a cheaper, weaker backup, the request returns 200 — and your monitoring stays green — while the output quietly stops conforming to your JSON schema, drops fields your downstream code depends on, or reasons its way to a worse answer. Availability went up; correctness went down; nothing alerted.

The fix is to stop treating 200 as the success signal for a fallback. Gate the fallback on output validity — does the response parse, match the schema, pass the cheap sanity check — not merely on the status line. A fallback model that can't satisfy the contract should count as a failure and cascade to the next option, or surface an error, rather than smuggling a degraded answer through under a green light.
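Here is one hedged way to express that gate. Each "model" is just a callable that takes a prompt and returns raw text, standing in for whatever client or gateway you use, and the validity check is deliberately cheap: valid JSON, an object, with the required keys present. The function names and the required-keys contract are assumptions for illustration.

```python
import json
from typing import Any, Callable, Dict, List, Optional, Sequence


def parse_if_valid(raw: str, required_keys: Sequence[str]) -> Optional[Dict[str, Any]]:
    """Return the parsed object if it satisfies the contract, else None."""
    try:
        data = json.loads(raw)
    except (json.JSONDecodeError, TypeError):
        return None
    if not isinstance(data, dict):
        return None
    if any(key not in data for key in required_keys):
        return None
    return data


def call_with_validated_fallback(models: Sequence[Callable[[str], str]],
                                 prompt: str,
                                 required_keys: Sequence[str]) -> Dict[str, Any]:
    """Try each model in order; a response counts only if it validates."""
    errors: List[str] = []
    for index, call in enumerate(models):
        try:
            raw = call(prompt)
        except Exception as exc:  # transport or provider failure
            errors.append(f"model {index}: {exc!r}")
            continue
        data = parse_if_valid(raw, required_keys)
        if data is not None:
            return data
        # A "successful" response that breaks the contract is still a failure.
        errors.append(f"model {index}: output failed validation")
    raise RuntimeError("all models failed: " + "; ".join(errors))
```

The collected error list matters as much as the cascade: it is what turns a silent downgrade into something your monitoring can see.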

Bound time twice, and break the circuit

Two last pieces. Timeouts for an agent come in pairs: a per-request deadline and a separate time-to-first-token bound on streaming calls, because a stream that connects but never emits will hang your whole ReAct loop indefinitely. The SDK default timeout of 10 minutes, multiplied by three attempts (the original call plus two retries), can cost up to 30 minutes on a single hung step, and a dozen tool-steps can turn that into hours, so wrap the loop in a wall-clock deadline too.
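Both bounds can be sketched with the standard library alone. The Deadline class below is a hypothetical wall-clock budget shared across one agent run, and with_first_token_timeout wraps any async stream of chunks so that only the wait for the first chunk is bounded. Neither name comes from an SDK, and a real client would combine them with its own per-request timeout setting.

```python
import asyncio
import time
from typing import AsyncIterator, Callable


class Deadline:
    """Wall-clock budget shared by every step of one agent run."""

    def __init__(self, budget_s: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._clock = clock
        self._expires_at = clock() + budget_s

    def remaining(self) -> float:
        return max(0.0, self._expires_at - self._clock())

    def timeout_for_request(self, per_request_s: float) -> float:
        """Per-request timeout, shrunk so it never outlives the run budget."""
        left = self.remaining()
        if left <= 0:
            raise TimeoutError("agent run exceeded its wall-clock budget")
        return min(per_request_s, left)


async def with_first_token_timeout(stream: AsyncIterator[str],
                                   ttft_s: float) -> AsyncIterator[str]:
    """Yield chunks from `stream`, failing if the first one is too slow."""
    iterator = stream.__aiter__()
    try:
        # Only the first chunk is bounded; later chunks are left to the
        # per-request deadline.
        first = await asyncio.wait_for(iterator.__anext__(), timeout=ttft_s)
    except StopAsyncIteration:
        return  # the stream ended without emitting anything
    yield first
    async for chunk in iterator:
        yield chunk
```

In an agent loop, each step would ask the shared Deadline for its timeout before calling the model, so a slow early step automatically leaves less room for later ones.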

And when a provider is genuinely down, stop knocking. A circuit breaker — Fowler's closed/open/half-open pattern — trips after a run of failures and fails fast in the "open" state instead of paying full backoff on every single step. Pair it with your retries: retries handle the blip, the breaker handles the outage, and your agent degrades on purpose instead of by accident.
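A compact sketch of the three states, with an injectable clock so it can be tested without sleeping. The class name, the failure threshold, and the reset interval are illustrative. This version also lets any number of calls through while half-open, where a production breaker would usually allow a single probe.

```python
import time
from typing import Any, Callable, Optional


class CircuitOpenError(Exception):
    """Raised instead of calling the provider while the circuit is open."""


class CircuitBreaker:
    def __init__(self, failure_threshold: int = 5, reset_after_s: float = 30.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.failure_threshold = failure_threshold
        self.reset_after_s = reset_after_s
        self._clock = clock
        self._failures = 0
        self._opened_at: Optional[float] = None  # None means closed

    @property
    def state(self) -> str:
        if self._opened_at is None:
            return "closed"
        if self._clock() - self._opened_at >= self.reset_after_s:
            return "half-open"  # cool-down elapsed: allow a probe
        return "open"

    def call(self, fn: Callable[[], Any]) -> Any:
        if self.state == "open":
            raise CircuitOpenError("provider marked down; failing fast")
        try:
            result = fn()
        except Exception:
            self._record_failure()
            raise
        # Any success closes the circuit and clears the failure count.
        self._failures = 0
        self._opened_at = None
        return result

    def _record_failure(self) -> None:
        self._failures += 1
        # A failed half-open probe, or too many consecutive failures,
        # (re)opens the circuit and restarts the cool-down.
        if self._opened_at is not None or self._failures >= self.failure_threshold:
            self._opened_at = self._clock()
```

Wrapping the retrying call in breaker.call(...) gives the layering described above: retries absorb the blip inside fn, and the breaker skips fn altogether once the failures pile up.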

None of this is exotic. It's the unglamorous layer that decides whether your agent is something you can leave running — or something you have to babysit.