The reliability conversation about agents is stuck on the wrong layer. Most of it is about the model: better prompting, more eval coverage, a smarter checkpoint next quarter. All worth doing, all subject to a hard ceiling — a long task is a chain of steps, and per-step error compounds whether you like it or not. Ten steps at 85% each is about 20% end to end. You cannot prompt your way out of multiplication.
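The multiplication is quick to check. A minimal sketch, assuming each step succeeds or fails independently of the others (the function name is illustrative):

```python
def end_to_end_success(per_step: float, steps: int) -> float:
    """Probability that every step in a chain succeeds, assuming independent steps."""
    return per_step ** steps


# How fast an 85% step decays as the chain gets longer.
for n in (1, 5, 10, 20):
    print(f"{n:>2} steps at 85%: {end_to_end_success(0.85, n):.1%}")
```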
So stop trying to make the step reliable and make the retry free. That is a solved problem in distributed systems, and it predates LLMs by a decade. It's called durable execution: the engine persists the result of every completed step, so when step seven blows up, the workflow resumes from step seven — not step one — with steps one through six already done and never re-run. The agent stays stochastic. The orchestration around it stops being.
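To put a rough number on "resumes from step seven, not step one", here is a back-of-the-envelope comparison for the ten-step, 85% chain. It assumes independent failures and unlimited retries, and it counts every step execution, failed or not. Both function names are illustrative.

```python
def attempts_with_resume(p: float, steps: int) -> float:
    """Expected step executions when a failed step is retried in place.

    Each step is a geometric trial, so it needs 1/p attempts on average.
    """
    return steps / p


def attempts_with_restart(p: float, steps: int) -> float:
    """Expected step executions when any failure restarts the chain from step one.

    This is the expected number of trials needed to see `steps` successes
    in a row: (p**-steps - 1) / (1 - p).
    """
    return (p ** -steps - 1) / (1 - p)


p, steps = 0.85, 10
print(f"resume:  {attempts_with_resume(p, steps):.1f} step executions")
print(f"restart: {attempts_with_restart(p, steps):.1f} step executions")
```

Under those assumptions, resuming costs roughly twelve step executions on average and restarting costs roughly twenty-seven, and the gap widens quickly as the chain gets longer.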
The model's job is to be occasionally brilliant. The workflow's job is to make sure occasional is enough.
The move: checkpoint the steps, make them idempotent
The core idea is small. You write your agent loop as ordinary code, but every side-effecting step (an API call, a tool invocation, a model completion) is wrapped so its result is journaled the first time it succeeds. Crash, timeout, transient 500, rate limit, a process that dies and respawns: on replay, journaled steps return their saved value instead of executing again. The booking isn't made twice. The email isn't sent twice. The expensive completion isn't paid for twice. Failure stops being catastrophic and becomes a resume.
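The mechanism fits in a few dozen lines. What follows is a toy sketch, not any particular engine's API: the Journal class, the step names, and the flaky booking function are all hypothetical, and a real engine adds retry policies, timeouts, and a proper store on top.

```python
import json
import os
import tempfile
from typing import Any, Callable


class Journal:
    """File-backed record of completed steps (hypothetical, not a real library)."""

    def __init__(self, path: str) -> None:
        self.path = path
        self.results: dict[str, Any] = {}
        if os.path.exists(path):
            with open(path) as f:
                self.results = json.load(f)

    def step(self, name: str, fn: Callable[[], Any]) -> Any:
        # Replay: a journaled step returns its saved value without running again.
        if name in self.results:
            return self.results[name]
        value = fn()  # may raise; nothing is recorded on failure
        self.results[name] = value
        # Write to a temp file and rename so a crash never leaves a torn journal.
        tmp = self.path + ".tmp"
        with open(tmp, "w") as f:
            json.dump(self.results, f)
        os.replace(tmp, self.path)
        return value


calls = {"search": 0, "book": 0}
flaky = {"fail_once": True}


def search_flights() -> str:
    calls["search"] += 1
    return "FL123"


def book_flight() -> str:
    calls["book"] += 1
    if flaky["fail_once"]:
        flaky["fail_once"] = False
        raise TimeoutError("transient failure")
    return "confirmed"


def run_agent(path: str) -> str:
    journal = Journal(path)  # a fresh process would reload the journal like this
    flight = journal.step("search", search_flights)
    status = journal.step("book", book_flight)
    return f"{flight}:{status}"


journal_path = os.path.join(tempfile.mkdtemp(), "journal.json")
try:
    run_agent(journal_path)
except TimeoutError:
    pass  # the first run dies at the booking step
result = run_agent(journal_path)  # resume: search is served from the journal
print(result, calls)
```

The search runs once and the booking runs twice: one failed attempt, one success. The second run never touches the search step, which is the whole point.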
That last clause is the whole game. An agent that fails 15% of the time per step but resumes for free is, from the outside, an agent that always finishes — it just sometimes takes a few extra tries on a few steps. You have traded a reliability problem you can't solve (the model) for an infrastructure problem you can (durable state). Four projects, four bets on how heavy that infrastructure should be:
step.run is memoized and retried independently, with built-in concurrency limits, fan-out, and a dev UI that shows exactly which step replayed. The most ergonomic on-ramp if your stack is already event-driven.
Where the abstraction leaks
Two honest caveats, because this is The Stack and not a sales deck.
First, idempotency is on you. The engine guarantees that a step whose result has been journaled is never executed again; it cannot guarantee the world outside saw that step only once. If your "send refund" step succeeds and the process dies before the result is journaled, a naive retry double-refunds. You need idempotency keys on the external effects, and that discipline doesn't come from the library. It comes from you thinking about every side effect the agent can have. Durable execution makes correctness possible, not automatic.
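Here is roughly what that discipline looks like, sketched against a fake provider. FakePaymentProvider, the workflow ID, and the key format are all made up for illustration; the assumption is that the real external API accepts an idempotency key and deduplicates on it, as many payment APIs do.

```python
class FakePaymentProvider:
    """Stand-in for an external API that deduplicates on an idempotency key."""

    def __init__(self) -> None:
        self.refunds: dict[str, int] = {}  # idempotency key -> amount refunded

    def refund(self, amount_cents: int, idempotency_key: str) -> str:
        # A repeated key returns the original outcome instead of paying again.
        if idempotency_key not in self.refunds:
            self.refunds[idempotency_key] = amount_cents
        return f"refund:{idempotency_key}"

    def total_refunded(self) -> int:
        return sum(self.refunds.values())


provider = FakePaymentProvider()

# Derive the key from stable workflow identity, never from a fresh random value
# generated inside the step, or each retry would look like a new request.
workflow_id = "order-1042"  # hypothetical workflow identifier
key = f"{workflow_id}:send-refund"

# First attempt: the refund lands, then the process "dies" before journaling.
provider.refund(2500, idempotency_key=key)

# Retry after restart: same key, so the provider treats it as a duplicate.
receipt = provider.refund(2500, idempotency_key=key)

print(receipt, provider.total_refunded())
```

The retry still happens. It just stops mattering, because the key turns the second request into a lookup.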
Second, non-determinism fights replay. Deterministic-replay engines (Temporal's model) require the workflow body to be reproducible: same inputs, same path. An LLM call is the opposite of deterministic, which is exactly why you push it to the edge — wrap the completion as a journaled step whose output is recorded, so replay reads the saved tokens instead of re-sampling. Keep the stochastic thing at the boundary and the deterministic skeleton in the middle. Get that inversion wrong and your "durable" workflow takes a different branch on every recovery.
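A small sketch of that inversion, with an in-memory dict standing in for the durable journal and a random choice standing in for the model. The helper names and the ticket-triage branches are hypothetical.

```python
import random
from typing import Callable

journal: dict[str, str] = {}  # in-memory stand-in for a durable journal


def journaled(name: str, fn: Callable[[], str]) -> str:
    """Run fn once; later calls with the same name return the recorded value."""
    if name not in journal:
        journal[name] = fn()
    return journal[name]


def fake_model_call() -> str:
    """Hypothetical stand-in for a sampled LLM completion."""
    return random.choice(["refund", "escalate", "ask_for_details"])


def workflow() -> str:
    # The stochastic call sits at the boundary, behind the journal.
    decision = journaled("classify_ticket", fake_model_call)
    # The skeleton branches only on recorded values, so replay takes the same path.
    if decision == "refund":
        return "path: issue refund"
    if decision == "escalate":
        return "path: page a human"
    return "path: reply with questions"


first = workflow()
replays = [workflow() for _ in range(20)]  # simulate repeated recoveries
print(first, all(r == first for r in replays))
```

Call fake_model_call directly inside the branch instead and the twenty replays scatter across all three paths, which is the failure mode described above.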
The reason this matters now is the same reason the capability curves are flattering: models are improving fastest at exactly the short, single-step horizon where they were already good, and reliability over long horizons is lagging structurally. Waiting for the model to close that gap is a roadmap. Wrapping the model so its failures are cheap is a Tuesday. One of those ships.



