The Agent Forgets, the Workflow Remembers

The reliability conversation about agents is stuck on the wrong layer. Most of it is about the model: better prompting, more eval coverage, a smarter checkpoint next quarter. All worth doing, all subject to a hard ceiling — a long task is a chain of steps, and per-step error compounds whether you like it or not. Ten steps at 85% each is about 20% end to end. You cannot prompt your way out of multiplication.
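The multiplication is worth seeing once. A minimal sketch, assuming step failures are independent (real agents may do better or worse when errors correlate); the 90% target is an illustrative number, not one from any benchmark:

```python
# Per-step success rate and chain length from the paragraph above.
per_step = 0.85
steps = 10

# Independent steps multiply: every step has to succeed for the task to succeed.
end_to_end = per_step ** steps

# Inverse question: what per-step rate would a 90% end-to-end target need?
target = 0.90
required_per_step = target ** (1 / steps)

print(f"end to end: {end_to_end:.3f}")
print(f"per-step rate needed for {target:.0%}: {required_per_step:.4f}")
```

Under these assumptions the chain finishes about one run in five, and a 90% end-to-end target would need each step to succeed roughly 99% of the time.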

So stop trying to make the step reliable and make the retry free. That is a solved problem in distributed systems, and it predates LLMs by a decade. It's called durable execution: the engine persists the result of every completed step, so when step seven blows up, the workflow resumes from step seven — not step one — with steps one through six already done and never re-run. The agent stays stochastic. The orchestration around it stops being.

The model's job is to be occasionally brilliant. The workflow's job is to make sure occasional is enough.

The move: checkpoint the steps, make them idempotent

The core idea is small. You write your agent loop as ordinary code, but every side-effecting step — an API call, a tool invocation, a model completion — is wrapped so its result is journaled the first time it succeeds. Crash, timeout, transient 500, hit-the-rate-limit, the process dies and respawns: on replay, journaled steps return their saved value instead of executing again. The booking isn't made twice. The email isn't sent twice. The expensive completion isn't paid for twice. Failure stops being catastrophic and becomes a resume.
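The mechanism fits in a few dozen lines. What follows is a toy sketch, not the API of any engine named in this piece: the Journal class, the step names, and the fake search and book functions are all hypothetical, and a real engine would write the journal atomically to a database rather than to a JSON file.

```python
import json
import os
import tempfile

class Journal:
    """Hypothetical file-backed journal: step name -> saved result."""

    def __init__(self, path):
        self.path = path
        self.results = {}
        if os.path.exists(path):
            with open(path) as f:
                self.results = json.load(f)

    def step(self, name, fn):
        # Replay: a journaled step returns its saved value and fn never runs.
        if name in self.results:
            return self.results[name]
        value = fn()  # may raise; nothing is recorded in that case
        self.results[name] = value
        with open(self.path, "w") as f:
            json.dump(self.results, f)  # persist before moving on (not atomic here)
        return value

calls = {"search": 0, "book": 0}
fail_once = {"book": True}  # simulate a single transient failure

def search():
    calls["search"] += 1
    return "flight-123"

def book():
    calls["book"] += 1
    if fail_once["book"]:
        fail_once["book"] = False
        raise TimeoutError("transient failure")
    return "confirmed"

def workflow(journal):
    flight = journal.step("search", search)
    return journal.step("book", lambda: f"{flight}:{book()}")

path = os.path.join(tempfile.mkdtemp(), "journal.json")

try:
    workflow(Journal(path))  # first run dies at the booking step
except TimeoutError:
    pass

# Simulated restart: a new Journal reloads the file and the workflow resumes.
result = workflow(Journal(path))
print(result, calls)
```

The search step executes once across both runs; only the failed booking step runs again.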

That last sentence is the whole game. An agent that fails 15% of the time per step but resumes for free is, from the outside, an agent that almost always finishes; it just sometimes takes a few extra tries on a few steps. If retries are independent, each step needs about 1.18 attempts on average (1/0.85), so a ten-step task costs roughly twelve step executions instead of finishing one run in five. You have traded a reliability problem you can't solve (the model) for an infrastructure problem you can (durable state). Four projects, four bets on how heavy that infrastructure should be:

▟ temporalio/temporal

The heavyweight: workflows-as-code with deterministic replay across many languages. Your agent loop becomes a durable function; the cluster guarantees the workflow runs to completion even across worker crashes, replaying recorded results instead of re-running completed activities. Operationally serious (you run a cluster), and the standard the others measure against.

★ 21k · Go · temporalio/temporal

▟ restatedev/restate

The single-binary answer. Durable functions, durable promises, and "virtual objects" for per-key state, in one Rust process with a much lighter footprint than a Temporal cluster. Good fit when you want durable execution without standing up orchestration infra you'll have to babysit.

★ 4k · Rust · restatedev/restate

▟ inngest/inngest

Step functions for the serverless-and-events world. You define an agent workflow as event-triggered steps; each step.run is memoized and retried independently, with built-in concurrency limits, fan-out, and a dev UI that shows exactly which step replayed. The most ergonomic on-ramp if your stack is already event-driven.

★ 5.5k · Go · inngest/inngest

▟ dbos-inc/dbos-transact-py

The minimalist take: durable workflows as a library on top of Postgres, no separate service. Decorate a function and its steps checkpoint into your existing database; recovery is a query, not a cluster. The "10x less code" pitch is real if you already run Postgres and want durability without new infrastructure.

★ 1.4k · Python · dbos-inc/dbos-transact-py

Where the abstraction leaks

Two honest caveats, because this is The Stack and not a sales deck.

First, idempotency is on you. The engine guarantees that a step with a journaled result is never executed again; it cannot guarantee that the outside world saw that step only once. If your "send refund" step succeeds and the process dies before the result is journaled, a naive retry double-refunds. You need idempotency keys on the external effects, and that discipline doesn't come from the library; it comes from you thinking about every side effect the agent can have. Durable execution makes correctness possible, not automatic.
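Here is roughly what that discipline looks like, as a sketch. FakePaymentProvider is a hypothetical stand-in for an external API that honors idempotency keys, and the key scheme (workflow ID plus step name) is an assumption for illustration, not a convention taken from any of the projects above.

```python
import hashlib

class FakePaymentProvider:
    """Hypothetical stand-in for an external API that honors idempotency keys."""

    def __init__(self):
        self.seen = {}  # idempotency key -> first response
        self.total_refunded = 0

    def refund(self, amount, idempotency_key):
        # A repeated key returns the original response with no new effect.
        if idempotency_key in self.seen:
            return self.seen[idempotency_key]
        self.total_refunded += amount
        response = {"refund_id": f"rf_{len(self.seen) + 1}", "amount": amount}
        self.seen[idempotency_key] = response
        return response

def idempotency_key(workflow_id, step_name):
    # Derived only from stable identifiers, so every retry of the same step
    # in the same workflow run produces the same key.
    return hashlib.sha256(f"{workflow_id}:{step_name}".encode()).hexdigest()

provider = FakePaymentProvider()
key = idempotency_key("order-42", "send_refund")

first = provider.refund(50, key)  # succeeds, then the process dies before journaling
retry = provider.refund(50, key)  # recovery re-executes the unjournaled step

print(first == retry, provider.total_refunded)
```

The retry gets the original response back and the customer is refunded once. The important property is that the key is computed from things that survive a crash, never from a random value generated inside the step.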

Second, non-determinism fights replay. Deterministic-replay engines (Temporal's model) require the workflow body to be reproducible: same inputs, same path. An LLM call is the opposite of deterministic, which is exactly why you push it to the edge — wrap the completion as a journaled step whose output is recorded, so replay reads the saved tokens instead of re-sampling. Keep the stochastic thing at the boundary and the deterministic skeleton in the middle. Get that inversion wrong and your "durable" workflow takes a different branch on every recovery.
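The inversion can be sketched without any engine at all. In this toy version the journal is a plain dict, fake_completion is a hypothetical stand-in for a model call, and the ticket-handling branches are invented for illustration:

```python
import random

samples = {"count": 0}

def fake_completion(prompt):
    # Stand-in for a model call: sampled, so two calls can disagree.
    samples["count"] += 1
    return random.choice(["approve", "escalate", "reject"])

def run(journal):
    # Stochastic boundary: sample once, record the output, reuse it on replay.
    if "decide" not in journal:
        journal["decide"] = fake_completion("Handle ticket 17")
    decision = journal["decide"]

    # Deterministic skeleton: the path depends only on journaled values.
    if decision == "approve":
        return ["issue_credit", "close_ticket"]
    if decision == "escalate":
        return ["page_human"]
    return ["send_denial", "close_ticket"]

journal = {}
first_path = run(journal)

# Twenty simulated recoveries against the same journal.
replays = [run(journal) for _ in range(20)]

print(all(path == first_path for path in replays), samples["count"])
```

Every recovery takes the branch the first run took, and the model is sampled once. Call fake_completion directly inside the branching logic instead and each replay is free to wander.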

The reason this matters now is the same reason the capability curves are flattering: models are improving fastest at exactly the short, single-step horizon where they were already good, and reliability over long horizons is lagging structurally. Waiting for the model to close that gap is a roadmap. Wrapping the model so its failures are cheap is a Tuesday. One of those ships.

Dex Mareno
