A web request that dies halfway through is annoying. An agent that dies halfway through has already spent money and, more often than developers expect, already touched the outside world — it read a row, called a paid model three times, filed a ticket, maybe charged a card. "Retry" is the wrong verb for that situation, because retry means from the top, and from the top means charging the card again. What you want is resume: continue from what the agent had already done, redoing none of it.

There are two ways to build that, and the interesting part is that they fail in opposite directions.

Snapshot the state, or replay the code

The first way is state snapshotting. LangGraph does this with checkpointers: compile a graph with a saver and the runtime writes a snapshot of the entire graph state — channel values and the set of next nodes — after every super-step, keyed by a thread. Crash, restart, hand it the same thread ID, and it loads the last snapshot and picks up at the pending node. Completed nodes are not re-run; their outputs are already baked into the restored state. Swap InMemorySaver for SqliteSaver or PostgresSaver and the same snapshots outlive the process.
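The snapshot mechanism can be sketched without any library at all. What follows is a toy model, not LangGraph's API: `ToySaver`, `run`, and the three node functions are hypothetical names, and each "super-step" here is a single node, whereas a real graph may run several nodes in one super-step. The shape is the same, though: state plus a pointer to the pending node, written after every step and keyed by a thread ID.

```python
import copy


class ToySaver:
    """Toy stand-in for a checkpointer: keeps the latest snapshot per thread ID."""

    def __init__(self):
        self.snapshots = {}  # thread_id -> {"state": dict, "next": index of pending node}

    def put(self, thread_id, state, next_index):
        self.snapshots[thread_id] = {"state": copy.deepcopy(state), "next": next_index}

    def get(self, thread_id):
        return copy.deepcopy(self.snapshots.get(thread_id))


def run(nodes, thread_id, saver, crash_before=None):
    """Run nodes in order, snapshotting after each; start from the snapshot if one exists."""
    snap = saver.get(thread_id) or {"state": {}, "next": 0}
    state = snap["state"]
    for i in range(snap["next"], len(nodes)):
        node = nodes[i]
        if node.__name__ == crash_before:
            raise RuntimeError(f"simulated crash before {node.__name__}")
        state.update(node(state))
        saver.put(thread_id, state, i + 1)  # snapshot: state + pending node
    return state


calls = []  # records which nodes actually executed


def fetch(state):
    calls.append("fetch")
    return {"row": {"id": 7}}


def draft(state):
    calls.append("draft")
    return {"draft": f"Re: ticket {state['row']['id']}"}


def send(state):
    calls.append("send")
    return {"sent": True}


saver = ToySaver()
nodes = [fetch, draft, send]
try:
    run(nodes, "thread-1", saver, crash_before="send")  # dies with two steps done
except RuntimeError:
    pass
final = run(nodes, "thread-1", saver)  # same thread ID: resumes at the pending node
```

After the second call, `calls` holds each node exactly once: `fetch` and `draft` were not re-run because their outputs were already in the restored state. Swapping the in-memory dict for a database table is what would let the snapshots outlive the process.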

The second way is durable execution — Temporal, and newer entrants like DBOS and Restate. Here nothing is snapshotted. The engine records an ordered event history — workflow started, activity scheduled, activity completed, timer fired — and on recovery it replays your code from the beginning, comparing the operations your code emits against that history. Where an operation already has a matching event, the engine feeds the recorded result straight back in without doing the work again. Run the code, re-derive the state.
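Replay can be modeled just as compactly. The sketch below is a toy engine, not Temporal's, DBOS's, or Restate's API: `ReplayContext`, `activity`, and the workflow functions are hypothetical names, and the "history" is a plain list standing in for a durable event store. It shows the core move: the code always runs from the top, and any operation that already has a matching event gets its recorded result instead of being executed.

```python
class ReplayContext:
    """Toy engine: replays against an ordered history of completed operations."""

    def __init__(self, history):
        self.history = history  # stands in for durable storage
        self.cursor = 0

    def activity(self, name, fn, *args):
        if self.cursor < len(self.history):
            event = self.history[self.cursor]
            if event["name"] != name:
                raise RuntimeError(
                    f"history has {event['name']!r}, code asked for {name!r}"
                )
            self.cursor += 1
            return event["result"]  # recorded result, no work done
        result = fn(*args)  # first time through: do the work and record it
        self.history.append({"name": name, "result": result})
        self.cursor += 1
        return result


executed = []  # records which operations actually ran


def read_row(row_id):
    executed.append("read_row")
    return {"id": row_id, "email": "a@example.com"}


def call_model(row):
    executed.append("call_model")
    return f"Hello {row['email']}"


def workflow(ctx, crash_after=None):
    row = ctx.activity("read_row", read_row, 7)
    if crash_after == "read_row":
        raise RuntimeError("simulated crash")
    return ctx.activity("call_model", call_model, row)


history = []
try:
    workflow(ReplayContext(history), crash_after="read_row")
except RuntimeError:
    pass
result = workflow(ReplayContext(history))  # replays from the beginning
```

On the second run `read_row` is not executed again; its result comes out of `history`, and only `call_model` does real work. No state was saved anywhere: `row` was re-derived by running the code.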

Diagrid, whose whole pitch is durable agents, likes to needle the first camp with a slogan: checkpoints are not durable execution. They have a point about the mechanism. But both techniques are trying to buy the same thing — a run that can pick up where it fell over — and each one buys it with a different, very specific liability.

The checkpointer's liability: the super-step is atomic

A LangGraph checkpoint is written between super-steps, which means the super-step is the unit of durability. Now picture a single node that does three things: call the model, send the email the model drafted, return. That node is atomic to the checkpointer — there is no snapshot inside it. If the process dies after the email goes out but before the checkpoint commits, resume re-executes the whole node from the top.

The model gets called again: new cost and, quite possibly, a different answer. And the email? It goes out a second time.

Nothing about the checkpointer is broken here; you simply asked for durability at a granularity coarser than your side effect. The fix is on you: make the node idempotent with an idempotency key, or split the irreversible action into its own node so the checkpoint boundary lands immediately after it. The checkpointer will faithfully resume — it just can't know that "send email" was the part you couldn't afford to repeat.
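Here is a minimal sketch of the idempotency-key fix, under some assumptions. `ToyMailer` is a hypothetical email provider that dedupes on a caller-supplied key, `fake_model` stands in for a sampled LLM, and the key format is made up for illustration. The detail worth noticing is that the key is derived from the thread and the node, not from the draft, because the draft changes when the node is re-executed.

```python
import itertools


class ToyMailer:
    """Hypothetical email provider that dedupes on an idempotency key."""

    def __init__(self):
        self.delivered = {}  # key -> body that was actually sent

    def send(self, key, body):
        if key in self.delivered:
            return "duplicate-ignored"
        self.delivered[key] = body
        return "sent"


_counter = itertools.count(1)


def fake_model(prompt):
    # Returns a different completion on every call, like a sampled LLM.
    return f"draft #{next(_counter)} for {prompt}"


def draft_and_send(state, mailer, thread_id):
    body = fake_model(state["ticket"])
    # Stable across re-executions: built from thread and node, never from the draft.
    key = f"{thread_id}:draft_and_send"
    return {"status": mailer.send(key, body)}


mailer = ToyMailer()
state = {"ticket": "T-42"}

# First execution: the email goes out, then the process dies before the checkpoint commits.
first = draft_and_send(state, mailer, "thread-1")
# Resume: the whole node runs again from the top.
second = draft_and_send(state, mailer, "thread-1")
```

The second execution still pays for a model call and gets a different draft, but the recipient sees only the first email. A node that legitimately runs more than once per thread would need a step counter in the key as well. Splitting the send into its own node is the alternative that also avoids the repeated model call.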

The replay trap: you cannot replay an LLM

Durable execution moves that ownership into the engine, and charges a tax for it. Replay only works if the code is deterministic: the same history must drive the same sequence of operations every time. Temporal is explicit that a workflow definition "must be deterministic," and that any operation which could take a new path — a clock read, a random number, a network call — has to be pushed into an Activity, which runs outside the replay path and whose result is written to history.

An LLM call is the most non-deterministic operation in your program. Put it directly in workflow code and the first replay will call the model again, get a different completion, emit a different next step than the one in history, and the engine will halt with a non-determinism error. That is the replay trap, and it is the whole reason durable-execution integrations for agents — Pydantic AI's Temporal support, the durable-execution layer that landed in Vercel AI SDK 7 — exist: they quarantine every model and tool call as an activity so the result is recorded once and returned from history forever after. The orchestration replays; the thinking never does.
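The trap is easy to reproduce in a toy engine. Nothing below is Temporal's API: `Replayer`, `NonDeterminismError`, and both workflows are hypothetical, and `flaky_model` is a stand-in that returns a different tool choice on every call. The only difference between the two workflows is whether the model call sits in workflow code or behind the recording boundary.

```python
import itertools


class NonDeterminismError(Exception):
    pass


class Replayer:
    """Toy engine: history is a list of (operation name, result) pairs."""

    def __init__(self, history):
        self.history = history
        self.cursor = 0

    def activity(self, name, fn):
        if self.cursor < len(self.history):
            recorded_name, result = self.history[self.cursor]
            if recorded_name != name:
                raise NonDeterminismError(f"expected {recorded_name!r}, got {name!r}")
            self.cursor += 1
            return result
        result = fn()
        self.history.append((name, result))
        self.cursor += 1
        return result


model_calls = []
_n = itertools.count(1)


def flaky_model():
    choice = f"tool-{next(_n)}"  # a different answer every time
    model_calls.append(choice)
    return choice


def bad_workflow(ctx):
    tool = flaky_model()  # called in workflow code: runs again on every replay
    return ctx.activity(f"run:{tool}", lambda: f"ran {tool}")


def good_workflow(ctx):
    tool = ctx.activity("model", flaky_model)  # recorded once, read back on replay
    return ctx.activity(f"run:{tool}", lambda: f"ran {tool}")


bad_history = []
bad_workflow(Replayer(bad_history))
try:
    bad_workflow(Replayer(bad_history))  # replay picks a different tool
    trapped = False
except NonDeterminismError:
    trapped = True

good_history = []
first = good_workflow(Replayer(good_history))
replayed = good_workflow(Replayer(good_history))  # no model call this time
```

The bad workflow diverges on its first replay. The good one returns the same result both times, and the model runs once for it, not twice.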

What you're actually saving is the observation

Line the two failure modes up and they rhyme. The checkpointer's rule: don't let a re-executed step repeat a side effect. The durable engine's rule: don't let replay repeat a non-deterministic call. Both are the same rule wearing different clothes — never redo a step whose output you can't reproduce — and both solve it the same way, by persisting that output and continuing past it.

Which tells you what the real state of an agent is. It is not the prompt, and it is not the code; those are cheap and you have them already. The irreplaceable state is everything the agent has observed — every tool result, every model completion, every row it read. Those are the facts a rerun cannot regenerate, because regenerating them means paying for them again or, for a non-deterministic call, getting a different fact entirely.

So the design question isn't "checkpoints or durable execution." It's: where do my observations get written, and can I continue from them without re-taking any action that already happened? Get that right and the same machinery gives you crash recovery, effectively-once side effects (provided the actions at the boundary are idempotent), and human-in-the-loop almost for free, because a person hitting "approve" is just another durable wait: an agent parked on the last thing it saw, holding its place in the log.
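One way to make "where do my observations get written" concrete is an append-only journal. The sketch below is an illustration only: the `Journal` class, its JSON Lines format, the `Parked` exception, and the agent itself are all assumptions, not any framework's API. Every observation is written before the agent moves past it, and a human approval is just one more record the agent waits for.

```python
import json
import os
import tempfile


class Journal:
    """Append-only observation log in a JSON Lines file (hypothetical format)."""

    def __init__(self, path):
        self.path = path
        self.seen = {}
        if os.path.exists(path):
            with open(path) as f:
                for line in f:
                    rec = json.loads(line)
                    self.seen[rec["key"]] = rec["value"]

    def record(self, key, value):
        with open(self.path, "a") as f:
            f.write(json.dumps({"key": key, "value": value}) + "\n")
            f.flush()
            os.fsync(f.fileno())  # on disk before we continue past it
        self.seen[key] = value

    def observe(self, key, fn):
        """Return the recorded observation, or take it once and record it."""
        if key not in self.seen:
            self.record(key, fn())
        return self.seen[key]


class Parked(Exception):
    """Raised when the agent is waiting on input that is not in the journal yet."""


def agent(journal, charges):
    quote = journal.observe("quote", lambda: {"amount": 120})
    if "approval" not in journal.seen:
        raise Parked("waiting for a human decision")
    if not journal.seen["approval"]:
        return "declined"

    def charge():
        charges.append(quote["amount"])  # the irreversible side effect
        return "charged"

    return journal.observe("charge", charge)


path = os.path.join(tempfile.mkdtemp(), "journal.jsonl")
charges = []
try:
    agent(Journal(path), charges)  # runs until it needs approval, then parks
    parked = False
except Parked:
    parked = True

Journal(path).record("approval", True)  # the human clicks "approve", possibly days later
result = agent(Journal(path), charges)  # fresh process: reloads the journal, charges once
again = agent(Journal(path), charges)   # a later restart does not charge again
```

This toy still has the gap described earlier: a crash after the charge but before `record` returns would repeat the charge on restart, which is why the provider-side idempotency key remains necessary.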