The bug report is always the same screenshot: two identical emails, or two charges thirty seconds apart, and logs that show one user action. Someone added durable execution that week, or turned on a checkpointer, specifically to make the agent more reliable. They're now staring at evidence that it made things worse, and they don't yet know why.

They will, because the why is structural. The reliability feature they added is at-least-once replay, and at-least-once is exactly what double-sends an email.

Exactly-once is a thing you cannot buy

Start with the impossibility, because everything downstream is a workaround for it. You cannot have exactly-once delivery across an unreliable network. A request goes out, the server processes it, and the acknowledgment is lost in transit — so the caller can't tell "it worked but the reply died" from "it never arrived." This is the Two Generals Problem, and no protocol talks its way out of it. Stripe's idempotency post opens with the flat version of this: "Networks are unreliable."

So you stop chasing exactly-once delivery and chase exactly-once effect. The achievable thing is at-least-once delivery plus an idempotent consumer, which together produce what the literature calls effectively-once. The message may arrive five times; the charge happens once. Note the shape of that deal: at-least-once is the floor you're given, and idempotency is the part you build. Nobody ships it to you.
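A minimal sketch of that deal, assuming every message carries a stable ID and that the consumer can remember which IDs it has already applied. The class and field names here are hypothetical, and the in-memory set stands in for a durable store that would need to be updated atomically with the effect in a real system.

```python
from dataclasses import dataclass, field


@dataclass
class ChargeConsumer:
    """Applies each charge message at most once, keyed by message ID."""
    seen: set = field(default_factory=set)        # durable storage in real use
    charges: list = field(default_factory=list)   # stand-in for the real side effect

    def handle(self, message_id: str, customer: str, amount_cents: int) -> bool:
        """Return True if the charge was applied, False if it was a duplicate."""
        if message_id in self.seen:
            return False
        self.seen.add(message_id)
        self.charges.append((customer, amount_cents))
        return True


consumer = ChargeConsumer()
# At-least-once delivery: the same message shows up five times.
results = [consumer.handle("msg-1", "cust-42", 1999) for _ in range(5)]
print(results)           # [True, False, False, False, False]
print(consumer.charges)  # [('cust-42', 1999)]
```

Delivery stays at-least-once here; only the consumer changed.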

Durable execution hands you the floor, not the ceiling

Here is the part teams get backwards. Durable execution and checkpointing do not reduce duplication. Their entire value proposition is to re-run a step that may have already run, because after a crash they can't know whether the step completed before the worker died — the lost-acknowledgment problem again, one layer down.

Temporal is honest about this. Its docs state activities follow an at-least-once execution model: if a worker runs an activity successfully but crashes before reporting back, the activity is retried. The guidance that follows isn't a footnote — it's the contract: activities "should be designed to be safely executed multiple times without causing unexpected or undesired side effects." The engine guarantees your step runs. It explicitly does not guarantee it runs once. (Which engine to pick is the separate Temporal vs Inngest vs Restate question.)
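To make the crash-before-report case concrete, here is a toy retry loop. It is not Temporal's SDK: `worker_attempt`, `run_activity`, and the crash plan are hypothetical names, and the sketch only imitates the behavior described above, where an attempt that finished its work but never reported back gets run again.

```python
class WorkerCrashed(Exception):
    """The worker died before it could report the activity's result."""


effects = []  # stand-in for the outside world


def activity():
    effects.append("email sent")


def worker_attempt(crash_before_report: bool) -> str:
    activity()                    # the side effect happens first
    if crash_before_report:
        raise WorkerCrashed()     # completion is never recorded
    return "completed"


def run_activity(crash_plan) -> int:
    """Retry until an attempt reports back; crash_plan says which attempts crash."""
    for attempt, crashes in enumerate(crash_plan, start=1):
        try:
            worker_attempt(crashes)
            return attempt
        except WorkerCrashed:
            continue              # the engine cannot tell whether the work happened
    raise RuntimeError("retries exhausted")


attempts = run_activity([True, False])
print(attempts, effects)  # 2 ['email sent', 'email sent']
```

From the engine's side the run looks correct: one activity, eventually completed. The duplicate is only visible in the outside world.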

Durable execution doesn't make the duplicate-email bug less likely. It makes it more likely, because replaying the step that sends the email is the feature, not the failure.

LangGraph makes the trap visible in source code. Call interrupt(...) inside a node and resume it, and the node does not continue from the next line: on resume, execution "starts at the beginning of the node." Every statement before the interrupt runs again. If your send_email sits above the interrupt, the human approves once and the customer receives two emails. This is the same realization that separates checkpointing from real durable execution: persisting state and re-entering a node is precisely what re-fires the side effect.
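The re-entry behavior can be imitated without the library. The runner below is a toy, not LangGraph's API: `make_interrupt`, `run_with_approval`, and both node functions are hypothetical, and the only thing the sketch assumes is the documented behavior that a resumed node starts again from its first line.

```python
class Interrupt(Exception):
    """Raised by a node to pause and wait for human input."""


sent = []  # stand-in for the outside world


def send_email(to):
    sent.append(to)


def make_interrupt(resume_value):
    """Build an interrupt() that pauses when no resume value exists, else returns it."""
    def interrupt(prompt):
        if resume_value is None:
            raise Interrupt(prompt)
        return resume_value
    return interrupt


def unsafe_node(interrupt):
    send_email("customer@example.com")      # runs on first entry and again on resume
    return interrupt("Approve this email?")


def safe_node(interrupt):
    approved = interrupt("Approve this email?")
    if approved:
        send_email("customer@example.com")  # runs only once a resume value exists
    return approved


def run_with_approval(node):
    """Toy runner: the first entry pauses, the resume re-enters the node from the top."""
    try:
        node(make_interrupt(None))
    except Interrupt:
        pass
    return node(make_interrupt(True))


run_with_approval(unsafe_node)
unsafe_count = len(sent)
sent.clear()
run_with_approval(safe_node)
safe_count = len(sent)
print(unsafe_count, safe_count)  # 2 1
```

Moving the side effect below the interrupt fixes this one path. It does nothing for a crash that happens after the send, which is why the call still needs a key.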

The agent adds a second duplicator

Backend engineers have fought at-least-once for decades. What's new — and what makes this a genuinely different problem for agent developers — is that the model is a second, independent source of duplication stacked on top of the network.

A normal retry loop re-sends a request because the network dropped the reply. An agent loop does that too, but it can also re-emit the tool call from the top: feed the conversation back and the model, not seeing a tool result it trusts, may call send_email again — a semantic duplicate, not a network one. Now you have two duplication sources producing identical-looking side effects for different reasons, and dedup that lives in your retry library catches only one. Anthropic's tool-use model is clear that the model emits the request and your code executes it; the model never sends the email itself. That's the opening: the dedup has to live below the model, in the tool, where both kinds of duplicate funnel through one door.

The fix is decades old: key it before you call

The pattern that works is Stripe's, and it predates agents entirely. The caller generates an idempotency key, sends it with the request, and the server saves the result of the first request under that key — then replays that saved response for every retry carrying the same key, success or failure alike. Stripe scopes it to POST requests (GET and DELETE are idempotent by definition) and suggests a UUIDv4.

The one detail that matters more than the rest: the key must be attached before the side effect, not reconstructed after the crash. A key minted fresh on each attempt defeats the entire mechanism — every retry looks new. So for agents, you do not generate a random key per call. You derive it deterministically from the semantic content of the request. Temporal's own recipe is to combine the Workflow Run ID with the Activity ID, giving a key that is constant across retries of the same logical step but unique across runs. Port that to a tool: hash the meaningful inputs — recipient, intent, the run-and-turn identity — into the key. Then a network retry and a model re-emission both produce the same key, and the downstream service collapses them into one effect. The model can ask twice; the email leaves once.
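One way to sketch the derivation, under stated assumptions: `run_id` and `turn` are hypothetical identifiers for the logical step (not the attempt), the hashed fields are whatever your tool treats as meaningful, and `FakeMailer` is an in-memory stand-in for a downstream service that honors idempotency keys. None of these names come from a real SDK.

```python
import hashlib
import json


def idempotency_key(run_id: str, turn: int, tool: str, args: dict) -> str:
    """Derive a stable key from run/turn identity plus the tool's meaningful inputs."""
    payload = json.dumps(
        {"run": run_id, "turn": turn, "tool": tool, "args": args},
        sort_keys=True,              # argument order must not change the hash
        separators=(",", ":"),
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


class FakeMailer:
    """Stand-in for a downstream service that honors idempotency keys."""

    def __init__(self):
        self.results = {}   # key -> saved response
        self.outbox = []    # emails that actually left

    def send(self, key: str, to: str, subject: str) -> dict:
        if key in self.results:
            return self.results[key]         # replay the saved response
        self.outbox.append((to, subject))
        self.results[key] = {"status": "sent", "id": len(self.outbox)}
        return self.results[key]


mailer = FakeMailer()


def send_email_tool(run_id: str, turn: int, to: str, subject: str) -> dict:
    """The tool attaches the key before the side effect, on every call."""
    args = {"to": to, "subject": subject}
    key = idempotency_key(run_id, turn, "send_email", args)
    return mailer.send(key, to, subject)


first = send_email_tool("run-7", 3, "a@example.com", "Your receipt")  # original call
again = send_email_tool("run-7", 3, "a@example.com", "Your receipt")  # retry or re-emission
other = send_email_tool("run-8", 1, "a@example.com", "Your receipt")  # a different run
print(len(mailer.outbox))  # 2
```

The key is computed inside the tool, so neither the retry library nor the model has to know it exists. What counts as the same turn is the design decision that matters: too narrow and a re-emission gets a fresh key, too broad and a legitimate second email is swallowed.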

Two structural moves make it robust. Prefer natural idempotency where the API offers it: a PUT to a known resource ID is safe to repeat in a way an unkeyed POST is not, so design the tool around upsert-by-id instead of create-on-call. For operations that resist that, use reserve-then-confirm: a first idempotent call stakes a claim keyed to the request, a second confirms it, and a replay of either step lands on the same row instead of a new one.
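Reserve-then-confirm can be sketched with a dictionary standing in for the table. `PaymentLedger`, its method names, and the key format are hypothetical; a real version would rely on a unique constraint on the request key, not an in-process dict.

```python
class PaymentLedger:
    """In-memory stand-in for a table with a unique request key."""

    def __init__(self):
        self.rows = {}  # request_key -> {"amount": int, "state": str}

    def reserve(self, request_key: str, amount_cents: int) -> dict:
        # setdefault inserts only when the key is absent, so a replay gets the same row.
        return self.rows.setdefault(
            request_key, {"amount": amount_cents, "state": "reserved"}
        )

    def confirm(self, request_key: str) -> dict:
        row = self.rows[request_key]   # KeyError if nothing was reserved
        row["state"] = "confirmed"     # confirming twice changes nothing
        return row


ledger = PaymentLedger()
key = "run-7:turn-3:charge"
for _ in range(3):                     # replays of step one
    ledger.reserve(key, 1999)
for _ in range(2):                     # replays of step two
    ledger.confirm(key)
late = ledger.reserve(key, 1999)       # a late replay of step one
print(len(ledger.rows), late["state"])  # 1 confirmed
```

Both steps are keyed by the same request identity, so the order and number of replays stop mattering: every path ends on the one row.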

None of this requires giving up durable execution. It requires understanding what durable execution actually sold you — a reliable re-run — and putting the key in place before the re-run can hurt. The replay was never the bug. The unkeyed call underneath it was.