Ask a database to undo a half-finished change and it obliges: ROLLBACK, and the rows you touched snap back as if nothing happened. Ask an AI agent the same thing after it has booked the flight, charged the card, and then failed to reserve the hotel, and there is no verb for it. The flight is booked. The money moved. The world does not have a rollback log.
This is the gap under every "agent that takes actions in production." Retries and timeouts get all the attention, and they matter (handling API-level failures is the neighboring problem), but they answer the wrong question. Retries ask: how do I make this step happen? The harder question is: what about the steps that already did? When an agent strings together three or four side-effecting tool calls and one in the middle fails, you are not left with just an error. You are left with a partially changed world and no way to put it back.
Borrow the answer from distributed systems
Microservices hit this wall a decade ago, and the fix has a name: the saga, first described by Garcia-Molina and Salem in 1987 and made famous by Chris Richardson's microservices patterns. A saga replaces one impossible distributed transaction with a sequence of local ones, each paired with a compensating action — a defined, business-level undo. Reserve inventory; its compensation releases it. Draft an email; its compensation deletes the draft. If any step fails, the saga runs the compensations for everything that already succeeded, in reverse order, until the world is consistent again.
The non-obvious part is that "undo" here is semantic, not literal. You don't roll the database back; you take a new action whose effect cancels the old one. Temporal's own framing is blunt: every step includes an undo, and on failure the compensations run backward. For an agent, this means each tool needs a twin — book_flight ships alongside cancel_flight, send_invoice alongside void_invoice — and the agent's harness records which forward actions committed so it knows which twins to fire.
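The forward/twin pairing and the reverse walk can be sketched in a few lines. This is a minimal illustration, not any framework's API: `Step`, `run_saga`, and the travel tools are hypothetical names, and the tools only append to a list so the order of effects is visible.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Step:
    name: str
    action: Callable[[], None]      # forward tool call, e.g. book_flight
    compensate: Callable[[], None]  # its twin, e.g. cancel_flight


def run_saga(steps: List[Step]) -> bool:
    """Run steps in order; on failure, compensate committed steps in reverse."""
    committed: List[Step] = []
    for step in steps:
        try:
            step.action()
        except Exception:
            # Undo only what actually committed, newest first.
            for done in reversed(committed):
                done.compensate()
            return False
        committed.append(step)
    return True


# Hypothetical tools that just record what happened.
events: List[str] = []


def reserve_hotel() -> None:
    raise RuntimeError("hotel API unavailable")


trip = [
    Step("flight", lambda: events.append("book_flight"),
         lambda: events.append("cancel_flight")),
    Step("car", lambda: events.append("rent_car"),
         lambda: events.append("cancel_car")),
    Step("hotel", reserve_hotel, lambda: events.append("cancel_hotel")),
]

ok = run_saga(trip)
print(ok, events)
# False ['book_flight', 'rent_car', 'cancel_car', 'cancel_flight']
```

The failed hotel step is never compensated because it never committed. The sketch assumes compensations themselves succeed; a real harness would need to retry or escalate when a twin fails.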
An agent's tool order is not a convenience. It is a correctness property.
The pivot is where the design actually lives
Here is the rule most teams miss. Saga theory splits steps into three kinds, and the split is an ordering law:
- Compensatable transactions can be undone. Do all of them first.
- The pivot is the one irreversible commit — the point of no return. Charging a card, sending a wire, publishing a post. You get exactly one, and it goes as late as possible.
- Retriable transactions come after the pivot. Because the pivot succeeded, the system is committed to finishing forward, so these must be built to eventually succeed (read: idempotent) and must never be allowed to fail the saga.
Translate that to agents and the design rule writes itself: do everything reversible first, place the single unrecoverable action last, and put nothing risky after it. Most agent frameworks do the opposite — they hand the model a flat toolbox and let it choose order freely. So the LLM is free to charge the customer in step two and then trip over a flaky calendar API in step four, leaving you with money taken for a booking that never completed. The irreversibility didn't change; the position did, and position was the whole game.
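One way to stop a model from choosing a bad order is to validate the proposed plan before anything runs. The sketch below assumes each tool has been tagged by hand with one of the three kinds; `Kind`, `check_order`, and the tool names are hypothetical.

```python
from enum import Enum
from typing import List, Tuple


class Kind(Enum):
    COMPENSATABLE = 1
    PIVOT = 2
    RETRIABLE = 3


def check_order(plan: List[Tuple[str, Kind]]) -> List[str]:
    """Return ordering violations; an empty list means the plan is acceptable."""
    problems: List[str] = []
    pivots = [name for name, kind in plan if kind is Kind.PIVOT]
    if len(pivots) > 1:
        problems.append(f"more than one pivot: {pivots}")
    seen_pivot = False
    for name, kind in plan:
        if kind is Kind.PIVOT:
            seen_pivot = True
        elif kind is Kind.COMPENSATABLE and seen_pivot:
            problems.append(f"{name} is compensatable but runs after the pivot")
        elif kind is Kind.RETRIABLE and not seen_pivot:
            problems.append(f"{name} is retriable but runs before the pivot")
    return problems


good = [
    ("hold_seat", Kind.COMPENSATABLE),
    ("hold_room", Kind.COMPENSATABLE),
    ("charge_card", Kind.PIVOT),
    ("send_receipt", Kind.RETRIABLE),
]
bad = [
    ("hold_seat", Kind.COMPENSATABLE),
    ("charge_card", Kind.PIVOT),
    ("book_calendar", Kind.COMPENSATABLE),
]

print(check_order(good))  # []
print(check_order(bad))   # ['book_calendar is compensatable but runs after the pivot']
```

A harness could reject or reorder a plan that returns any violations, so the ordering law is enforced by code rather than left to the prompt.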
Idempotency and compensation are two different halves
It's tempting to think you've covered this with idempotency keys. You haven't. Idempotency protects against doing the same thing twice; compensation protects against being unable to undo a thing you did once. They fix opposite failures. The danger is real: a recent survey of tool-using agents notes that after a checkpoint restore, an LLM re-synthesizes a subtly different request, so the downstream service treats it as new — duplicate payments, reused credentials — and no surveyed framework enforced exactly-once at the tool boundary. Idempotency keys kill the duplicate. They do nothing for the orphaned booking when a later step dies. You need both, and they are not the same line of code.
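The duplicate half of the problem can be sketched as follows. The assumption here is that the idempotency key is derived from orchestrator state (a saga ID and step name) rather than from the model's request text, so a re-synthesized request still maps to the same key. `PaymentService` is a hypothetical stand-in for a downstream API that deduplicates on that key.

```python
import hashlib
from typing import Dict, Tuple


class PaymentService:
    """Hypothetical downstream service that deduplicates on an idempotency key."""

    def __init__(self) -> None:
        self.charges: Dict[str, Tuple[int, str]] = {}

    def charge(self, key: str, amount_cents: int, memo: str) -> Tuple[int, str]:
        # A repeated key returns the original charge instead of creating a new one.
        if key not in self.charges:
            self.charges[key] = (amount_cents, memo)
        return self.charges[key]


def idempotency_key(saga_id: str, step: str) -> str:
    # Built from stable orchestrator state, not from the model's wording.
    return hashlib.sha256(f"{saga_id}:{step}".encode()).hexdigest()


svc = PaymentService()
key = idempotency_key("trip-42", "charge_card")

first = svc.charge(key, 45000, "Trip to Lisbon")
# After a checkpoint restore the model words the request differently.
replay = svc.charge(key, 45000, "Lisbon trip, 2 travelers")

print(len(svc.charges))  # 1
print(replay == first)   # True
```

Nothing in this sketch helps if a later step fails after the charge lands; that is the compensation half, and it needs its own code path.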
Keep the saga out of the model
The last mistake is letting the LLM run the recovery. It can't. The model is stateless across the failure and re-plans on every turn, so "remember to cancel the flight you booked four steps ago" is exactly the kind of bookkeeping it drops. The commit/compensate log belongs in a durable orchestrator that survives crashes and owns the state machine: the same layer you'd reach for when choosing between checkpointing and durable execution, and the one durable-agent runtimes provide. The model proposes the next action; the orchestrator records it, executes it, and, when something downstream breaks, walks the compensation stack backward without asking the model's permission. IBM's research prototype of an undo-and-retry agent makes the same bet: an explicit undo operator per action, owned by the system, not the reasoning.
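A toy version of that orchestrator-owned log might look like the sketch below. It assumes a single process and a local JSON-lines file as the durable store; `Journal`, `recover`, and the step names are hypothetical, and a production runtime would use a real durable-execution engine rather than a file.

```python
import json
import os
import tempfile
from typing import Callable, Dict, List


class Journal:
    """Append-only log of committed steps, synced to disk on every write."""

    def __init__(self, path: str) -> None:
        self.path = path

    def record(self, step: str) -> None:
        with open(self.path, "a", encoding="utf-8") as f:
            f.write(json.dumps({"step": step}) + "\n")
            f.flush()
            os.fsync(f.fileno())

    def committed(self) -> List[str]:
        if not os.path.exists(self.path):
            return []
        with open(self.path, encoding="utf-8") as f:
            return [json.loads(line)["step"] for line in f if line.strip()]


def recover(journal: Journal,
            compensations: Dict[str, Callable[[], None]]) -> List[str]:
    """Compensate every journaled step, newest first; return the order used."""
    undone: List[str] = []
    for step in reversed(journal.committed()):
        compensations[step]()
        undone.append(step)
    return undone


path = os.path.join(tempfile.mkdtemp(), "saga.jsonl")

# First "process": two forward steps commit, then it crashes.
before_crash = Journal(path)
before_crash.record("book_flight")
before_crash.record("hold_room")

# Second "process": no model context survives, only the file.
log: List[str] = []
undone = recover(Journal(path), {
    "book_flight": lambda: log.append("cancel_flight"),
    "hold_room": lambda: log.append("release_room"),
})
print(undone)  # ['hold_room', 'book_flight']
print(log)     # ['release_room', 'cancel_flight']
```

Two gaps are left open on purpose. The sketch records a step after it runs, so a crash between the tool call and the write would lose it; logging intent before execution closes that window. It also does not mark compensations as finished, so running recovery twice repeats them, which is one more reason compensations should be idempotent.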
A saga is not a safety net you bolt on after a bad demo. It's a state machine that guarantees one of two outcomes: the business process completes, or its partial work is semantically undone. Decide which of your agent's tools can be taken back, order them so the one that can't goes last, and give the rest a twin. The agent still can't say ROLLBACK. But you can build the thing that means it.



