A team wires up a LangGraph agent, adds a Postgres checkpointer because the docs told them to, watches a thread survive a manual restart, and concludes the agent is now crash-proof. It isn't. The demo that convinced them (kill the process, resume the thread, watch it continue) is real, but it tests the easy half of the problem and hides the hard half. Checkpointing and durable execution are related but not the same thing, and the gap between them is exactly where production agents quietly double-charge a card.
What a checkpoint actually saves
LangGraph's persistence layer snapshots graph state at node boundaries. After a node returns, the channel values are written to the checkpointer; on resume, the graph reloads the last snapshot and continues to the next node. That is genuinely useful — it's what powers time travel, human-in-the-loop pauses, and resuming after a clean shutdown.
The word doing the hiding is boundaries. The checkpoint is the state between nodes, not a journal of what happened inside one. So when a node crashes halfway through — after it called the model, after it issued the refund, before it returned — there is no snapshot of that half-finished node. On resume, LangGraph re-enters the node and runs it again from the top. LangGraph's own durable-execution guide says it plainly: assume nodes re-execute on resume. Every side effect before the crash point fires a second time.
A checkpointer guarantees you can resume the graph. It does not guarantee you resume the node. The difference is one duplicate LLM call, one duplicate tool write, one duplicate charge.
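To make the boundary problem concrete, here is a minimal sketch in plain Python. It is a toy model of boundary-only checkpointing, not LangGraph's API: the names checkpoint_store, run, plan, and refund are hypothetical, and the crash is simulated with an exception. The snapshot is written only after a node returns, so a crash inside refund leaves no record that the refund already went out.

```python
# Toy model of boundary-only checkpointing; not LangGraph's API.

class Crash(Exception):
    """Simulates the process dying partway through a node."""

checkpoint_store = {}          # thread_id -> (index of next node, saved state)
refunds_issued = []            # stands in for an external payment API
crash_once = {"armed": True}   # makes the first attempt fail, the second succeed

def plan(state):
    return {**state, "amount": 25}

def refund(state):
    refunds_issued.append(state["amount"])   # the side effect fires here
    if crash_once["armed"]:                  # then the process dies
        crash_once["armed"] = False
        raise Crash("died after the refund, before the node returned")
    return {**state, "refunded": True}

NODES = [plan, refund]

def run(thread_id, initial_state):
    # Resume from the last boundary snapshot if one exists.
    index, state = checkpoint_store.get(thread_id, (0, initial_state))
    while index < len(NODES):
        state = NODES[index](state)
        index += 1
        # Written only after a node returns, never mid-node.
        checkpoint_store[thread_id] = (index, state)
    return state

try:
    run("thread-1", {})
except Crash:
    pass

final_state = run("thread-1", {})   # resume re-enters refund() from the top
print(refunds_issued)               # the same refund is recorded twice
```

The resume works exactly as advertised: plan is skipped, the graph finishes, and the customer has been refunded twice.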
The two things teams miss
The first is non-idempotency. If a node embeds a payment, an email, or a database insert and then crashes later in the same node, resume replays that effect. LangGraph's answer is the @task primitive: wrap the side effect in a task and its result is recorded, so a replay returns the cached value instead of calling the API again. The companion lever is the durability mode. LangGraph exposes three: "exit" (checkpoint only when the graph finishes; fastest, but no crash recovery mid-run), "async" (persist while the next step runs, with a small window to lose a checkpoint on crash), and "sync" (persist before the next step starts). Production wants "sync"; the default, "async", trades durability for speed, and most teams never change it.
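The idea of recording a side effect's result so a replay can reuse it can be sketched with a dictionary and a decorator. This is a toy illustration in plain Python, not LangGraph's @task implementation: recorded_task, task_results, and charge_card are hypothetical names, and the in-memory dictionary stands in for what would have to be durable storage.

```python
# Toy model of recording a side effect's result; not LangGraph's @task.
import functools

task_results = {}   # (thread_id, task name, args) -> recorded result
charges = []        # stands in for an external payment API

def recorded_task(fn):
    """Run fn at most once per (thread_id, args); replays return the recorded result."""
    @functools.wraps(fn)
    def wrapper(thread_id, *args):
        key = (thread_id, fn.__name__, args)
        if key not in task_results:
            task_results[key] = fn(*args)   # first execution: call, then record
        return task_results[key]            # replay: no second call
    return wrapper

@recorded_task
def charge_card(amount):
    charges.append(amount)
    return f"receipt-{len(charges)}"

def billing_node(thread_id, crash_after_charge):
    receipt = charge_card(thread_id, 40)
    if crash_after_charge:
        raise RuntimeError("process died after the charge")
    return receipt

try:
    billing_node("thread-1", crash_after_charge=True)
except RuntimeError:
    pass

# The whole node runs again, but the charge is served from the record.
receipt = billing_node("thread-1", crash_after_charge=False)
```

Even in this sketch a gap remains: a crash after the API call but before the result is recorded would still replay the call, which is why an idempotency key on the external API is a sensible second line of defense.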
The second is concurrency. A checkpointer keys state by thread_id, but nothing in it prevents two workers from picking up the same thread_id and resuming it simultaneously. As Diagrid argued in its widely shared 2026 piece, that leaves you to build distributed locking and lease coordination yourself, which is the unglamorous infrastructure a real execution engine is supposed to own. A single-process prototype never sees this. A horizontally scaled deployment sees it the first busy afternoon.
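A rough sketch of the lease coordination described above shows what "build it yourself" involves. Everything here is an assumption for illustration: LeaseTable is a hypothetical in-process class guarded by a threading.Lock, whereas a real deployment would need a shared store with an atomic compare-and-set, plus lease renewal and fencing, none of which is shown.

```python
# Toy in-process lease table; a real system needs a shared, atomic store.
import threading
import time

class LeaseTable:
    def __init__(self):
        self._guard = threading.Lock()
        self._leases = {}   # thread_id -> (owner, expiry timestamp)

    def acquire(self, thread_id, owner, ttl_seconds, now=None):
        """Grant the lease unless another owner holds an unexpired one."""
        now = time.monotonic() if now is None else now
        with self._guard:
            holder = self._leases.get(thread_id)
            if holder is not None and holder[0] != owner and holder[1] > now:
                return False   # someone else holds a live lease
            self._leases[thread_id] = (owner, now + ttl_seconds)
            return True

    def release(self, thread_id, owner):
        """Drop the lease, but only if the caller still owns it."""
        with self._guard:
            holder = self._leases.get(thread_id)
            if holder is not None and holder[0] == owner:
                del self._leases[thread_id]

leases = LeaseTable()
# Explicit timestamps keep the example deterministic.
first = leases.acquire("thread-1", "worker-a", ttl_seconds=30, now=0)
second = leases.acquire("thread-1", "worker-b", ttl_seconds=30, now=10)        # refused
after_expiry = leases.acquire("thread-1", "worker-b", ttl_seconds=30, now=31)  # lease lapsed
```

A worker would call acquire before resuming a thread and skip the thread when it returns False. The expiry is what keeps a crashed worker from holding a thread forever.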
Where Temporal draws the line
Durable-execution engines attack the problem from the other end. Temporal journals every step to an Event History and, on recovery, replays your workflow code against that history to reconstruct exactly where it stopped — down to the individual step, not the node boundary. The price of that guarantee is determinism: replayed code must produce the same decisions, so anything non-deterministic — an LLM call, a tool request, an HTTP call — has to live in an Activity, which runs outside the replay path and retries automatically. The constraint that feels annoying is the same constraint that makes "exactly where it stopped" possible.
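The journal-and-replay mechanism can be illustrated with a toy in plain Python. This is not Temporal's SDK: WorkflowContext, activity, and event_history are hypothetical names, and the list stands in for a persisted history. Non-deterministic calls go through ctx.activity, which records each result; on recovery the same workflow code runs again, and recorded steps are answered from the history instead of being executed.

```python
# Toy model of journal-and-replay; not Temporal's SDK.

class Crash(Exception):
    """Simulates a worker dying between two steps."""

event_history = []   # recorded activity results, in order
model_calls = []     # stands in for non-deterministic LLM calls
refunds = []         # stands in for an external payment API

class WorkflowContext:
    def __init__(self, history):
        self._history = history
        self._cursor = 0

    def activity(self, fn, *args):
        if self._cursor < len(self._history):
            result = self._history[self._cursor]   # replay: reuse the recorded result
        else:
            result = fn(*args)                     # first execution: run, then record
            self._history.append(result)
        self._cursor += 1
        return result

def call_model(prompt):
    model_calls.append(prompt)
    return "refund 25"

def issue_refund(amount):
    refunds.append(amount)
    return "ok"

def workflow(ctx, crash_before_refund):
    decision = ctx.activity(call_model, "what should we do?")
    amount = int(decision.split()[1])   # deterministic workflow code
    if crash_before_refund:
        raise Crash("worker died between steps")
    return ctx.activity(issue_refund, amount)

try:
    workflow(WorkflowContext(event_history), crash_before_refund=True)
except Crash:
    pass

# Recovery: the model call is replayed from history, then the refund runs once.
result = workflow(WorkflowContext(event_history), crash_before_refund=False)
```

The replay only lines up if the workflow code between activities makes the same decisions each time, which is the determinism requirement in miniature.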
This reframes the choice. It was never "LangGraph or Temporal." LangGraph models the agent's reasoning — the nodes, the edges, the state machine that decides what to do next. Temporal models the execution — surviving the crash, the retry, the concurrent worker. The official Temporal LangGraph integration runs both: it supports the Graph API and the Functional API, and each node or task declares whether it runs as a Temporal Activity (with timeouts and retries) or inline in the workflow (where it must stay deterministic). Temporal owns durability, so you drop the third-party checkpointer entirely. The one sharp caveat: LangGraph's in-memory Store is unavailable inside Activity-wrapped nodes, because live graph state can't cross the Activity boundary — so memory that needs to survive has to be modeled as durable state, not stashed on the node.
The decision
If your agent runs for seconds, fails rarely, and never has two workers fighting over a thread, a "sync" checkpointer with @task around the dangerous calls is enough, and reaching for a cluster is over-engineering. The moment runs stretch to minutes or hours, carry real side effects, or scale past one worker — the same threshold that makes human-in-the-loop a state problem — checkpointing stops being durability and starts being a story you tell yourself about durability. Knowing which side of that line you're on is the whole decision. The checkpointer was never lying to you; it was only ever answering the easy question.



