The first time you wire a human approval into an agent, you reach for a modal dialog. The agent wants to issue a refund; you pop a confirm button; the human clicks; the refund goes through. It works in the demo. Then it goes to production, the approver takes four hours to respond, your server redeploys in the meantime, and the agent that was "waiting" simply evaporates — along with the refund it was about to issue and any memory that it was mid-task.
That failure is the whole lesson. Human-in-the-loop is not a UI feature. It is a state-persistence problem wearing a UI costume. The button is trivial. The hard part is that an agent must stop, hold its exact position — which tool, which arguments, which step in a multi-call plan — for an interval you don't control, and then resume as if nothing happened. That is precisely the requirement durable-execution systems were built for: pausing for an arbitrary duration and surviving the gap is the same engineering whether the gap is a human's lunch break or a worker crash.
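To make "hold its exact position" concrete, here is a minimal, framework-free sketch of what has to be written down before the agent stops. Every name in it (PendingAction, save_pending, resume, the one-JSON-file-per-run layout) is hypothetical and chosen for illustration; a real system would use a database rather than a temp directory.

```python
import json
import tempfile
from dataclasses import dataclass, asdict
from pathlib import Path


@dataclass
class PendingAction:
    """Everything needed to pick the task back up after an arbitrary gap."""
    run_id: str
    step: int    # position in the multi-call plan
    tool: str    # which tool the agent wants to call
    args: dict   # the exact arguments it proposed


def save_pending(store: Path, action: PendingAction) -> None:
    """Persist the pause point; after this, the process is free to die."""
    (store / f"{action.run_id}.json").write_text(json.dumps(asdict(action)))


def resume(store: Path, run_id: str, approved: bool) -> str:
    """Reload the pause point, possibly in a different process, and continue."""
    path = store / f"{run_id}.json"
    action = PendingAction(**json.loads(path.read_text()))
    path.unlink()  # the pause is consumed exactly once
    if not approved:
        return f"step {action.step}: {action.tool} rejected"
    return f"step {action.step}: ran {action.tool} with {action.args}"


# Hypothetical stand-in for a durable store.
store = Path(tempfile.mkdtemp())
save_pending(store, PendingAction("run-1", 3, "issue_refund", {"order": "A17", "amount": 40}))
# ... hours pass, the server redeploys, nothing is held in memory ...
outcome = resume(store, "run-1", approved=True)
```

Nothing in resume() depends on the object that called save_pending() still being alive, which is the property the modal dialog lacked.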
The four things a human actually does
Strip away the frameworks and HITL collapses to four interactions, all variations on pause → surface → resume: approve or reject a proposed tool call, edit the action or state before it runs, answer a question the agent asked, or review intermediate state and continue. The overwhelmingly common case is the first one — gating a sensitive, irreversible action. You don't pause an agent to double-check a search query; you pause it before it deletes the account.
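One way to see that the four interactions share a shape is to model the human's reply as a single payload applied to a paused proposal. This is only a sketch: Proposal, HumanResponse, and apply_response are made-up names, and the returned dictionaries are an assumed convention for "what the agent does next," not any framework's format.

```python
from dataclasses import dataclass
from typing import Literal, Optional


@dataclass
class Proposal:
    """The tool call the agent paused on."""
    tool: str
    args: dict


@dataclass
class HumanResponse:
    kind: Literal["approve", "reject", "edit", "answer", "continue"]
    edited_args: Optional[dict] = None  # used by "edit"
    text: Optional[str] = None          # used by "answer" and "reject"


def apply_response(proposal: Proposal, response: HumanResponse) -> dict:
    """Translate the human's reply into the agent's next step."""
    if response.kind == "approve":
        return {"next": "run_tool", "tool": proposal.tool, "args": proposal.args}
    if response.kind == "reject":
        return {"next": "tell_model", "message": response.text or "rejected by reviewer"}
    if response.kind == "edit":
        # Human-supplied values override the model's arguments.
        merged = {**proposal.args, **(response.edited_args or {})}
        return {"next": "run_tool", "tool": proposal.tool, "args": merged}
    if response.kind == "answer":
        return {"next": "tell_model", "message": response.text or ""}
    if response.kind == "continue":
        return {"next": "continue_plan"}
    raise ValueError(f"unknown response kind: {response.kind}")
```

Approve and reject are two outcomes of the same interaction, which is why five kinds cover four interactions.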
LangGraph says the quiet part out loud
LangGraph's design makes the thesis unavoidable. You pause with interrupt(value), which halts the graph and surfaces a value to the client, and you resume by invoking the graph with Command(resume="the human's decision"). The non-negotiable detail: interrupt() requires a checkpointer. Try to resume without one and the runtime doesn't warn or degrade; it raises RuntimeError("Cannot use Command(resume=...) without checkpointer"). The docs are blunt that the feature "relies on persisting graph state." The pause and the persistence are not two features that cooperate; they are one feature. There is no pausing without saving, because a pause you can't restore from isn't a pause, it's a memory leak.
This design comes with a brutal gotcha worth tattooing on your wrist: on resume, LangGraph re-runs the entire node from the top, re-executing all logic before the interrupt() call. So if your node charges a credit card and then calls interrupt() for approval, the human's approval re-enters the node and charges the card a second time. The fix is structural, not clever: put the interrupt() at the very top of the node, or push side effects into their own nodes, or make them idempotent. (If one node interrupts twice, resume values are matched by call order, which is another reason to keep nodes small.)
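The double charge is easy to reproduce without LangGraph at all. The sketch below uses a toy runtime (ToyRuntime, Interrupt, and pause_and_resume are all hypothetical names) that imitates only the one behavior described above: on resume, the node function is called again from its first line, and interrupt() returns the human's answer instead of pausing.

```python
class Interrupt(Exception):
    """Raised by the toy runtime to pause a node and surface a value."""

    def __init__(self, value):
        super().__init__(value)
        self.value = value


class ToyRuntime:
    """Imitates replay-from-the-top; not LangGraph's implementation."""

    def __init__(self):
        self._answers = []

    def resume(self, value):
        self._answers.append(value)

    def interrupt(self, value):
        if self._answers:              # resuming: hand back the stored answer
            return self._answers.pop(0)
        raise Interrupt(value)         # first pass: stop the node here


def buggy_node(rt, ledger):
    ledger.append("charge $25")        # side effect BEFORE the pause
    return rt.interrupt("approve $25?")


def safe_node(rt, ledger):
    approved = rt.interrupt("approve $25?")  # pause is the first statement
    if approved:
        ledger.append("charge $25")    # side effect only after approval
    return approved


def pause_and_resume(node, decision):
    ledger, rt = [], ToyRuntime()
    try:
        node(rt, ledger)               # first pass stops at interrupt()
    except Interrupt:
        pass
    rt.resume(decision)                # the human answers later
    node(rt, ledger)                   # the node re-runs from its first line
    return ledger


buggy_ledger = pause_and_resume(buggy_node, True)  # charged on both passes
safe_ledger = pause_and_resume(safe_node, True)    # charged once
```

The safe version works because the first pass never reaches the charge, so the replay has nothing to repeat.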
Same idea, four dialects
Every serious framework lands on the same architecture; they just disagree on ergonomics and on who owns the storage.
The OpenAI Agents SDK lets a tool declare @function_tool(needs_approval=True). When the agent hits it, the run pauses and RunResult.interruptions fills with ToolApprovalItem entries; you get a RunState from result.to_state(), call state.approve(...) or state.reject(...) on it, and resume with Runner.run(agent, state). Crucially, it separates short-lived approvals (same process) from long-running ones. For the latter you call RunState.to_json(), write the blob to your own durable store, and rehydrate with RunState.from_json() later, "potentially in a different process or after server restart." The persistence is explicit and yours.
Pydantic AI routes HITL through its general deferred tools machinery, which is the tell that matters: marking a tool requires_approval=True makes the run end with a DeferredToolRequests object, and you resume by passing back DeferredToolResults (ToolApproved, optionally with override_args to rewrite the model's arguments, or ToolDenied with a message). "Wait for a human" is mechanically identical to "wait for any slow external result" — they share one code path. That's not a shortcut; that's the correct mental model.
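The "one code path" claim can be sketched without the library. In the toy below, a run has ended with a list of deferred calls, and a single resume function accepts either a human decision or an externally computed result for each one. Call, Approve, Deny, run_tool, and resume_run are hypothetical names that loosely echo the idea; they are not Pydantic AI's classes or signatures.

```python
from dataclasses import dataclass
from typing import Any, Optional


@dataclass
class Call:
    """A tool call the run could not finish on its own."""
    call_id: str
    tool: str
    args: dict


@dataclass
class Approve:
    override_args: Optional[dict] = None  # the human may rewrite the arguments


@dataclass
class Deny:
    message: str = "denied by reviewer"


def run_tool(call: Call, args: Optional[dict] = None) -> str:
    """Stand-in for real tool execution; just reports what would run."""
    return f"{call.tool}({call.args if args is None else args})"


def resume_run(deferred: list, supplied: dict) -> dict:
    """Resolve every deferred call from the supplied decisions and results."""
    results: dict = {}
    for call in deferred:
        value: Any = supplied[call.call_id]
        if isinstance(value, Approve):
            results[call.call_id] = run_tool(call, value.override_args)
        elif isinstance(value, Deny):
            results[call.call_id] = value.message
        else:
            # Anything else is treated as a result computed elsewhere.
            results[call.call_id] = value
    return results


deferred = [
    Call("c1", "delete_account", {"user": "u42"}),  # waiting on a person
    Call("c2", "render_report", {"month": "05"}),   # waiting on a slow job
]
results = resume_run(deferred, {
    "c1": Approve(override_args={"user": "u43"}),
    "c2": "report.pdf ready",
})
```

The loop has no branch for "this one was a human"; an approval is just one more kind of value that arrives late.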
Temporal is the purest statement of the thesis, because it's a durable-execution platform first and an agent thing second. A workflow waits for a person by awaiting a condition (workflow.wait_condition) on state set by a @workflow.signal handler — a CFO clicking "approve" is just a Signal injected into a running workflow. Because Temporal persists its Event History to a database, the agent can wait "hours, days, or indefinitely" without consuming compute, and if a worker crashes mid-wait it replays history to reconstruct the exact pre-crash state, Signals included. It also hands you the thing the others make you build by hand: durable timers for "escalate if no human answers in 24 hours." This is the same durability axis the durable-agent frameworks compete on — HITL just falls out of it for free.
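Replay is the part that tends to sound like magic, so here is a toy event-sourcing model of it. This is not Temporal's API: Event, replay, ESCALATE_AFTER_S, and the event kinds are hypothetical, and the sketch assumes the history has already been recorded durably by something else. It shows only the principle that state is a pure function of the recorded events, signals and timer firings included.

```python
from dataclasses import dataclass

ESCALATE_AFTER_S = 24 * 60 * 60  # assumed policy: escalate after 24 hours


@dataclass(frozen=True)
class Event:
    at: float          # seconds since workflow start, as recorded in history
    kind: str          # "approval_requested", "timer_fired", or "signal"
    payload: str = ""


def replay(history):
    """Rebuild workflow state purely from recorded events, never the clock."""
    state = {"status": "running", "approver": None}
    for event in history:
        if event.kind == "approval_requested":
            state["status"] = "waiting"
            state["approver"] = event.payload
        elif event.kind == "timer_fired" and state["status"] == "waiting":
            state["status"] = "escalated"      # nobody answered in time
            state["approver"] = event.payload  # hand off to the next approver
        elif event.kind == "signal" and state["status"] in ("waiting", "escalated"):
            state["status"] = event.payload    # "approved" or "rejected"
    return state


history = [
    Event(0, "approval_requested", "manager"),
    Event(ESCALATE_AFTER_S, "timer_fired", "cfo"),
    Event(ESCALATE_AFTER_S + 3600, "signal", "approved"),
]

# A worker that crashes after the second event and restarts sees this:
state_after_crash = replay(history[:2])
# A worker that restarts after the signal arrived sees this:
final_state = replay(history)
```

Because replay() reads timestamps from the history rather than the wall clock, running it again after a crash produces the same state every time.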
What this means for your code
Pick your framework by where you want the state to live, because that's the only real decision. If you're happy letting the framework own a database-backed checkpoint, LangGraph's interrupt() is the least code. If you want to serialize the run yourself and stash it in your own store, the OpenAI Agents SDK's RunState is honest about that. If your agents already need durable execution for other reasons, do HITL in Temporal and stop thinking about it. The options differ along the same two axes the agent SDKs differ on everywhere else: ergonomics and who owns the storage.
But whichever you choose, design the approval as a resumable checkpoint, not a blocking call — and put nothing irreversible before the pause. The human clicking "approve" four hours late is not the edge case. It's the normal case, and the only agents that survive it are the ones that were never really waiting in memory at all.



