Write a test for an ordinary function and you assert that an input yields an output. Write the same test for an agent and the ground gives way: the model is non-deterministic by construction, so the "expected" value drifts run to run, and every execution bills you for live tokens and can fail because a provider hiccuped, not because your code broke. The instinct is to crank temperature to zero and pin a seed. It doesn't hold — a tool returns a new timestamp, the model picks a different phrasing, and the assertion shatters anyway.
The technique that actually makes agent tests stable is borrowed wholesale from a twenty-year-old idea: record and replay. Run the agent once against the real model and tools, capture that run to a file — a cassette — commit it, and on every later run replay the recording instead of calling out. VCR.py coined the cassette for HTTP; pytest-recording wraps it in a @pytest.mark.vcr() decorator that records on first run and replays forever after. Zero live calls, zero dollars, millisecond runs, identical every time. Problem solved.
Except it isn't one technique. It's a layering decision, and the layer you record at silently decides which bugs your suite can still catch.
Two layers, two different bugs#
You can record an agent at the network layer or at the decision layer, and they are not interchangeable.
The HTTP-cassette tools — VCR.py, Docker's cagent, the MCP-focused agent-vcr — record the raw request and response bytes on the wire. That's the cheap, provider-agnostic path, and for what it's built to do it's excellent: agent-vcr captures the JSON-RPC traffic between an MCP client and server into .vcr files, then offers agent-vcr diff --fail-on-breaking to fail CI the moment a server's contract drifts. But notice what a network recording freezes. It freezes everything that crossed the wire — which includes every tool's output. So on replay, your tool code never runs. Its result is served from the cassette.
That is the trap nobody says out loud. Introduce a bug in a tool — an off-by-one in the function that reads a file, a broken parse — and the HTTP-cassette test stays green, because the tool was never invoked. The recording answered for it.
The decision-layer tools invert the bet. langchain-replay records only the model's choices — which tool it called, with which arguments, and what text it produced — and on replay it re-executes your real tools. Its README is blunt about the distinction: HTTP cassettes "never let your tool code actually run, so tests stop reflecting reality," whereas decision replay "yields those recorded decisions while actually executing the tools." Now the off-by-one fails the test, because the file actually got read. The cost: the model was never truly called, so this layer is blind to provider drift, a serialization change, or a contract break — the exact thing agent-vcr's diff exists to catch.
A network cassette freezes the model AND your tools, and tests your wire. A decision cassette freezes only the model's choices, and tests your code. Pick the one that watches the bug you're afraid of.
The layer follows the bug class#
So the choice is not "fast vs. real" or "simple vs. precise." It's a question about what you are defending against. If your fear is cost, provider flakiness, and a third-party API or MCP server quietly changing its shape, record the HTTP layer — you want the wire pinned and a diff that screams when it moves. If your fear is that you will break your own agent — a regression in a tool, a routing bug, a prompt-assembly mistake — record decisions, because only then does your code execute under test. Many teams reach for whichever cassette library their language ships and never ask the question, then spend an afternoon baffled that a visibly broken tool ships green. This is the same boundary that separates evals from tests: evals judge whether the output is good; replay tests assert that a known-good run still reproduces. You want both, and you want to know which one you're writing.
The gotcha that breaks replay quietly#
Even once you've picked a layer, agents break the cassette assumption that ordinary HTTP testing never has to think about: the request isn't stable. A replay only works if the incoming request matches a recorded one — and agent requests are full of volatile junk. Every turn, the provider stamps a fresh random tool-call ID; the model echoes timestamps and nonces back in the body. Match the request byte-for-byte and every replay is a cache miss; match too loosely and you'll happily replay a stale answer for a prompt that changed. This is why cagent has to normalize tool-call IDs before matching — without it, OpenAI's randomly generated IDs would defeat replay on every run. The lesson generalizes: before record/replay is reliable, you must decide which fields of the request are identity and which are noise, and strip the noise from the match key.
One more line item, because cassettes get committed: scrub the secrets. cagent strips Authorization and X-Api-Key automatically; with VCR.py you wire up filter_headers yourself, or your API key rides into the repo inside a "safe" test fixture. A recorded run is a real run, frozen — handle it like one. Get the layer and the match key right, though, and the most expensive, flakiest tests in your suite become the cheapest and the steadiest — and the agent that can't be debugged live becomes one you can replay, frame by frame, on demand.



