Prompt Caching for AI Agents: Why Your Cache Keeps Missing

Prompt caching is the rare optimization that every major provider agrees on, advertises loudly, and gives you for a steep discount — and that most agents still fail to actually get. The discount is real: OpenAI bills cached input tokens at roughly half price, Google passes back up to a 90% saving on cache hits for its Gemini 2.5 models, and Anthropic reads cached tokens at one-tenth of the base input price. For an agent that resends a 30,000-token system prompt and tool catalog on every turn, that is the difference between a workload that pencils out and one that doesn't.

The reason it so often doesn't land is that all three implementations rest on the same unforgiving rule, and that rule is in direct tension with how an agent naturally assembles a prompt.

One rule, three vendors

Strip away the branding and the caching mechanism is identical across providers: it is a prefix match. The system hashes your prompt from the very first token forward and reuses computation up to the point where a request stops matching a previous one. Everything before the first differing byte is a cache hit; everything from that byte onward is recomputed at full price.

The providers differ mainly in how much ceremony they demand. OpenAI's caching is automatic on GPT-4o and newer — no parameter, no opt-in — and kicks in for prompts over 1,024 tokens. Google's implicit caching is likewise on by default for Gemini 2.5, with a 1,024-token floor on Flash and 2,048 on Pro. Anthropic makes you place the breakpoint yourself with an explicit cache_control marker, which is more work but buys you something the automatic systems hide: a precise read on what cached and what didn't.

Caching is not a setting you switch on. It is a property of how your prompt is assembled — and the assembly is usually written by someone who has never thought about byte offsets.

The economics have an edge that the marketing skips. Writing to the cache is not free. Anthropic charges 1.25× the base input price to write a 5-minute cache entry and 2× for the one-hour option; the read discount only arrives on the second hit. So a prefix you cache and then never reuse costs you more than not caching at all. Caching is a bet that you will see the same prefix again soon, and a one-shot request loses that bet every time.

Why agents lose the bet

Here is the friction. An agent's prompt is a living thing, and almost every natural way to make it lively breaks the prefix.

The timestamp in the system prompt. "Current time: 2026-06-21 14:32:07Z" at the top of your system message changes every single request. It sits in the prefix, so it invalidates the entire cache behind it. You are paying the write premium on every call and reading back nothing.
The tool set that varies per user. Tools render before the system prompt and the conversation. Add, remove, or reorder a tool — or serialize the tool list from an unordered map — and position zero changes, so nothing downstream can match.
The five-minute clock. Anthropic's default cache lives five minutes; the automatic systems have their own short, undocumented windows. An agent that pauses to wait on a human approval, a slow external API, or a queue can idle past the TTL and arrive at its next turn to a cold cache, paying the write tax a second time.

Each of these is invisible in normal operation. The request succeeds, the answer looks fine, and the bill is quietly 3–5× what it should be. The only tell is a single number — Anthropic surfaces it as cache_read_input_tokens, and the other providers report a cached_tokens count in usage. If that number is zero across requests that obviously share context, something upstream is mutating your prefix.

The architecture this implies

The useful reframe is that prompt caching turns prompt assembly order into a performance contract. The cheapest agent is the one whose prompt is effectively append-only: a frozen system prompt with no interpolated state, a tool list serialized in a deterministic order, and every volatile thing — the timestamp, the per-request ID, the user's latest message — pushed past the last cache breakpoint where it can change freely without disturbing what came before.

That has a second-order consequence worth sitting with. If a 50,000-token reference block reads back at one-tenth price as long as it stays byte-stable, the cost argument for retrieving smaller chunks instead of stuffing the whole thing in context gets weaker — but only for the agent disciplined enough to keep the block frozen. Caching rewards immutability, and immutability is not the default posture of a system designed to react.

None of this is exotic. It is mostly the discipline of not writing datetime.now() into a system prompt and not rebuilding the tool list per user. But it is discipline you have to design in before the first request, because the failure mode is silent: the discount you were promised simply never arrives, and nothing in the response tells you why. Check the cached-token count. It is the one line that says whether you are getting the deal everyone advertised.

Prompt Caching for AI Agents: Why Your Cache Keeps Missing

One rule, three vendors

Why agents lose the bet

The architecture this implies

Priya Sundaram

Dispatches from the machines, in your inbox

Prompt Caching for AI Agents: Why Your Cache Keeps Missing

One rule, three vendors

Why agents lose the bet

The architecture this implies

Priya Sundaram

Continue reading

The Price Fell. The Bill Rose. Both Numbers Are True.

The Megawatt You Cannot Rent

The Duty of Care Died Before Anyone Had to Meet It

Dispatches from the machines, in your inbox