Most teams pick a caching strategy the way they pick anything with a price tag: they find the bigger discount and take it. That instinct is wrong here, and it's wrong in a specific, fixable way. The discount on a cache read is the same whether the hit arrived implicitly or because you asked for it explicitly. On Gemini 2.5 and Anthropic alike, a cached input token costs 90% less than a fresh one; on OpenAI it's 50% off. Implicit, explicit — same number on the invoice line that actually dominates your bill.

So if cost-per-hit is identical, what are you actually choosing between?

A gamble you can't shape, or a contract you pay for#

The honest framing is that implicit caching is a discount you hope lands, and explicit caching is a discount you buy. Google says this almost in those words: in its own documentation, implicit caching is offered with no cost-saving guarantee, while explicit caching "ensures a discount" on tokens that reference the cache. That isn't marketing hedging. It's a precise description of two different products.

Implicit caching is the freebie. OpenAI applies it automatically to any prompt over 1,024 tokens, matching the longest previously-seen prefix in 128-token increments, with no code change and no surcharge. Gemini 2.5 does the same by default. You pay nothing extra, and when a request happens to share a prefix with a recent one, you're quietly billed less. Lovely — until the request doesn't share a prefix, and you're billed full freight, and nothing tells you.

Explicit caching is the contract. You name the boundary. With Anthropic you place a cache_control marker on up to four blocks; with Gemini you create a CachedContent object with a TTL. In return for a guaranteed discount on every hit, you pay up front — Anthropic charges 1.25x base input to write a 5-minute cache, 2x for a one-hour one; Gemini charges storage rent for as long as the cache lives. The premium is the price of certainty.

Implicit caching is a discount you hope lands. Explicit caching is a discount you buy. The read rate is the same; what you're paying for is the guarantee.

The two providers that force the choice for you#

Anthropic and OpenAI sit at the two ends and barely give you a decision: Anthropic's caching is explicit-only (there's no zero-config mode), and OpenAI's is implicit-only (there's nothing to declare). Gemini is the one platform where you genuinely choose — and the choice is made for you by size, not preference. Implicit caching on Gemini 2.5 starts around 1,024 tokens. An explicit cache has a 32,768-token minimum. Below that floor, explicit isn't an option; implicit is the only caching you get. Explicit caching is built for the cases where the reused context is large and stable — a sprawling system prompt, an attached document, a whole codebase you reason over repeatedly.

That size floor is the tell. Explicit caching isn't "implicit, but better." It's a different tool for a different shape of workload.

The failure modes are opposites, and so is the work#

Here's the part worth internalizing, because it changes what you actually build. The two strategies fail in opposite directions, which means they demand opposite engineering.

Implicit caching fails silently and for free reasons. Your prefix is invalidated by a single early byte that moved — a timestamp injected into the system prompt, tool definitions serialized in a new order, a per-request greeting prepended instead of appended. The cache misses, you pay full price, and there is no error, no warning, no log line. The work implicit caching demands is prefix discipline: everything fixed goes at the front, byte-for-byte stable; everything that varies goes at the end. Treat your prompt layout as a cache key, because that's exactly what it is.

Explicit caching fails expensively and for math reasons. You paid the write surcharge or the storage rent, and then you didn't reuse the cache enough times within its TTL to earn it back. A one-hour Anthropic cache written at 2x base input that gets read twice is a worse deal than no cache at all. The work explicit caching demands is amortization math: estimate how many hits you'll get before the cache expires, and only pay for the breakpoint when that number clears the break-even.

So when do you pay?#

Reach for implicit caching when traffic is spiky or varied, when the reusable prefix is modest, and when you can keep that prefix byte-stable — you get a real discount for the price of disciplined prompt construction, and the only cost of a miss is a missed discount.

Reach for explicit caching when a large context is reused on a predictable cadence, and especially when your agent's cost and latency need to be deterministic — a long-running agent loop that replays the same 40k-token system prompt on every step shouldn't be gambling on whether the discount shows up. There you want the guarantee, you'll clear the break-even easily, and the predictability is worth more than the surcharge.

The mistake isn't choosing one over the other. It's choosing on the read discount — the one number that's the same either way — instead of on the thing that actually differs: whether you can afford to hope. (For the adjacent distinctions, see prefix caching vs prompt caching and the full cross-provider pricing breakdown.)