Two features, both called "caching," both pitched as the fix for a runaway token bill. It's tempting to treat them as the same lever at two settings. They aren't. Prompt caching and semantic caching cache different objects, sit at different layers of the stack, and — this is the part that matters — fail in different ways. One of them can only ever save you money or do nothing. The other can hand your user a wrong answer and never tell you.

What each one actually caches

Prompt caching reuses an identical prompt prefix. When many of your requests begin with the same large block (a long system prompt, a fat tool schema, a few-shot preamble, RAG boilerplate), the provider can keep the computed attention state (the KV cache) for that prefix and skip recomputing it next time. The match is exact: byte-for-byte the same prefix, or no hit. It lives inside the provider, and the model still runs; you just pay less for the repeated tokens. The various flavors (implicit vs explicit, and the pricing differences across Anthropic, OpenAI, Gemini, and Bedrock) are real, but they're variations on one safe idea. Don't confuse it with an inference engine's prefix caching, either: that is the same trick one layer down.
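To make "exact prefix or no hit" concrete, here is a minimal sketch. It compares characters rather than tokens, and `shared_prefix_len` is a hypothetical helper, not any provider's API; real providers match on tokens and usually in fixed-size blocks. What it shows is where reuse stops: at the first position where two requests differ.

```python
def shared_prefix_len(cached: str, incoming: str) -> int:
    """Count leading characters two prompts share (a toy proxy for reusable prefix)."""
    n = 0
    for a, b in zip(cached, incoming):
        if a != b:
            break
        n += 1
    return n


system = "You are a support assistant for Acme. Answer briefly.\n"

req_a = system + "User: How do I reset my password?"
req_b = system + "User: Where is my invoice?"
# Same content, but something dynamic was put in front of the stable block.
req_c = "Today is Monday.\n" + system + "User: Where is my invoice?"

reusable_ab = shared_prefix_len(req_a, req_b)  # whole system prompt is shared
reusable_ac = shared_prefix_len(req_a, req_c)  # diverges at the first character

print(reusable_ab, reusable_ac)
```

The second comparison is the one to watch for: nothing about the system prompt changed, yet nothing can be reused, because the difference comes first.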

Semantic caching reuses a whole response. It embeds the incoming query, runs a vector-similarity lookup against the queries it has seen before, and if the nearest one is close enough — above a cosine-similarity threshold — it returns that query's stored answer and never calls the model at all. The canonical example: "What is RAG?" and "Can you explain retrieval-augmented generation?" sit within about 0.05 cosine distance of each other, so they can safely share one answer. On a hit, you skip the entire generation; vendors like Redis LangCache and libraries like GPTCache cite cost reductions of up to ~90%.
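A minimal sketch of that lookup loop, under loud assumptions: `embed` here is a toy bag-of-words counter standing in for a real embedding model, the cache is a plain list rather than a vector index, and `SemanticCache` is a hypothetical class, not the API of Redis LangCache or GPTCache. A toy like this only catches rewordings that share vocabulary; matching true paraphrases is what a real embedding model is for.

```python
import math
import re
from collections import Counter
from typing import Optional


def embed(text: str) -> Counter:
    """Toy embedding: lowercase word counts (stand-in for a real embedding model)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


class SemanticCache:
    def __init__(self, threshold: float) -> None:
        self.threshold = threshold
        self.entries = []  # list of (query vector, stored answer)

    def put(self, query: str, answer: str) -> None:
        self.entries.append((embed(query), answer))

    def get(self, query: str) -> Optional[str]:
        """Return the nearest stored answer if it clears the threshold, else None."""
        q = embed(query)
        best_score, best_answer = 0.0, None
        for vec, answer in self.entries:
            score = cosine(q, vec)
            if score > best_score:
                best_score, best_answer = score, answer
        return best_answer if best_score >= self.threshold else None


cache = SemanticCache(threshold=0.8)
cache.put("How do I reset my password?", "Go to Settings > Security > Reset password.")

hit = cache.get("how do i reset my password")    # same words: served from cache
miss = cache.get("What is your refund policy?")  # no overlap: falls through to the model
```

Note what `get` never does: check whether the stored answer is actually right for the new question. It only checks that the questions look alike.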

That's the seduction — and it's genuine. But look at where the two mechanisms put their trust.

One fails as a miss, the other as a wrong answer

Prompt caching's contract is exact-match. Its worst possible outcome is a cache miss: the prefix didn't line up, you pay full price, life goes on. It is structurally incapable of producing a wrong answer, because it never decides that two different things are "the same." You can turn it on almost everywhere and forget about it.

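Semantic caching's contract is similarity, and similarity is a guess. The entire behavior hinges on one number, the threshold, and that number is a trap on both ends. Set it too low and questions that merely sound alike start sharing answers, so false hits climb. Set it too high and almost nothing matches, so the cache rarely hits and the savings evaporate.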

Prompt caching's worst case is a miss. Semantic caching's worst case is a wrong answer that looks exactly like a right one.

There is no threshold that gives you both maximal savings and zero false hits, because "these two questions deserve the same answer" is a judgment your embedding model is approximating, not a fact it knows.
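The trade-off is easy to see on a handful of labeled pairs. The sketch below is illustrative only: the similarity scores are invented for the example, not measured from any embedding model, and `evaluate` is a hypothetical helper. Each pair records how similar two questions scored and whether they truly deserve the same answer.

```python
# (similarity score, do the two questions really share an answer?)
# All numbers are made up for illustration.
pairs = [
    (0.97, True),
    (0.93, True),
    (0.91, False),  # e.g. "cancel my order" vs "cancel my subscription"
    (0.88, True),
    (0.84, False),
    (0.70, False),
]


def evaluate(threshold: float):
    """Return (cache hits, false hits) if we served every pair at or above threshold."""
    hits = [same for score, same in pairs if score >= threshold]
    false_hits = sum(1 for same in hits if not same)
    return len(hits), false_hits


for t in (0.80, 0.90, 0.95):
    total, wrong = evaluate(t)
    print(f"threshold={t:.2f}  hits={total}  false_hits={wrong}")
```

On this toy data, the only threshold with zero false hits also throws away two of the three legitimate matches. Real data will have different numbers, but a threshold that escapes that shape is rare.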

The decision is a correctness budget, not a savings contest

Which is why "which one saves more?" is the wrong question. They aren't competitors — they stack cleanly: prompt caching discounts your repeated prefix, semantic caching skips whole calls for recurring questions. The real question is how much wrongness each part of your product can absorb.

Prompt caching: turn it on broadly. It's deterministic and safe. The only work is keeping your prefixes stable and well-ordered so they actually hit.
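"Stable and well-ordered" mostly means two things: serialize the static parts deterministically, and put anything that changes per request after them. A rough sketch, assuming a prompt assembled as one string; `static_prefix` and `build_prompt` are hypothetical helpers, and a real integration would follow your provider's message format instead.

```python
import json

SYSTEM = "You are a support assistant. Answer from the docs only."


def static_prefix(system: str, tools: list) -> str:
    """Static block: system prompt plus tool schema, serialized deterministically."""
    # sort_keys and fixed separators keep the bytes identical across requests,
    # even if the dicts were built with keys in a different order.
    return system + "\n" + json.dumps(tools, sort_keys=True, separators=(",", ":"))


def build_prompt(system: str, tools: list, user_message: str, now: str) -> str:
    """Static block first; timestamps and user input go last."""
    return static_prefix(system, tools) + f"\nCurrent time: {now}\nUser: {user_message}"


tools_a = [{"name": "search", "description": "Search docs"}]
tools_b = [{"description": "Search docs", "name": "search"}]  # same schema, reordered keys

p1 = build_prompt(SYSTEM, tools_a, "How do I export data?", "2024-01-01T00:00")
p2 = build_prompt(SYSTEM, tools_b, "Where are my invoices?", "2024-01-02T09:30")

prefix = static_prefix(SYSTEM, tools_a)
print(p1.startswith(prefix), p2.startswith(prefix))
```

Had the timestamp gone at the top, every request would start with different bytes and the prefix would never line up.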

Semantic caching: gate it behind a correctness budget. It earns its place in FAQ-shaped domains — support, docs Q&A, onboarding — where many differently-worded questions genuinely share one canonical answer. Use it there with a high threshold, a TTL so a once-true answer doesn't outlive its truth, and, where you can, human-verified canonical answers behind the cache so a hit returns something you've blessed. And keep it away from anything personalized, time-sensitive, or high-stakes — account-specific, medical, legal, financial — where a plausible-but-wrong answer is expensive.
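Those guardrails compose into a single gate in front of the cache. The sketch below is one way to wire them, not a reference implementation: `Entry`, `BLOCKED_TOPICS`, `serve_from_cache`, and the default threshold and TTL values are all assumptions chosen for illustration, and how you classify a request's topic is left out entirely.

```python
import time
from dataclasses import dataclass
from typing import Optional

# Topics that should always go to the model (hypothetical labels).
BLOCKED_TOPICS = {"account", "medical", "legal", "financial"}


@dataclass
class Entry:
    answer: str
    stored_at: float  # Unix timestamp when the answer was cached
    verified: bool    # True if a human has approved this canonical answer


def serve_from_cache(
    entry: Entry,
    similarity: float,
    topic: str,
    *,
    threshold: float = 0.95,       # assumed value; tune on your own data
    ttl_seconds: float = 86400.0,  # assumed value: one day
    now: Optional[float] = None,
) -> Optional[str]:
    """Return the cached answer only if every guardrail passes, else None."""
    now = time.time() if now is None else now
    if topic in BLOCKED_TOPICS:
        return None  # personalized or high-stakes: never serve from cache
    if similarity < threshold:
        return None  # not close enough to trust
    if now - entry.stored_at > ttl_seconds:
        return None  # answer has outlived its TTL
    if not entry.verified:
        return None  # only serve answers a human has blessed
    return entry.answer


entry = Entry(answer="Use Settings > Export.", stored_at=1000.0, verified=True)

ok = serve_from_cache(entry, 0.97, "docs", now=2000.0)
stale = serve_from_cache(entry, 0.97, "docs", now=1000.0 + 90000.0)
blocked = serve_from_cache(entry, 0.99, "medical", now=2000.0)
```

Every `None` here is just a cache miss: the request falls through to the model, which is the failure mode you wanted in the first place.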

So before you reach for the smarter-sounding one, ask the only question that separates them: can this product tolerate a confidently wrong answer in exchange for the saving? If the answer is no, semantic caching doesn't get switched on until it's wearing guardrails. Prompt caching, meanwhile, you should probably have on already.