A semantic cache is the rare optimization that can make your product cheaper, faster, and wrong, all in the same request. Most teams discover the first two properties in a blog post and the third one in production, when a user asks when the store closes and the system cheerfully tells them when it opens — because three weeks ago someone asked a question that embedded a little too close.

The pitch is genuinely good. Your agent answers the same handful of questions thousands of times, phrased a thousand ways. Why pay for, and wait on, an LLM call you've effectively already made? A semantic cache embeds the incoming query, searches a vector store for the nearest past query, and if the match is close enough, returns the stored answer. No model call. Latency drops to a vector lookup; the bill drops with it. GPTCache, the library that popularized the pattern, advertises order-of-magnitude cost and speed wins — though it's worth noting its own README pins those to "a sample benchmark" rather than documented numbers.
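The lookup described above fits in a few dozen lines. What follows is a minimal sketch, not GPTCache's implementation: `toy_embed` is a bag-of-words stand-in for a real embedding model, the linear scan stands in for a vector store, and `SemanticCache`, `answer`, and `call_llm` are hypothetical names chosen for illustration.

```python
import math
from collections import Counter


def toy_embed(text: str) -> Counter:
    """Stand-in for a real embedding model: a bag-of-words count vector."""
    return Counter(text.lower().replace("?", "").split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[token] * b[token] for token in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class SemanticCache:
    """Hypothetical in-memory cache; a real one would use a vector store."""

    def __init__(self, threshold: float):
        self.threshold = threshold
        self.entries = []  # list of (embedding, stored answer)

    def get(self, query: str):
        """Return the nearest stored answer if it clears the threshold."""
        q = toy_embed(query)
        best_score, best_answer = 0.0, None
        for emb, stored in self.entries:
            score = cosine(q, emb)
            if score > best_score:
                best_score, best_answer = score, stored
        return best_answer if best_score >= self.threshold else None

    def put(self, query: str, stored_answer: str) -> None:
        self.entries.append((toy_embed(query), stored_answer))


def answer(query: str, cache: SemanticCache, call_llm):
    """Serve from cache when possible; otherwise call the model and store."""
    cached = cache.get(query)
    if cached is not None:
        return cached, True  # cache hit: no model call
    fresh = call_llm(query)
    cache.put(query, fresh)
    return fresh, False
```

Everything interesting happens in one comparison: `best_score >= self.threshold`. The rest is plumbing.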

First, the thing it is constantly confused with

Before going further: semantic caching is not the prompt caching your provider sells you. They share a word and nothing else.

Anthropic's and OpenAI's prompt caching both reuse the model's internal computation for an exact, byte-identical prompt prefix. Change one character in that prefix and the cache misses entirely. It's a KV-cache optimization (cache reads cost roughly a tenth of fresh input tokens on Anthropic, half on OpenAI), and it makes a call cheaper without skipping it. There is no notion of "similar." It is the opposite of fuzzy.
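To make the contrast concrete, here is a toy model of exact-prefix matching and the discount it buys. It is an illustration under stated assumptions, not how either provider implements caching: `prefix_key` and `input_cost` are hypothetical helpers, the price is an abstract unit per token, and any surcharge for writing to the cache is ignored.

```python
import hashlib


def prefix_key(prefix: str) -> str:
    """Exact-match key: any byte-level change yields a different key."""
    return hashlib.sha256(prefix.encode("utf-8")).hexdigest()


def input_cost(prefix_tokens: int, suffix_tokens: int, price_per_token: float,
               cached_read_multiplier: float, prefix_hit: bool) -> float:
    """Input cost of one call; only the cached prefix is discounted."""
    prefix_rate = price_per_token * (cached_read_multiplier if prefix_hit else 1.0)
    return prefix_tokens * prefix_rate + suffix_tokens * price_per_token


# One changed character is a miss: the keys do not match.
same = prefix_key("You are a helpful agent.") == prefix_key("You are a helpful agent!")

# Hypothetical call: 10,000-token shared prefix, 200-token user question,
# cached reads billed at a tenth of the normal input price.
hit_cost = input_cost(10_000, 200, 1.0, 0.1, prefix_hit=True)
miss_cost = input_cost(10_000, 200, 1.0, 0.1, prefix_hit=False)
```

Note what the function never does: return zero. Even on a perfect hit the model still runs and the suffix is billed in full.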

Semantic caching lives outside the model. It matches different wordings of the same question and skips the call completely. One is a discount on computation; the other is a bet that two strings mean the same thing. We have written before about why prompt caching keeps missing; this is the riskier sibling, and the risk is structural.

The whole product is a threshold

Here is the single decision that determines whether a semantic cache is an asset or a liability: the similarity threshold. Embeddings give you a number — cosine similarity — for how close two queries are. You pick a cutoff. Above it, you serve the cached answer. Below it, you call the model.

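That cutoff is a precision/recall dial, and the two ends are not symmetric: a missed hit costs you one model call you would have made anyway, while a false hit hands a user a wrong answer.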

Set the threshold loose and you save more money while serving more wrong answers. Set it tight and you serve fewer wrong answers while saving less money. There is no setting that gives you both. There is only the setting you chose on purpose, and the one you inherited from a tutorial.
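One way to choose the setting on purpose is to sweep it over labeled query pairs before shipping. The sketch below assumes you already have such a set: each pair is a similarity score plus a human judgment of whether the two queries really want the same answer. The scores in `PAIRS` are invented for illustration, and `sweep` is a hypothetical helper.

```python
# (cosine similarity, do the two queries really want the same answer?)
# Hypothetical labeled pairs; in practice these come from your own traffic.
PAIRS = [
    (0.97, True), (0.93, True), (0.91, False), (0.88, True),
    (0.86, False), (0.82, True), (0.74, False), (0.60, False),
]


def sweep(pairs, thresholds):
    """For each threshold, count cache hits and how many of them are wrong."""
    rows = []
    for t in thresholds:
        served = [same for score, same in pairs if score >= t]
        false_hits = sum(1 for same in served if not same)
        rows.append({
            "threshold": t,
            "hits": len(served),
            "hit_rate": len(served) / len(pairs),
            "false_hits": false_hits,
        })
    return rows


for row in sweep(PAIRS, [0.80, 0.90, 0.95]):
    print(row)
```

On this made-up data the loosest setting serves six of eight queries from cache and gets two wrong, while the tightest serves one and gets none wrong. The shape of that trade is the point, not the specific numbers.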

The canonical failure, which every practitioner write-up eventually reaches for: at a 0.85 threshold, "What time does the store open?" and "When does the store close?" can score above 0.85 cosine similarity against each other. They are lexical twins and semantic opposites. A cache tuned for savings hands the second asker the first asker's answer, with no hedge, no uncertainty, nothing to signal that it guessed.

This isn't hypothetical hand-wringing. The MeanCache paper measured exactly this on contextual queries and found a naive GPTCache configuration produced 54 false hits where their approach produced 3. Fifty-four confidently wrong answers, from a system whose entire value proposition is being right enough to skip the model.

How to use one without getting burned

Semantic caching is worth doing. It's just not worth doing casually. The rules that separate the cost win from the support ticket:

The embedding model you cache with matters as much as the LLM you're skipping; a stronger retrieval model means tighter, more trustworthy matches, and the vector store is the same infrastructure you already run for RAG.

Semantic caching rewards teams who treat it as a retrieval problem with a correctness budget, and punishes teams who treat it as a config flag. The cache will always be happy to answer. The only question is whether you taught it when to keep its mouth shut.