A semantic cache is the rare optimization that can make your product cheaper, faster, and wrong, all in the same request. Most teams discover the first two properties in a blog post and the third one in production, when a user asks when the store closes and the system cheerfully tells them when it opens — because three weeks ago someone asked a question that embedded a little too close.
The pitch is genuinely good. Your agent answers the same handful of questions thousands of times, phrased a thousand ways. Why pay for, and wait on, an LLM call you've effectively already made? A semantic cache embeds the incoming query, searches a vector store for the nearest past query, and if the match is close enough, returns the stored answer. No model call. Latency drops to a vector lookup; the bill drops with it. GPTCache, the library that popularized the pattern, advertises order-of-magnitude cost and speed wins, though its own README pins those to "a sample benchmark" rather than to documented numbers.
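The loop described above (embed, find the nearest past query, compare against a cutoff, otherwise call the model) fits in a few dozen lines. What follows is a minimal sketch, not GPTCache's implementation: `SemanticCache`, `answer`, and the injected `embed` and `call_model` callables are all hypothetical names, and a linear scan over an in-memory list stands in for a real vector store.

```python
import math
from dataclasses import dataclass, field
from typing import Callable, Optional, Sequence

Vector = Sequence[float]


def cosine_similarity(a: Vector, b: Vector) -> float:
    """Cosine of the angle between two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


@dataclass
class SemanticCache:
    embed: Callable[[str], Vector]   # hypothetical: wraps your embedding model
    threshold: float = 0.9           # minimum similarity that counts as a hit
    _entries: list = field(default_factory=list)  # (vector, answer) pairs

    def lookup(self, query: str) -> Optional[str]:
        """Return the answer stored for the nearest past query, or None on a miss."""
        if not self._entries:
            return None
        q = self.embed(query)
        best_score, best_answer = max(
            ((cosine_similarity(q, vec), ans) for vec, ans in self._entries),
            key=lambda pair: pair[0],
        )
        return best_answer if best_score >= self.threshold else None

    def store(self, query: str, answer: str) -> None:
        self._entries.append((self.embed(query), answer))


def answer(query: str, cache: SemanticCache, call_model: Callable[[str], str]) -> str:
    cached = cache.lookup(query)
    if cached is not None:
        return cached              # hit: no model call
    fresh = call_model(query)      # miss: pay for the call, then remember it
    cache.store(query, fresh)
    return fresh
```

Everything interesting about this sketch lives in one comparison, `best_score >= self.threshold`. The rest is plumbing.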
First, the thing it is constantly confused with
Before going further: semantic caching is not the prompt caching your provider sells you. They share a word and nothing else.
Anthropic's prompt caching and OpenAI's equivalent both reuse the model's internal computation for an exact, byte-identical prompt prefix. Change one character in that prefix and the cache misses entirely. It's a KV-cache optimization (cache reads cost roughly a tenth of fresh input tokens on Anthropic, half on OpenAI), and it makes a call cheaper rather than skipping it. There is no notion of "similar." It is the opposite of fuzzy.
Semantic caching lives outside the model. It matches different wordings of the same question and skips the call completely. One is a discount on computation; the other is a bet that two strings mean the same thing. We have written before about why prompt caching keeps missing; this is the riskier sibling, and the risk is structural.
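To see the difference in kind, it helps to run the arithmetic. The sketch below uses the rough read ratios quoted above and a made-up price of one unit per input token. It deliberately ignores cache-write premiums, output tokens, and the cost of the embedding lookup, so treat it as an illustration of shape, not a pricing calculator. `input_cost` and `PRICE` are hypothetical names.

```python
from typing import Optional


def input_cost(prefix_tokens: int, new_tokens: int, price_per_token: float,
               cached_read_ratio: Optional[float] = None) -> float:
    """Input-side cost of one call.

    `cached_read_ratio` is the fraction of the normal price charged for a
    prefix served from the provider's cache; None means the prefix missed
    and is billed in full.
    """
    ratio = 1.0 if cached_read_ratio is None else cached_read_ratio
    return prefix_tokens * price_per_token * ratio + new_tokens * price_per_token


PRICE = 1.0  # hypothetical price unit per input token

full = input_cost(9_000, 1_000, PRICE)         # prefix missed the prompt cache
tenth = input_cost(9_000, 1_000, PRICE, 0.1)   # "roughly a tenth" read ratio
half = input_cost(9_000, 1_000, PRICE, 0.5)    # "half" read ratio
semantic_hit = 0.0                             # semantic cache hit: no model call
```

With a 9,000-token shared prefix and 1,000 new tokens, a prompt-cache hit takes the input bill from 10,000 units to roughly 1,900 or 5,500. A semantic-cache hit takes it to zero, which is exactly why it is tempting and exactly why it has to be right.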
The whole product is a threshold
Here is the single decision that determines whether a semantic cache is an asset or a liability: the similarity threshold. Embeddings give you a number — cosine similarity — for how close two queries are. You pick a cutoff. Above it, you serve the cached answer. Below it, you call the model.
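Written out, the decision is short. The symbols here are introduced only for illustration: $q$ is the embedding of the incoming query, $\mathcal{C}$ is the set of cached query embeddings, and $\tau$ is the threshold you picked.

```latex
\[
\operatorname{sim}(q, c) = \frac{q \cdot c}{\lVert q \rVert \, \lVert c \rVert},
\qquad
\text{serve the cached answer} \iff \max_{c \in \mathcal{C}} \operatorname{sim}(q, c) \ge \tau .
\]
```

Everything else in the system is fixed by your embedding model and your traffic; $\tau$ is the one number you choose.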
That cutoff is a precision/recall dial, and the two ends are not symmetric.
Set the threshold loose and you save more money while serving more wrong answers. Set it tight and you serve fewer wrong answers while saving less money. There is no setting that gives you both. There is only the setting you chose on purpose, and the one you inherited from a tutorial.
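A small sweep makes the dial concrete. This is a sketch under stated assumptions: each record pairs a similarity score with a human label saying whether the cached answer would actually have been correct, and the scores in `SAMPLE` are invented for illustration, not measured from any model. `LabeledPair`, `Counts`, and `evaluate` are hypothetical names. In practice the records would come from your own query log.

```python
from typing import Iterable, NamedTuple


class LabeledPair(NamedTuple):
    similarity: float   # cosine similarity between a new query and its nearest cached query
    same_answer: bool   # human label: would the cached answer have been correct?


class Counts(NamedTuple):
    true_hits: int
    false_hits: int
    misses: int


def evaluate(pairs: Iterable[LabeledPair], threshold: float) -> Counts:
    """Count what the cache would have done with each pair at this threshold."""
    true_hits = false_hits = misses = 0
    for pair in pairs:
        if pair.similarity < threshold:
            misses += 1          # falls through to the model
        elif pair.same_answer:
            true_hits += 1       # money saved, answer correct
        else:
            false_hits += 1      # money saved, answer wrong
    return Counts(true_hits, false_hits, misses)


# Invented scores for illustration only.
SAMPLE = [
    LabeledPair(0.97, True),
    LabeledPair(0.93, True),
    LabeledPair(0.91, False),
    LabeledPair(0.88, True),
    LabeledPair(0.86, False),
    LabeledPair(0.80, True),
    LabeledPair(0.72, False),
]

for t in (0.95, 0.90, 0.85):
    print(t, evaluate(SAMPLE, t))
```

On this toy data, loosening the threshold from 0.95 to 0.85 triples the true hits (1 to 3) and takes false hits from 0 to 2. The numbers are made up; the direction of the trade is not.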
The canonical failure, which every practitioner write-up eventually reaches for: "What time does the store open?" and "When does the store close?" can score above 0.85 cosine similarity with each other, so at a 0.85 threshold the second counts as a hit for the first. They are lexical twins and semantic opposites. A cache tuned for savings hands the second asker the first asker's answer, with no hedge, no uncertainty, nothing to signal that it guessed.
This isn't hypothetical hand-wringing. The MeanCache paper measured exactly this on contextual queries and found a naive GPTCache configuration produced 54 false hits where their approach produced 3. Fifty-four confidently wrong answers, from a system whose entire value proposition is being right enough to skip the model.
How to use one without getting burned
Semantic caching is worth doing. It's just not worth doing casually. The rules that separate the cost win from the support ticket:
- Start strict, then loosen with evidence. Begin at a cosine similarity of 0.9 or higher. Then evaluate on a real query log, not invented examples, counting true hits, false hits, and misses. Tune the threshold to hold false hits inside your tolerance (a coding helper's tolerance is not a bank's). Vendors expose this knob: Redis LangCache and Portkey's semantic cache both let you set the threshold. If the knob is expressed as a distance rather than a similarity, smaller means stricter. Either way, the default is a starting point, not an answer.
- Never cache what isn't shared. Anything personalized, account-specific, time-sensitive, or dependent on conversation state must bypass the cache. "What's my balance?" has no business hitting a shared semantic store, and "what's the latest model?" rots by the week.
- Plan for invalidation you can't cleanly do. An embedding-keyed store has no tidy "delete everything about topic X" operation; you'd have to search the embedding space to find the affected entries. So lean on TTLs and event-based purges, because the nastier failure isn't a fresh wrong answer. It's a cached wrong answer: a once-correct response that went stale, now re-served to every similar question that follows until something expires it.
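The last two rules can be sketched together. The code below is one possible shape, not any vendor's API: `Request`, `cacheable`, `Entry`, and `ExpiringStore` are hypothetical names, the bypass flags are assumed to be set by the caller, and `entry_id` is assumed to be whatever identifier your vector store returns for a match. Tagging entries with topics at write time is an assumption that sidesteps the "delete everything about topic X" problem: you purge by label instead of searching the embedding space.

```python
import time
from dataclasses import dataclass, field
from typing import Callable, Iterable, Optional


@dataclass(frozen=True)
class Request:
    query: str
    user_scoped: bool = False        # touches account or personal data
    time_sensitive: bool = False     # answer changes with the clock or the news
    mid_conversation: bool = False   # depends on earlier turns


def cacheable(req: Request) -> bool:
    """Only shared, stable, stateless questions may touch the cache."""
    return not (req.user_scoped or req.time_sensitive or req.mid_conversation)


@dataclass
class Entry:
    answer: str
    expires_at: float                # clock value after which the entry is dead
    tags: frozenset = frozenset()    # topic labels used for event-based purges


@dataclass
class ExpiringStore:
    ttl_seconds: float
    clock: Callable[[], float] = time.monotonic   # injectable for testing
    _entries: dict = field(default_factory=dict)  # entry id -> Entry

    def put(self, entry_id: str, answer: str, tags: Iterable[str] = ()) -> None:
        expires_at = self.clock() + self.ttl_seconds
        self._entries[entry_id] = Entry(answer, expires_at, frozenset(tags))

    def get(self, entry_id: str) -> Optional[str]:
        entry = self._entries.get(entry_id)
        if entry is None:
            return None
        if self.clock() >= entry.expires_at:
            del self._entries[entry_id]   # lazy expiry on read
            return None
        return entry.answer

    def purge_tag(self, tag: str) -> int:
        """Event-based purge: drop every entry labeled `tag`; return how many."""
        doomed = [k for k, e in self._entries.items() if tag in e.tags]
        for k in doomed:
            del self._entries[k]
        return len(doomed)
```

When the store changes its hours, a call like `store.purge_tag("store-hours")` removes the stale answers immediately; the TTL is the backstop for every change nobody remembered to announce.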
The embedding model you cache with matters as much as the LLM you're skipping; a stronger retrieval model means tighter, more trustworthy matches, and the vector store is the same infrastructure you already run for RAG.
Semantic caching rewards teams who treat it as a retrieval problem with a correctness budget, and punishes teams who treat it as a config flag. The cache will always be happy to answer. The only question is whether you taught it when to keep its mouth shut.



