The cheapest call to a language model is the one you never make. Prompt caching gets you part of the way there — it makes a call cheaper by letting the provider reuse computation on a repeated prefix — but the call still happens. Semantic caching is the more aggressive move: when a new question is close enough to one you've already answered, you return the stored answer and never touch the model at all.
The mechanism is the same everywhere, regardless of which tool ships it. Embed the incoming query into a vector. Run an approximate-nearest-neighbor search against the embeddings of every query you've served before. If the closest match clears a similarity threshold, return its cached response. If nothing clears the bar, call the model, then store the new query-and-answer pair so the next person asking something similar gets the fast path. That's it. The interesting part isn't the loop — it's the threshold, and we'll get to why it's the whole ballgame.
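The loop above can be sketched in a few lines. This is a minimal illustration, not any tool's implementation: `toy_embed` is a bag-of-words stand-in for a real embedding model, the linear scan stands in for an approximate-nearest-neighbor index, and `SemanticCache`, `ask`, and `call_model` are hypothetical names chosen for this example.

```python
import math
import re
from collections import Counter
from typing import Callable, Optional


def toy_embed(text: str) -> Counter:
    # Assumption: stand-in for a real embedding model (lowercase word counts).
    return Counter(re.findall(r"[a-z0-9']+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    # Cosine similarity between two sparse word-count vectors.
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class SemanticCache:
    def __init__(self, call_model: Callable[[str], str], threshold: float = 0.8):
        self.call_model = call_model  # hypothetical function that calls the LLM
        self.threshold = threshold    # minimum similarity that counts as a hit
        self.entries = []             # list of (query embedding, response) pairs

    def lookup(self, query: str) -> Optional[str]:
        vec = toy_embed(query)
        best_score, best_response = 0.0, None
        # Linear scan for clarity; a real deployment would use an ANN index.
        for stored_vec, response in self.entries:
            score = cosine(vec, stored_vec)
            if score > best_score:
                best_score, best_response = score, response
        return best_response if best_score >= self.threshold else None

    def ask(self, query: str) -> str:
        cached = self.lookup(query)
        if cached is not None:
            return cached  # fast path: the model is never called
        response = self.call_model(query)
        # Store the new pair so similar future queries can hit the cache.
        self.entries.append((toy_embed(query), response))
        return response
```

Everything except `threshold` is plumbing you could swap out: the embedder, the index, the storage. The threshold is the one parameter that decides whether a stored answer gets served.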
Three shapes for the same idea
What actually differs between tools is where the cache lives, and that's a deployment decision more than a technical one.
GPTCache is the library you embed. You own the embedding model, the vector backend (Faiss, Milvus, Redis, Qdrant), and the eviction policy. It's the most flexible option and the most assembly-required, and it's worth knowing that its release cadence has slowed: the last tagged release was in 2024. It still works; it's just no longer the obvious default it was two years ago.
Redis is the pragmatic answer if Redis is already in your stack. RedisVL's SemanticCache gives you the same vector-similarity logic against the database you're already running for sessions and queues, so the cache adds no new infrastructure — just an embedding hop. Redis also offers LangCache, a managed version that hides the embedding and vector-search operations behind a service, for teams that would rather not run any of it.
Gateways make it nearly free in effort. If you already proxy your model calls through Portkey (or a similar gateway) for routing, retries, and observability, semantic caching is a flag on the route. The catch: cache quality is bounded by the gateway's embedder and your ability to tune its threshold, which you may control less precisely than in a library you own. One honest note on the landscape: semantic caching is a genuine differentiator here, not a universal gateway feature. LiteLLM's proxy caching, for instance, is exact-match and Redis-oriented by default, and semantic matching is a separate cache type you have to opt into; "gateway semantic caching" is not a given. If token spend is the thing keeping you up at night, semantic caching is one lever among several; see how to reduce AI agent token costs for the rest.
The one knob: the similarity threshold
Here is the single non-obvious thing about semantic caching, and it's the thing every vendor benchmark quietly depends on. The similarity threshold is not a tuning detail — it's the entire risk model.
Set it too low and you get false cache hits: a question that's lexically near but semantically different ("how do I cancel my plan" vs. "how do I change my plan") matches, and you confidently serve the wrong answer. Set it too high and almost nothing matches, your hit rate collapses, and you've added an embedding call and a vector search to every request for nearly no payoff. The trap is that the distributions overlap — there's a grey zone of similarity scores where correct paraphrases and genuinely distinct questions are interleaved, and no threshold cleanly separates them. Embedding geometry alone cannot tell you which side of the line a borderline query belongs on.
A false cache hit isn't a missed saving. It's a wrong answer, served instantly, with the confidence of a real one.
The encouraging part is that the useful range is narrow and findable. The GPT Semantic Cache study reported cutting API calls by up to 68.8% while keeping the correct-hit rate above 97%, with the sweet spot near 0.8 cosine similarity. Independent analyses land in the same neighborhood, with optimal thresholds clustering just under 0.8. Treat those as starting points, not gospel; the right number is a function of your query distribution and how costly a wrong answer is in your domain. Measure the false-hit rate on a held-out set before you trust a number you read in a blog post, including this one.
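One way to run that measurement is a threshold sweep over labeled pairs. The sketch below assumes you have already scored each held-out query against its nearest cached neighbor and hand-labeled whether the cached answer would have been correct. The scores in `HELD_OUT` are made up for illustration, and `sweep` and `pick_threshold` are hypothetical helper names.

```python
from typing import List, Optional, Tuple

# (similarity to nearest cached query, was the cached answer actually correct?)
# Hypothetical numbers for illustration only; substitute your own labeled set.
HELD_OUT: List[Tuple[float, bool]] = [
    (0.95, True), (0.91, True), (0.88, True), (0.86, False), (0.84, True),
    (0.82, True), (0.79, False), (0.77, True), (0.72, False), (0.60, False),
]


def sweep(pairs: List[Tuple[float, bool]],
          thresholds: List[float]) -> List[Tuple[float, float, float]]:
    """Return (threshold, hit_rate, false_hit_rate) for each threshold."""
    rows = []
    for t in thresholds:
        hits = [correct for score, correct in pairs if score >= t]
        hit_rate = len(hits) / len(pairs)
        # False-hit rate: share of served cache hits that were wrong answers.
        false_hit_rate = hits.count(False) / len(hits) if hits else 0.0
        rows.append((t, hit_rate, false_hit_rate))
    return rows


def pick_threshold(pairs: List[Tuple[float, bool]],
                   thresholds: List[float],
                   max_false_hit_rate: float) -> Optional[float]:
    """Lowest threshold with at least one hit and a false-hit rate within budget."""
    for t, hit_rate, false_hit_rate in sweep(pairs, sorted(thresholds)):
        if hit_rate > 0 and false_hit_rate <= max_false_hit_rate:
            return t
    return None


if __name__ == "__main__":
    for t, hit_rate, fhr in sweep(HELD_OUT, [0.70, 0.75, 0.80, 0.85, 0.90]):
        print(f"threshold={t:.2f}  hit_rate={hit_rate:.2f}  false_hit_rate={fhr:.2f}")
```

On this toy data the false-hit rate does not fall monotonically as the threshold rises, which is the grey zone in miniature: wrong matches sit among correct ones, so the budget you set for `max_false_hit_rate` matters more than any single published number.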
When not to cache at all
The strongest move is sometimes to leave the feature off. Semantic caching assumes a stateless question-to-answer mapping, and that assumption breaks for whole categories of request:
- Time-sensitive answers — prices, inventory, "what happened today" — where yesterday's cached response is simply wrong.
- Personalized answers, where the same words from two users should produce different results.
- Stateful or tool-driven turns, where the "answer" is an action with side effects, not a reusable string.
- High-stakes domains — medical, legal, financial — where a near-miss false hit isn't a degraded experience, it's a liability.
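The four exclusions above could be enforced as a gate in front of the cache lookup. This is a sketch under a strong assumption: that something upstream (routing rules, request metadata, a classifier) has already flagged each request. The `Request` fields and both function names are hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Request:
    query: str
    # Assumption: these flags are set upstream, before the cache is consulted.
    time_sensitive: bool = False    # prices, inventory, "what happened today"
    personalized: bool = False      # answer depends on who is asking
    has_side_effects: bool = False  # stateful or tool-driven turn
    high_stakes: bool = False       # medical, legal, financial


def bypass_reasons(req: Request) -> List[str]:
    """List every reason this request should skip the semantic cache."""
    checks = [
        (req.time_sensitive, "time-sensitive"),
        (req.personalized, "personalized"),
        (req.has_side_effects, "side-effects"),
        (req.high_stakes, "high-stakes"),
    ]
    return [label for flagged, label in checks if flagged]


def cache_eligible(req: Request) -> bool:
    # Only requests with no bypass reason may be answered from the cache.
    return not bypass_reasons(req)
```

Returning the reasons, not just a boolean, lets you log why a request skipped the cache, which helps when auditing whether the gate is too strict or too loose.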
For everything else — FAQs, docs assistants, repetitive support queries, the long tail of "how do I…" questions that thousands of users ask in slightly different words — semantic caching is one of the highest-leverage cost cuts available, and it composes cleanly with prompt caching rather than competing with it. Use prompt caching to make the calls you keep cheaper, and semantic caching to delete the calls you never needed to make twice. The skill isn't installing either one. It's knowing which questions are safe to answer from memory.



