There are three different things in LLM serving called "caching." They live at three different layers, they cache three different objects, and exactly one of them can quietly hand a user the wrong answer. The word is the same, so teams treat them as interchangeable — and then enable the one they didn't mean. Here's how to tell them apart, in the order you'll meet them.
One: prompt caching (it's on your bill)
If you call a hosted model, "caching" means the provider's billing feature. You mark a stable prefix — a long system prompt, a tool catalog, a stack of few-shot examples — and the provider keeps the computed state around so it doesn't reprocess those tokens next call. The savings land on your invoice.
The numbers are worth memorizing because they decide your prompt layout. Anthropic's prompt caching reads a cached prefix at 0.1x the normal input price — a 90% discount — while a cache write costs 1.25x, on a roughly five-minute sliding TTL, with a 1,024-token minimum before anything caches at all. OpenAI does it automatically for prompts over 1,024 tokens, matching in 128-token increments and routing repeat prefixes to a warm server with no code change. Gemini splits it into implicit (automatic) and explicit (you create a cached-content object) modes.
The operational rule falls straight out of the pricing: put the stable bytes first and keep them byte-identical. Inject a timestamp or reorder your tools at the top of the prompt and you've moved the cache boundary, blown the write, and paid full freight. This is the layer the existing playbook on why your prompt cache keeps missing is about — and it's the only one that ever touches a dollar figure.
Two: prefix caching (it's in your GPU)
Now self-host the model and the word changes meaning underneath you. There's no provider, no bill, and no cache_control parameter — yet your inference engine is doing something that looks identical from the outside and is completely different inside.
When a model processes a prompt, it builds a KV cache: the attention key/value tensors for every token, held in GPU memory. Automatic prefix caching keeps those tensors around so that the next request sharing the same prefix skips the prefill for it entirely. vLLM's implementation hashes each KV block by its tokens plus everything before it and reuses physical blocks across requests on top of PagedAttention; SGLang's RadixAttention organizes the same idea as an LRU radix tree of cached prefixes. Either way you flip one switch (enable_prefix_caching in vLLM; on by default in SGLang) and concurrent requests that share a system prompt stop recomputing it.
Prompt caching reuses tokens you'd have been billed for. Prefix caching reuses tensors you'd have recomputed. One shows up on an invoice; the other only shows up in your p99 latency.
This is why "we turned on caching and the bill didn't move" is a category error. Engine-level prefix caching can't change your bill because there is no bill — it buys you lower time-to-first-token and a bigger effective batch, which is the same KV-memory pressure that KV-cache quantization and prefill/decode disaggregation also fight over. If you're choosing an engine, this is table stakes; it ships in vLLM, SGLang, and the rest.
Note what both of these caches share: they reuse computation for an identical prefix and produce the exact same output tokens they would have otherwise. They are free wins. You never have to reason about correctness.
Three: semantic caching (it can be wrong)
The third one breaks that guarantee, and that's the whole point to internalize.
Semantic caching, of which GPTCache is the canonical implementation, doesn't cache tokens or tensors. It caches responses. A new query gets embedded, a vector store finds the most similar past query, and if the similarity clears a threshold, the stored answer is returned and the model is never called. That's not skipping recomputation of an identical prefix — it's skipping the model on the bet that two different questions are close enough to share an answer.
That bet is a tuning knob, and the knob is load-bearing. Set the threshold too loose and a question that's topically near a cached one but materially different gets the old answer — confidently, with no error. Set it too tight and your hit rate evaporates. Worse, when prompts carry user-specific data, a sloppy match can return one user's response to another. The peer-reviewed write-up is candid that the similarity evaluator is where the risk concentrates.
So semantic caching isn't a default to flip on; it's a correctness decision. It earns its place for stable, factual, high-duplication traffic — FAQ-style support, repeated documentation lookups — and it's a liability anywhere the right answer depends on this user or this moment.
The one sentence to keep
Prompt caching and prefix caching skip recomputing an identical prefix and can never change your output — turn them on. Semantic caching skips the model on a similarity guess and can change your output — measure before you trust it. Three caches, one word, and the difference between a free win and a silent bug is knowing which layer you're standing on.



