For two years the default architecture for "make the model answer from my documents" has been retrieval-augmented generation (RAG): embed the corpus, embed the question, fetch the nearest chunks, paste them into the prompt. RAG works, and the whole tooling economy around it (vector databases, rerankers, chunkers) exists to paper over its one structural weakness: sometimes the retriever fetches the wrong chunk, and when it does, the model answers confidently from the wrong evidence and you never see the seam.
Cache-augmented generation (CAG) is the proposal to delete the retriever entirely. The paper that named it, "Don't Do RAG: When Cache-Augmented Generation Is All You Need for Knowledge Tasks," makes a deliberately provocative claim, yet the mechanism is simpler than RAG, not more complex.
What CAG actually does
You take your entire knowledge base — every document the model might need — and load it into the context window once. You run a single forward pass and save the resulting key-value cache, the model's internal representation of that text. Then, as the reference implementation puts it, "during inference, the preloaded KV-cache enables the model to generate responses directly, eliminating the need for retrieval." Every query reuses the same cache. There is no index, no embedding model, no nearest-neighbor search at query time.
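To make the mechanism concrete, here is a minimal sketch of KV-cache reuse. It is a toy, not the paper's implementation: it assumes a single attention layer, made-up 2x2 projection weights, and two-dimensional "token" vectors, and the names `encode`, `answer_with_cache`, and `answer_from_scratch` are hypothetical. The only point is that keys and values computed once for the corpus give the same attention output as re-encoding the corpus for every question.

```python
import math

# Made-up projection matrices (assumption); a real model learns these.
W_Q = [[0.3, 0.1], [0.0, 0.6]]
W_K = [[0.5, -0.2], [0.1, 0.3]]
W_V = [[0.2, 0.4], [-0.3, 0.1]]

def project(vec, weights):
    """Multiply a vector by a weight matrix given as a list of rows."""
    return [sum(v * w for v, w in zip(vec, row)) for row in weights]

def encode(tokens):
    """Compute the keys and values for a list of token vectors."""
    keys = [project(t, W_K) for t in tokens]
    values = [project(t, W_V) for t in tokens]
    return keys, values

def attend(query_token, keys, values):
    """Single-head scaled dot-product attention for one query token."""
    query = project(query_token, W_Q)
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    peak = max(scores)  # subtract the max for numerical stability
    weights = [math.exp(s - peak) for s in scores]
    total = sum(weights)
    return [sum(w * v[i] for w, v in zip(weights, values)) / total
            for i in range(len(values[0]))]

# Toy "knowledge base": three token vectors, encoded exactly once.
corpus = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
cached_keys, cached_values = encode(corpus)

def answer_with_cache(question_token):
    """Encode only the new token and reuse the preloaded corpus cache."""
    new_keys, new_values = encode([question_token])
    return attend(question_token,
                  cached_keys + new_keys,
                  cached_values + new_values)

def answer_from_scratch(question_token):
    """Re-encode the whole corpus on every call (no cache)."""
    keys, values = encode(corpus + [question_token])
    return attend(question_token, keys, values)
```

Both functions return the same vector for any question token; the cached path simply skips the corpus encoding. In a real multi-layer model the same property holds under causal masking, because the corpus tokens never attend to the question that follows them.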
The naive objection — "isn't that just stuffing everything in the prompt?" — misses the cache. Prepending your whole corpus to every request would re-encode it on each call, which is exactly the long-context cost everyone tries to avoid. CAG pays that encoding cost a single time and reuses the result. It is long-context prompting plus KV-cache reuse, and the reuse is what makes it tractable.
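The difference is easy to see with back-of-the-envelope arithmetic. The sketch below counts only tokens that must be encoded (prefill); the function names and the example numbers are illustrative assumptions, not figures from the paper. Note that each cached query still attends over the stored keys and values, so decoding is not free; what disappears is the repeated encoding of the corpus.

```python
def tokens_encoded_naive(corpus_tokens, question_tokens, num_queries):
    """Prompt stuffing: the corpus is re-encoded on every request."""
    return num_queries * (corpus_tokens + question_tokens)

def tokens_encoded_cag(corpus_tokens, question_tokens, num_queries):
    """CAG: the corpus is encoded once, then only each question is encoded."""
    return corpus_tokens + num_queries * question_tokens

# Hypothetical workload: a 50,000-token handbook, 30-token questions,
# 1,000 queries served from one cache.
naive = tokens_encoded_naive(50_000, 30, 1_000)
cached = tokens_encoded_cag(50_000, 30, 1_000)
print(f"naive: {naive:,} tokens, cached: {cached:,} tokens")
```

With these assumed numbers the naive approach encodes 50,030,000 tokens and the cached approach 80,000. The gap widens as the query count grows, which is the amortization the cache buys.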
On SQuAD and HotPotQA, the paper reports CAG matching or beating both sparse and dense RAG by BERTScore while removing retrieval latency outright. That result is less surprising than it sounds: if the relevant passage is already in context, you have removed the only step where RAG can fail to find it.
RAG's failure mode is fetching the wrong chunk. CAG's answer is to fetch nothing — because everything is already in the cache.
The axis that actually decides it
The benchmarks make CAG look like a clean upgrade. It isn't, and reading the comparison as "which is more accurate" or "which is faster" leads you to the wrong system. The decision turns on two properties of your knowledge base, neither of which is quality.
The first is size. CAG requires, in the repo's own words, that "the entire knowledge source fit within the context window." A product manual, an internal policy handbook, an API reference, the docs for one codebase: these are kilobytes to low megabytes of text, and they fit in a sufficiently long context window. A corpus of ten million support tickets does not fit in any context window that exists, and no amount of cleverness changes that. RAG scales to corpus sizes CAG structurally cannot touch.
The second, and the one teams underweight, is mutability. The KV cache is a snapshot. It is correct for exactly the version of the knowledge you encoded, and the moment a document changes you have to recompute the cache to reflect it. RAG handles an edit by re-embedding the changed document and upserting its vectors: incremental, cheap, online. CAG handles the same edit by rebuilding the snapshot. For knowledge that changes hourly, that is disqualifying; for a manual that changes quarterly, the rebuild cost is negligible.
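A rough pre-flight check for the size constraint might look like the sketch below. Everything in it is an assumption for illustration: the four-characters-per-token ratio is only a rule of thumb for English text (a real check would use the target model's tokenizer), and `max_fill` and `reserve_tokens` are hypothetical knobs for how much of the window you are willing to spend on the corpus and how much you hold back for the question and the answer.

```python
import math

def estimate_tokens(text, chars_per_token=4.0):
    """Crude token estimate; swap in the model's tokenizer for real use."""
    return math.ceil(len(text) / chars_per_token)

def fits_in_window(documents, context_window,
                   reserve_tokens=1_000, max_fill=0.5):
    """Return (fits, corpus_tokens, budget) for a list of document strings.

    budget is the share of the window allotted to the corpus, minus the
    tokens reserved for the question and the generated answer.
    """
    budget = int(context_window * max_fill) - reserve_tokens
    corpus_tokens = sum(estimate_tokens(doc) for doc in documents)
    return corpus_tokens <= budget, corpus_tokens, budget

# Hypothetical corpus: two documents of 4,000 and 8,000 characters.
docs = ["a" * 4_000, "b" * 8_000]
fits, used, budget = fits_in_window(docs, context_window=8_000)
```

If the check fails by a wide margin, no tuning rescues it and the corpus belongs behind a retriever. If it fails narrowly, trimming the corpus is usually a better bet than raising `max_fill`.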
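One way to keep a snapshot honest is to fingerprint the corpus and rebuild only when the fingerprint changes. The sketch below assumes the corpus is a dict of document name to text; `SnapshotCache` and its `build_fn` argument are hypothetical stand-ins, where `build_fn` represents whatever expensive step produces the KV cache in a real system.

```python
import hashlib

def corpus_fingerprint(documents):
    """Stable SHA-256 over document names and contents, in sorted order."""
    digest = hashlib.sha256()
    for name in sorted(documents):
        digest.update(name.encode("utf-8"))
        digest.update(b"\0")  # separator between name and body
        digest.update(documents[name].encode("utf-8"))
        digest.update(b"\0")  # separator between documents
    return digest.hexdigest()

class SnapshotCache:
    """Rebuilds the cached artifact only when the corpus has changed."""

    def __init__(self, build_fn):
        self._build = build_fn      # hypothetical: runs the forward pass
        self._fingerprint = None
        self._artifact = None
        self.rebuilds = 0

    def get(self, documents):
        fingerprint = corpus_fingerprint(documents)
        if fingerprint != self._fingerprint:
            # Any edit, however small, invalidates the whole snapshot.
            self._artifact = self._build(documents)
            self._fingerprint = fingerprint
            self.rebuilds += 1
        return self._artifact
```

Counting `rebuilds` over a week of real edits is a cheap way to find out whether your knowledge is closer to "quarterly" or "hourly" before committing to the architecture.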
That reframes the whole comparison. CAG is compile-time knowledge: you bake the corpus into a reusable artifact and amortize the cost across every query that artifact serves. RAG is runtime knowledge: you fetch fresh on each request and pay a small, constant cost to keep the store current. The question was never "retriever or no retriever." It is "is my knowledge static enough to compile?"
The catch CAG inherits
CAG does not escape the limits of long context — it leans on them. The same repo lists the constraint plainly: "the performance of LLMs may degrade with very long contexts." This is the lost-in-the-middle effect, where models recall facts at the start and end of a long context reliably but miss facts buried in the middle. RAG sidesteps this by only ever putting a handful of chunks in front of the model. CAG, by design, puts everything there — so as your "small, static" corpus grows toward the context limit, accuracy quietly erodes in the middle of the cache exactly where you cannot see it. CAG is strongest well inside the window, not at its edge.
The answer is usually both
The two approaches are not rivals so much as different stages of the same pipeline, and the strongest production systems compose them. Cache the small, hot, slow-changing core that every query touches — the schema, the policies, the canonical reference — and retrieve the large, volatile long tail. Work like TurboRAG, which precomputes KV caches for retrieved chunks, is already blurring the line from the RAG side, treating the cache as an accelerator for retrieval rather than a replacement.
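A minimal sketch of that composition, under loud assumptions: the retriever here is a toy word-overlap ranker standing in for a real sparse or dense retriever, the sample tickets are invented, and `build_prompt` is a hypothetical helper. The design point it illustrates is the split itself: the core text forms a stable prefix that can be cached once, and only the retrieved chunks and the question vary per request.

```python
def overlap_score(question, chunk):
    """Toy relevance score: number of lowercase words shared."""
    return len(set(question.lower().split()) & set(chunk.lower().split()))

def retrieve(question, tail_chunks, k=2):
    """Return up to k chunks with nonzero overlap, best first."""
    ranked = sorted(tail_chunks,
                    key=lambda chunk: overlap_score(question, chunk),
                    reverse=True)
    return [c for c in ranked[:k] if overlap_score(question, c) > 0]

def build_prompt(core_prefix, question, tail_chunks, k=2):
    """Split the prompt into a cacheable prefix and a per-query suffix."""
    retrieved = retrieve(question, tail_chunks, k)
    suffix = "\n".join(retrieved + ["Question: " + question])
    return core_prefix, suffix

# Hypothetical data: a small static core and a volatile long tail.
core = "Policy: password resets expire after 24 hours."
tickets = [
    "ticket 101: login fails after password reset",
    "ticket 102: invoice pdf is blank",
    "ticket 103: password reset email never arrives",
]
prefix, suffix = build_prompt(core, "why does password reset fail", tickets)
```

Because the prefix is byte-identical across requests, its KV cache can be reused as-is; edits to the ticket store touch only the retrieval side and never force a rebuild.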
So before you reach for another reranker to fix a retrieval miss, ask whether you needed retrieval at all. If the knowledge fits and holds still, the most reliable architecture is the one with the fewest moving parts — and CAG has one fewer than RAG. This is the same instinct that runs underneath RAG vs long context, contextual retrieval, and prompt caching for agents: the cheapest hop is the one you delete. Just make sure you can afford to recompile.



