The pitch was clean. Once a model can hold a million tokens, you stop chunking documents, stop running an embedding model, stop standing up a vector database, stop tuning a reranker. You take your corpus, you pour it into the prompt, and the model reads everything. Retrieval was scaffolding for the era of small windows; the windows got big; tear the scaffolding down.
Two years into that promise, the benchmarks have come back, and they tell a more useful story than either side wanted. Long context did not kill retrieval-augmented generation. It changed what you are choosing between, and the honest framing is not "retrieve versus stuff" but "which failure mode can you afford to debug?"
The window does not read evenly
Start with the finding that should have ended the "RAG is dead" conversation and somehow didn't. In "Lost in the Middle," Liu and colleagues put the same relevant fact at different positions inside a long prompt and measured whether the model used it. Performance traced a U: models answered well when the needed information sat at the very beginning or the very end of the context, and accuracy sagged by as much as twenty percentage points when the same fact was buried in the middle. The information was present. The model had been given it. It read past it anyway.
That is the whole problem in one curve. A retrieval system that drops a document fails in a way you can see: the answer is missing a fact, the citation isn't there, often the model says it doesn't know. A long-context model that glides past a fact buried at token 400,000 fails in a way you cannot see: it returns a fluent, confident, complete-looking answer that is simply wrong, and nothing in the output flags which of the million tokens it actually used.
Retrieval fails loudly and long context fails silently — and a silent failure in production is the expensive kind, because you find it after the user does.
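The position sweep behind that U-shaped curve is easy to reproduce against your own stack. What follows is a minimal sketch, not the paper's harness: `ask_model` is a hypothetical callable standing in for whatever client you use, the filler passages are yours to supply, and "correct" is approximated by a crude substring match.

```python
from typing import Callable, Dict, List, Sequence


def place_needle(filler: Sequence[str], needle: str, depth: float) -> List[str]:
    """Insert `needle` among the filler passages at a fractional depth.

    depth=0.0 puts it first, depth=1.0 puts it last.
    """
    if not 0.0 <= depth <= 1.0:
        raise ValueError("depth must be between 0.0 and 1.0")
    index = round(depth * len(filler))
    return list(filler[:index]) + [needle] + list(filler[index:])


def position_sweep(
    ask_model: Callable[[str, str], str],  # hypothetical: (context, question) -> answer
    filler: Sequence[str],
    needle: str,
    question: str,
    expected: str,
    depths: Sequence[float] = (0.0, 0.25, 0.5, 0.75, 1.0),
) -> Dict[float, bool]:
    """Ask the same question with the needle at each depth; record hit or miss."""
    results: Dict[float, bool] = {}
    for depth in depths:
        context = "\n\n".join(place_needle(filler, needle, depth))
        answer = ask_model(context, question)
        # Assumption: a case-insensitive substring match is a good enough grader.
        results[depth] = expected.lower() in answer.lower()
    return results
```

Run it over many needles and average per depth; a dip in the middle rows is the silent failure described above, made visible before a user finds it.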
Bigger windows, new failure modes
Databricks ran the cleaner version of the experiment: twenty models, real corpora, context dialed from 2,000 tokens up past 128,000. More retrieved context helped, until it didn't. Llama-3.1-405B started degrading past roughly 32k tokens; GPT-4-0125 held on to about 64k before sliding. And the failures got weirder as the window filled. Claude 3 Sonnet's rate of refusing to answer on copyright grounds climbed from 3.7% at 16k tokens to 49.5% at 64k. One model simply stopped following instructions as the context grew. These are not the failures of a system that is running out of room. They are the failures of a system being asked to attend to more than it comfortably can, and the only signal you get is a quietly worse answer.
This is why the much-cited "99.7% needle-in-a-haystack recall" number is a trap. Finding one planted sentence in a huge document is a measure of whether the window can surface a token. It tells you almost nothing about whether the model will integrate the seventh relevant fact out of forty while ignoring three plausible distractors — the thing a real agent query actually requires. The NoLiMa benchmark made the gap brutal by stripping the literal word-overlap that lets a model cheat: force it to find a fact by meaning rather than by matching keywords, and GPT-4o's score falls from 99.3% to 69.7% by 32k tokens. Measured by where a model still holds 85% of its short-context quality, NoLiMa puts GPT-4o's effective window around 8k tokens — against an advertised 128k. The advertised number is the size of the room; the effective number is how far into it the model can still see.
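The 85% rule is simple enough to apply to your own evaluation numbers. Here is a small sketch under stated assumptions: the function name and the scores are invented for illustration (they are not NoLiMa's data), and it stops at the first context length that falls below the cutoff rather than crediting a later recovery.

```python
from typing import Mapping, Optional


def effective_length(
    scores_by_length: Mapping[int, float],
    base_score: float,
    threshold: float = 0.85,
) -> Optional[int]:
    """Largest tested context length whose score stays at or above
    threshold * base_score, walking upward from the shortest length.

    Returns None if even the shortest tested length misses the cutoff.
    """
    cutoff = threshold * base_score
    best: Optional[int] = None
    for length in sorted(scores_by_length):
        if scores_by_length[length] < cutoff:
            break  # assumption: quality does not count once it has dipped
        best = length
    return best


# Hypothetical scores for an imaginary model, for illustration only.
scores = {1_000: 94.0, 4_000: 90.0, 8_000: 83.0, 16_000: 78.0, 32_000: 70.0}
print(effective_length(scores, base_score=95.0))  # cutoff is 80.75, so this prints 8000
```

The point of computing it yourself is that the result depends on your task and your grader, not on the number on the model card.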
What each side is actually good at
None of this makes long context a mistake. Google's own head-to-head, bluntly titled "Retrieval Augmented Generation or Long-Context LLMs?", found that when you can afford it, long context consistently outperforms RAG on average quality. Stuffing the window keeps the model from being starved by a bad retrieval step, and it lets the model reason across the gaps between documents: it can notice what a chunk-retriever would never retrieve because the answer is an absence, not a passage.
But "when you can afford it" is doing enormous work, and the same paper quantified the catch. For more than 60% of queries, RAG and long context produced identical predictions; the expensive option bought nothing. Their Self-Route method, which lets the model itself decide whether a query needs the full window, cut cost by 65% on Gemini-1.5-Pro and 39% on GPT-4o while holding quality near the long-context ceiling. The lesson is not "pick one." It is that paying long-context prices on every query is paying a premium that, most of the time, changes no answer.
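The routing idea can be sketched in a few lines. This is a loose illustration in the spirit of Self-Route, not the paper's implementation: `retrieve` and `ask_model` are hypothetical callables you would wire to your own retriever and model client, and the `unanswerable` sentinel and prompt wording are assumptions.

```python
from typing import Callable, List, Tuple

UNANSWERABLE = "unanswerable"  # assumed sentinel the cheap prompt asks the model to emit


def self_route(
    query: str,
    retrieve: Callable[[str], List[str]],  # hypothetical: query -> top chunks
    ask_model: Callable[[str, str], str],  # hypothetical: (context, question) -> answer
    full_corpus: str,
) -> Tuple[str, str]:
    """Try the cheap retrieval path first; pay for the full window only
    when the model itself says the retrieved chunks are not enough.

    Returns (answer, route) where route is "rag" or "long_context".
    """
    chunks = retrieve(query)
    cheap_question = (
        f"{query}\n\nIf the passages above do not contain the answer, "
        f"reply with the single word '{UNANSWERABLE}'."
    )
    answer = ask_model("\n\n".join(chunks), cheap_question)
    if UNANSWERABLE not in answer.lower():
        return answer, "rag"
    # Fallback: the expensive path, reserved for queries that need it.
    return ask_model(full_corpus, query), "long_context"
```

Logging the returned route per query gives you the split directly: if most traffic resolves on the `rag` path, the full window is a fallback, not a default.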
And there are jobs the window structurally cannot do. Per-user access control lives in the retrieval step: you filter to the documents this user is allowed to see before anything reaches the model; a loaded context window has no native sense of who is asking. Freshness lives there too: an index updates on a webhook, a prompt is frozen the moment you build it. Citations, provenance, "show me the source": all of it is a property of having retrieved discrete things, not of having read everything at once. Anthropic, whose models ship some of the largest windows available, published Contextual Retrieval in September 2024, a technique for making RAG fail less that cut failed retrievals by 49%, and by 67% when paired with reranking. Companies do not optimize the thing they believe is dead.
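To make the access-control point concrete, here is a toy sketch of permission filtering inside the retrieval step. Everything in it is illustrative: the `Doc` type, the group-based permission model, and the word-overlap scoring are stand-ins for whatever your index and ranker actually do. What matters is the order of operations: filter first, rank second.

```python
from dataclasses import dataclass
from typing import AbstractSet, FrozenSet, List, Sequence


@dataclass(frozen=True)
class Doc:
    doc_id: str
    text: str
    allowed_groups: FrozenSet[str]  # hypothetical permission model: group names


def retrieve_for_user(
    query: str,
    docs: Sequence[Doc],
    user_groups: AbstractSet[str],
    top_k: int = 3,
) -> List[Doc]:
    """Return the top_k documents this user may see, ranked by a toy score."""
    # Permission filter comes first, so forbidden text never reaches the prompt.
    visible = [d for d in docs if d.allowed_groups & user_groups]
    query_words = set(query.lower().split())
    # Toy relevance: count of shared words. A real system would use an index.
    ranked = sorted(
        visible,
        key=lambda d: len(query_words & set(d.text.lower().split())),
        reverse=True,
    )
    return ranked[:top_k]
```

The same call also gives you provenance for free: the `doc_id` values of whatever was returned are the citations.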
The actual decision
So stop asking which one wins. Ask what you can pay for and what you can see.
Use the window when the corpus is bounded and small enough to fit, when the task is synthesis across the whole set, when reasoning about what's missing matters, and when you can absorb the per-query cost. Use retrieval when the corpus is larger than any window, when documents carry per-user permissions, when the data changes under you, when you need to show your sources, and when you would rather your system fail by saying "I couldn't find it" than by saying something wrong with total confidence.
For most production agents, that last clause settles it. A retrieval miss is a bug report. A million-token model reading past the one line that mattered is an incident nobody opens, because the answer looked fine. Choose the failure you can find.



