The headline number on every frontier model is its context window, and the numbers keep getting more absurd: 200K, a million, ten million tokens. The implied promise is that you can stop engineering retrieval and just paste everything in. The research of the last two years says the opposite, and it says it consistently enough that "just use a big window" should now sound, to anyone building agents, like "just add more cooks."
The uncomfortable finding has a name — context rot — and a simple definition: a model's accuracy degrades as its input grows, even when the answer is right there and the task is trivial. The advertised window is a capacity limit, not a quality guarantee. A million-token context is not a million usable tokens.
The evidence is boringly consistent
Four independent results, different teams, same shape.
Chroma's 2025 Context Rot report tested 18 frontier models, including GPT-4.1, Claude 4, Gemini 2.5, and Qwen3, and watched performance fall as token count rose, "even on simple tasks." Two refinements matter for builders. First, degradation accelerates as the semantic similarity between the query and the target text drops: when the answer doesn't share vocabulary with the question, length hurts more. Second, and more damning, distractors compound the damage non-uniformly: a single plausible-but-wrong passage lowers accuracy, four lower it more, and some distractors are far more "distracting" than others. Length alone isn't the enemy; length full of near-misses is.
The original crack in the "long context just works" story was Liu et al.'s 2023 Lost in the Middle. Models use information best at the very start (primacy) or very end (recency) of the window and worst in the middle, a U-shaped curve. The number that should stop you: on multi-document question answering, when the answer was buried mid-context, GPT-3.5-Turbo's accuracy fell below its closed-book baseline of 56.1%. Read that again. The model did worse with the documents than with no documents at all. The failure wasn't "couldn't retrieve." It was that irrelevant surrounding tokens degraded the reasoning the model was otherwise capable of.
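One way to check for the U-shaped curve on your own stack is to sweep the position of the answer-bearing document and score each position separately. The following is a minimal sketch, not the paper's harness: `build_context` and `position_sweep` are hypothetical helper names, and the documents are assumed to be plain strings.

```python
def build_context(filler_docs, needle, depth):
    """Insert `needle` among `filler_docs` at a relative depth in [0.0, 1.0].

    depth=0.0 puts the needle first, depth=1.0 puts it last.
    """
    if not 0.0 <= depth <= 1.0:
        raise ValueError("depth must be between 0.0 and 1.0")
    index = round(depth * len(filler_docs))
    docs = filler_docs[:index] + [needle] + filler_docs[index:]
    return "\n\n".join(docs)


def position_sweep(filler_docs, needle, depths=(0.0, 0.25, 0.5, 0.75, 1.0)):
    """Return {depth: context} so each placement can be evaluated on its own."""
    return {depth: build_context(filler_docs, needle, depth) for depth in depths}
```

Feeding each variant to the model with the same question and plotting accuracy against depth shows whether the middle of your window is as soft as the paper found.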
Then there's the needle-in-a-haystack (NIAH) problem, the test everyone cites to claim their model "aced 1M tokens." NoLiMa (2025) showed why that test flatters: standard needle tests can be solved by literal keyword matching between the question and the planted sentence. Strip that overlap, forcing actual associative reasoning, and the floor drops out: at 32K tokens, 10 of 12 models scored below 50% of their short-context baseline, and even GPT-4o fell from 99.3% under 1K tokens to 69.7% at 32K. NIAH measures retrieval; agents need reasoning, and the gap between them widens with every token.
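Before trusting a needle test, it helps to measure how much of the question's vocabulary is simply repeated in the planted sentence. The sketch below uses Jaccard overlap of content words as a rough proxy; it is an assumption of this example, not NoLiMa's own metric, and the stopword list and sample sentences are made up for illustration.

```python
import re

# Tiny illustrative stopword list; a real eval would use a fuller one.
STOPWORDS = {
    "a", "an", "the", "of", "in", "is", "to", "was",
    "which", "who", "what", "has", "been",
}


def content_words(text):
    """Lowercased word set with stopwords removed."""
    return {w for w in re.findall(r"[a-z0-9]+", text.lower()) if w not in STOPWORDS}


def lexical_overlap(question, needle):
    """Jaccard overlap of content words between a question and its needle."""
    q, n = content_words(question), content_words(needle)
    if not q or not n:
        return 0.0
    return len(q & n) / len(q | n)


needle = "Mara keeps her boat moored beside the Rialto Bridge."
literal = "Who keeps a boat moored beside the Rialto Bridge?"
associative = "Which character has been to Venice?"

print(lexical_overlap(literal, needle))      # high: solvable by keyword matching
print(lexical_overlap(associative, needle))  # zero: needs the Rialto-Venice link
```

Questions with near-zero overlap are the ones that tell you whether the model is reasoning over the context or just pattern-matching strings.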
Finally, RULER (NVIDIA, 2024) put a tape measure on the marketing. Judged on harder multi-hop and aggregation tasks rather than lexical lookups alone, "almost all models fall below the threshold before reaching the claimed context lengths." Models advertising 128K were often effective only to 32K or 64K. Effective context is a fraction of advertised context, and you should assume the fraction, not the headline.
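The idea of an effective length can be applied to your own evals: score the model at several input lengths and take the longest one that still clears a quality bar. This is a minimal sketch with a hypothetical function name, a threshold you choose yourself, and made-up scores; it is not RULER's scoring code.

```python
def effective_context_length(scores_by_length, threshold):
    """Largest tested length such that it and every shorter length meet the threshold.

    Returns 0 if even the shortest tested length falls below the threshold.
    """
    effective = 0
    for length in sorted(scores_by_length):
        if scores_by_length[length] < threshold:
            break
        effective = length
    return effective


# Made-up accuracy figures for a model advertised at 128K tokens.
scores = {
    4_000: 0.95,
    8_000: 0.93,
    16_000: 0.90,
    32_000: 0.86,
    64_000: 0.79,
    128_000: 0.65,
}
advertised = 128_000
effective = effective_context_length(scores, threshold=0.85)
print(effective, effective / advertised)
```

With these illustrative numbers the model is effective to 32K, a quarter of its advertised window, which is the kind of fraction worth planning around.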
Why this happens, briefly
It's not a mystery bug; it's the attention mechanism doing exactly what it does. As Anthropic's context-engineering guidance puts it, every token must attend to every other, producing n² pairwise relationships, and "this ability gets stretched thin" as the window fills. The window is, in Anthropic's framing, "a precious, finite resource," and the model's ability to recall from it "decreases" as you spend it. Drew Breunig's taxonomy is a useful field guide to the failure modes: poisoning (a hallucination that persists and keeps misleading), distraction (the "context rot" case), confusion (irrelevant tokens pulling attention), and clash (parts of the context that contradict one another, conflicting instructions included). They all reduce to the same lever.
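The n² scaling is easy to underestimate, so a quick calculation helps. This sketch counts ordered token pairs under plain full self-attention; it ignores causal masking and any sparse or windowed attention a given model may use, and the function names are illustrative.

```python
def attention_pairs(n_tokens: int) -> int:
    """Ordered token pairs under full self-attention: n squared."""
    return n_tokens * n_tokens


def growth_factor(small: int, large: int) -> float:
    """How many times more pairs the larger window has than the smaller one."""
    return attention_pairs(large) / attention_pairs(small)


if __name__ == "__main__":
    for n in (1_000, 32_000, 128_000, 1_000_000):
        print(f"{n:>9,} tokens -> {attention_pairs(n):>17,} pairs")
    # Growing the window 1,000x grows the pairwise relationships 1,000,000x.
    print(growth_factor(1_000, 1_000_000))
```

The window grows linearly while the relationships the model has to weigh grow quadratically, which is the arithmetic behind attention being "stretched thin."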
The lever is signal-to-noise, not size
If the bottleneck is usable context rather than capacity, then the engineering goal flips. You are not trying to fit more in; you are trying to raise the ratio of high-signal to low-signal tokens at the moment of inference.
The win condition isn't a bigger window. It's the smallest set of tokens that still contains the answer.
In practice, for an agent: retrieve just-in-time rather than front-loading everything, which is why retrieval still earns its place even as windows balloon. Compact the running history instead of appending to it forever; a rolling summary beats a verbatim transcript. Isolate unrelated subtasks into sub-agents with their own clean windows so one task's clutter doesn't rot another's. Position your highest-priority instructions where the model actually attends: near the top or bottom, not the soft middle. And measure with low-overlap, distractor-laden evals, because a clean needle-in-a-haystack score tells you almost nothing about how your agent will behave on real, messy input. The discipline behind all of this is context engineering, and it's becoming the difference between a demo and a system.
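Two of these habits, selecting only what fits a budget and keeping priority text at the edges, can be sketched in a few lines. Everything here is an assumption for illustration: the `Chunk` type, the `relevance` score (imagined as coming from your retriever), and the whitespace token counter standing in for a real tokenizer.

```python
from dataclasses import dataclass


@dataclass
class Chunk:
    text: str
    relevance: float  # hypothetical retriever score, higher is better


def count_tokens(text):
    """Crude stand-in for a real tokenizer: whitespace-separated words."""
    return len(text.split())


def assemble_context(instructions, chunks, question, budget):
    """Greedily keep the most relevant chunks that fit within `budget` tokens.

    Instructions go at the top and the question at the bottom, so the
    highest-priority text sits at the edges of the window.
    """
    remaining = budget - count_tokens(instructions) - count_tokens(question)
    selected = []
    for chunk in sorted(chunks, key=lambda c: c.relevance, reverse=True):
        cost = count_tokens(chunk.text)
        if cost <= remaining:
            selected.append(chunk)
            remaining -= cost
    parts = [instructions] + [c.text for c in selected] + [question]
    return "\n\n".join(parts)


chunks = [
    Chunk("The lease ends in March.", 0.9),
    Chunk("The office has twelve desks and two meeting rooms.", 0.2),
    Chunk("Rent is due monthly.", 0.6),
]
prompt = assemble_context(
    "Answer from the notes only.", chunks, "When does the lease end?", budget=20
)
print(prompt)
```

A greedy fill is the simplest possible policy; the point is that the budget is set well below the advertised window and low-relevance material is dropped rather than appended.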
The bigger windows are real and genuinely useful — for the cases that need them. But the reflex they invite, stop curating and just paste it all, is the one the data most firmly rejects. The model will read everything you give it. That was never the question. The question is whether it can still think once you have.



