If you built a RAG pipeline in 2024, you already did context engineering. You just called it something else, and you probably did the most important part backwards.
The term arrived loudly in June 2025. Shopify's Tobi Lütke posted that he preferred "context engineering" to "prompt engineering" because it "describes the core skill better: the art of providing all the context for the task to be plausibly solvable by the LLM." Six days later Andrej Karpathy amplified it: "+1 for 'context engineering' over 'prompt engineering'," calling it "the delicate art and science of filling the context window with just the right information for the next step." The reframe stuck. By the time Anthropic published its engineering guide in September 2025, the framing was settled: prompt engineering is a subset of a larger problem.
Here is the part most pipelines get wrong. The instinct from the RAG era is to fill the window with more retrieved chunks, more history, and more tool schemas, because the model offers a 200K-token window and unused capacity feels wasted. That instinct treats context as a bucket. It is not a bucket. It is a budget that degrades as you spend it.
The numbers say the window lies to you
The cleanest evidence is Chroma's 2025 Context Rot report. They ran 18 frontier models — GPT-4.1, Claude 4, Gemini 2.5, Qwen3 among them — and found that every single one got less reliable as input length grew. Crucially, this happened well before any model approached its context limit. Capacity available does not mean capacity usable.
This is not new physics; it is the maturation of a finding from 2023. Liu et al.'s "Lost in the Middle" (TACL 2024) showed that accuracy follows a U-shaped curve against the position of the relevant fact: strong at the start, strong at the end, and significantly worse when the needle sits in the middle of a long context. The same fact, moved from the edge of the context toward its center, becomes harder to retrieve. Position is not free.
The mechanism is structural. Transformer self-attention scales quadratically: every token attends to every other token. Chroma's framing makes the scale concrete: a 100,000-token context implies on the order of 10 billion pairwise relationships competing for the same finite attention. Anthropic puts the same idea in budget language: models have a finite "attention budget," and "every new token introduced... depletes this budget by some amount." Add a token and you do not just pay for that token; you dilute the signal of every token already there.
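The arithmetic behind that 10 billion figure is easy to check. The sketch below simply squares the token count, which counts every ordered pair of tokens (including each token with itself). The function name is made up for illustration, and causal masking in a decoder-only model would roughly halve the count without changing the quadratic growth.

```python
def pairwise_relationships(n_tokens: int) -> int:
    """Ordered token pairs scored by full self-attention: n squared."""
    return n_tokens * n_tokens


# Growing the context 10x grows the pair count 100x.
for n in (1_000, 10_000, 100_000):
    print(f"{n:>7,} tokens -> {pairwise_relationships(n):>14,} pairs")
```

Doubling the context quadruples the number of pairs, which is why the cost of "just a few more chunks" grows faster than it looks.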
The scarce resource is not space in the window. It is the model's attention, and you spend it on every token you add, useful or not.
So the discipline is eviction, not accumulation
Once you accept that attention decays as the window fills, the job inverts. You are no longer maximizing what you put in. You are minimizing it — Anthropic's phrasing is finding "the smallest possible set of high-signal tokens that maximize the likelihood of some desired outcome." Curation, then eviction. Four moves do most of the work, and three of them are about removing context, not adding it.
Retrieve just-in-time, not just-in-case. The RAG reflex is to fetch everything plausibly relevant and stuff it in up front. Better: let the agent pull context when it needs it — a file path, a query, a lookup at the moment of use — so the window holds what the current step requires rather than what the whole task might. Lazy loading is a context strategy, not just a performance one.
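One way to picture just-in-time retrieval is a context object that exposes only cheap references up front and loads the content behind a reference when a step asks for it. This is a minimal sketch, not any framework's API: `LazyContext`, its methods, and the sample documents are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class LazyContext:
    """Holds cheap references; loads full content only on request."""
    loaders: Dict[str, Callable[[], str]]
    loaded: Dict[str, str] = field(default_factory=dict)

    def index(self) -> List[str]:
        # Only the reference names enter the window up front.
        return sorted(self.loaders)

    def fetch(self, ref: str) -> str:
        # Load on first use, then reuse the cached copy.
        if ref not in self.loaded:
            self.loaded[ref] = self.loaders[ref]()
        return self.loaded[ref]

    def release(self, ref: str) -> None:
        # Evict content once the step that needed it is done.
        self.loaded.pop(ref, None)


# Hypothetical documents; the lambdas stand in for file or database reads.
ctx = LazyContext(loaders={
    "docs/billing.md": lambda: "Refunds are processed within 5 days.",
    "docs/auth.md": lambda: "Tokens expire after 1 hour.",
})

references = ctx.index()             # what the model sees before it asks
source = ctx.fetch("docs/auth.md")   # pulled only when the step needs it
```

After this runs, only `docs/auth.md` has been loaded; the billing document costs the window nothing beyond its name.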
Compact long histories. A multi-turn agent accumulates tool outputs and dead ends. Compaction — summarizing the trajectory so far and discarding the raw transcript — resets the density of the window. You keep the conclusions and evict the scaffolding that produced them. Anthropic ships this directly: its context-management tooling treats summarization and memory as first-class.
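Compaction can be sketched as a function that swaps older turns for a single summary once the history exceeds a budget. This is an illustration under stated assumptions, not Anthropic's tooling: token counts are approximated by whitespace-separated words, and `summarize` is a placeholder for what would normally be a model call.

```python
from typing import Callable, List


def estimate_tokens(text: str) -> int:
    # Crude stand-in for a real tokenizer: one token per word.
    return len(text.split())


def compact(
    history: List[str],
    budget: int,
    keep_recent: int,
    summarize: Callable[[List[str]], str],
) -> List[str]:
    """Replace older turns with one summary once history exceeds budget."""
    total = sum(estimate_tokens(turn) for turn in history)
    if total <= budget or len(history) <= keep_recent:
        return history
    split = len(history) - keep_recent
    old, recent = history[:split], history[split:]
    # Keep the conclusions (summary plus recent turns), drop the raw transcript.
    return [summarize(old)] + recent


def stub_summary(turns: List[str]) -> str:
    # Hypothetical summarizer; a real one would call a model.
    return f"[summary of {len(turns)} earlier turns]"


history = [
    "tool output " * 50,
    "dead end " * 50,
    "found the bug in parser",
    "next: write the fix",
]
compacted = compact(history, budget=50, keep_recent=2, summarize=stub_summary)
```

The two long turns collapse into one short summary line while the two most recent turns survive verbatim. A real implementation would also need to check that the summary itself fits the budget.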
Isolate subtasks in sub-agents. Give a noisy subtask, such as a deep search or a long file scan, its own clean context window. The sub-agent burns its budget on the messy work and returns only the distilled result to the parent. The orchestrator's window never sees the 30,000 tokens of intermediate junk. Isolation lets you spend a sub-agent's attention budget instead of the orchestrator's.
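The isolation pattern reduces to one rule: the sub-agent gets a fresh scratch context, and only its final result crosses back to the parent. The sketch below simulates that with plain lists. `run_isolated` and `noisy_search` are hypothetical names, and the worker fakes the intermediate noise rather than calling a model.

```python
from typing import Callable, List, Tuple


def run_isolated(
    task: str, worker: Callable[[str, List[str]], str]
) -> Tuple[str, int]:
    """Run a worker in a fresh scratch context; return only its result."""
    scratch: List[str] = [task]      # the sub-agent's own window
    result = worker(task, scratch)   # the worker may append freely
    return result, len(scratch)      # scratch is discarded after this


def noisy_search(task: str, scratch: List[str]) -> str:
    # Hypothetical worker that generates a lot of intermediate output.
    for i in range(200):
        scratch.append(f"scanned file_{i}.py: no match")
    scratch.append("scanned parser.py: match on line 42")
    return "Match found in parser.py, line 42."


parent_context: List[str] = ["user: where is the tokenizer bug?"]
result, scratch_size = run_isolated("find tokenizer bug", noisy_search)
parent_context.append(f"sub-agent: {result}")
```

The scratch context grew to a couple of hundred entries, but the parent gained exactly one line.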
Load tools just-in-time. Tool definitions are tokens too. A 40-tool agent that injects all 40 schemas on every call is paying attention tax on 39 tools it will not use this turn. Surface definitions when they become relevant.
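A rough sketch of just-in-time tool loading: keep the full registry outside the window and surface only the definitions that match the current step. The registry, tool names, and keyword matching below are all illustrative assumptions; a production selector would more likely use embeddings or an explicit tool-search step.

```python
from typing import Dict, List

# Hypothetical registry; real schemas would carry full parameter definitions.
TOOLS: Dict[str, dict] = {
    "read_file": {
        "description": "Read a file from disk",
        "keywords": {"file", "read", "open"},
    },
    "run_sql": {
        "description": "Run a SQL query",
        "keywords": {"sql", "query", "database"},
    },
    "send_email": {
        "description": "Send an email",
        "keywords": {"email", "send", "notify"},
    },
}


def select_tools(step: str, registry: Dict[str, dict], limit: int = 5) -> List[str]:
    """Return names of tools whose keywords overlap the step text."""
    words = set(step.lower().split())
    scored = [
        (len(words & spec["keywords"]), name)
        for name, spec in registry.items()
    ]
    # Keep only tools with at least one hit, best match first.
    hits = sorted((s for s in scored if s[0] > 0), key=lambda s: (-s[0], s[1]))
    return [name for _, name in hits[:limit]]


active = select_tools("query the database for overdue invoices", TOOLS)
```

Only the schemas named in `active` would be sent with this call; the rest stay out of the window until a step needs them.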
What this means for your pipeline
If you came from RAG, the upgrade is not a new framework. It is a posture change. Stop optimizing recall — "did I retrieve everything relevant?" — and start optimizing density — "is every token in this window earning its place?" The two goals conflict more often than the marketing admits. A retriever that returns 20 mostly-relevant chunks can score better on recall and worse on task success, because the eight distractors rot the attention available to the twelve that mattered. Chroma's report names exactly this: semantically similar but irrelevant content actively misleads the model.
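To make the recall-versus-density trade concrete, here is a small sketch that selects chunks by a relevance floor and a token budget instead of a fixed top-k. The scores are assumed to come from some reranker, the word-count token estimate is a crude stand-in for a tokenizer, and the sample chunks are invented.

```python
from typing import List, Tuple

Chunk = Tuple[str, float]  # (text, relevance score in [0, 1])


def select_dense(
    chunks: List[Chunk], min_score: float, token_budget: int
) -> List[str]:
    """Keep the highest-scoring chunks that clear the floor and fit the budget."""
    kept: List[str] = []
    used = 0
    for text, score in sorted(chunks, key=lambda c: c[1], reverse=True):
        if score < min_score:
            break  # everything after this is a likely distractor
        cost = len(text.split())  # crude token estimate
        if used + cost > token_budget:
            continue  # skip chunks that would overspend the budget
        kept.append(text)
        used += cost
    return kept


chunks: List[Chunk] = [
    ("refund policy allows returns within 30 days", 0.91),
    ("shipping policy covers delays", 0.42),
    ("refunds require the original receipt", 0.88),
    ("our company history since 1999", 0.15),
]
selected = select_dense(chunks, min_score=0.8, token_budget=20)
```

A top-4 retriever would have passed all four chunks through. The floor drops the two weak matches, which lowers recall on paper and leaves more attention for the chunks that answer the question.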
Prompt engineering was a search for the right string. Context engineering is operations on a living, shrinking budget: assemble, measure, evict, repeat, across every step of a running agent. The window is not the asset. The attention is. And unlike the window, it is never simply full or empty; it stretches thinner with every token you add, and you only win it back by taking tokens out.
Build accordingly. Curate hard, evict early, and stop confusing the size of the bucket with the amount of water the model can actually drink.



