The reflex, when an agent's context gets long, is to shrink it. Summarize the history. Compress the scratchpad. Keep the prompt tight. Almost every production playbook treats accumulated context as a liability to be managed down. A paper out of Stanford, SambaNova, and UC Berkeley argues the reflex is backwards — and has the benchmark numbers to make the case awkward to ignore.
The method is Agentic Context Engineering, or ACE. Its claim is narrow and concrete: you can make an agent meaningfully better at a task by editing the text it reads rather than the weights it runs on — and the editing strategy that wins is to let the context grow, as a structured, versioned playbook, instead of repeatedly rewriting it into something shorter.
The bug that masquerades as good hygiene
Start with the failure mode ACE is named after. When an agent maintains a running memory by re-summarizing it each step — the standard "keep a cheatsheet" pattern — every rewrite is a lossy paraphrase of the last. Detail leaks out a little at a time, then all at once. The paper documents a single update step where a context of roughly 18,000 tokens collapsed to 122, and task accuracy fell from 66.7% to 57.1% in that one move. The model didn't get dumber. Its notes got amnesia.
This is the part worth internalizing, because it inverts a habit. The thing we call "summarization" — the tidy, responsible-sounding step we add to keep token costs down — is also the thing quietly deleting the agent's hard-won specifics. ACE's authors call the underlying tendency brevity bias: optimization pressure toward shorter context strips the domain insight that made the context useful in the first place. The cure isn't a better summarizer. It's to stop summarizing as the default.
Generator, Reflector, Curator
ACE structures adaptation as a loop of three roles, building on the earlier "Dynamic Cheatsheet" line of work:
- Generator runs the task and emits the reasoning trace — what it tried,
in what order, where it stalled.
- Reflector reads that trace against the outcome and extracts the lesson:
this tactic worked, that assumption was wrong, this edge case bit us.
- Curator folds the lesson into a persistent playbook as a small
delta — a discrete, itemized entry — rather than rewriting the whole document.
The separation matters. By splitting judging what happened (Reflector) from deciding what to keep (Curator), ACE avoids the single-pass "rewrite my notes" step where collapse happens. New items are merged and de-duplicated deterministically — no LLM re-paraphrasing the entire context — so the playbook accretes knowledge the way a codebase accretes commits. The authors call this grow-and-refine: append mostly, prune occasionally, never blanket-rewrite.
Treat the agent's context like a version-controlled codebase, not a summary you keep re-paraphrasing.
The numbers that make it more than a blog post
On the AppWorld agent benchmark, ACE improved task performance by +10.6%, and on finance reasoning (FINER) by +8.6%, over strong context-adaptation baselines. The headline result is the one VentureBeat led with: an open model, DeepSeek-V3.1, equipped with an ACE-evolved playbook matched the top-ranked production agent on the AppWorld leaderboard — IBM's CUGA, powered by GPT-4.1 — and on the harder "challenge" split, edged ahead.
The efficiency story is the quieter, more practical one. Because updates are small deltas merged without a model in the loop, ACE cut adaptation latency by ~86.9% on average and needed fewer rollouts and lower token-dollar cost than methods that regenerate context wholesale. Adapting an agent stopped being an expensive batch job and became something closer to an incremental write.
Where it fits — and where it doesn't
ACE is not a fine-tuning killer, and pretending otherwise is the kind of hype this desk avoids. Fine-tuning still wins when you need a skill baked durably into weights, or when there's no clean feedback signal at inference time. ACE's whole engine runs on feedback: the Reflector needs to know whether the last attempt actually worked. Point it at a task with no execution result, no unit test, no ground truth, and it will faithfully reinforce whatever it guessed — garbage in, playbook out. It shines precisely where outcomes are checkable: coding agents, tool-use trajectories, anything with a pass/fail oracle.
But as a default stance for long-running agents, the lesson lands. We have spent two years building elaborate machinery to compress context and fight context rot. ACE suggests a chunk of that effort optimizes the wrong variable. The question for your agent isn't "how do I keep the context small?" It's "how do I keep the context correct as it grows?" — which is a problem software already solved with diffs, dedup, and version history.
If you're choosing between adaptation strategies — fine-tuning, prompt optimization, or agent memory — ACE is the argument for a fourth option you may have been compressing out of existence. The model was never the bottleneck. Its notes were.



