Every long-running agent eventually hits the same wall. The conversation, the tool outputs, the half-finished scratch work — it all accumulates in the context window until there is no room left to think. So the agent compresses its own history: it summarizes what came before and throws the raw tokens away. We've written about the mechanisms before — context editing, compaction, the memory tool. This piece is about the trigger. When, exactly, should the agent pull that lever?

The default answer, baked into most coding agents and harnesses, is a number. When accumulated tokens cross some threshold — 70% of the window, say — fire the compaction step. It's simple, it's predictable, and a June 2026 paper argues it is the wrong question entirely.

The counter measures the wrong thing#

The argument in Self-Compacting Language Model Agents is deceptively plain: a token counter measures the size of the context, but the cost of compacting is structural, not numeric.

Think about what a threshold trigger actually does. It watches a number tick up and, at some arbitrary boundary, interrupts whatever the model is doing to summarize and discard. The number knows nothing about what the model is doing at that moment. It doesn't know whether the agent just closed out a sub-task cleanly — a safe moment to forget the details — or whether it's three steps into a delicate derivation with partial results scattered across the last few turns.

The token count tells you the context is full. It cannot tell you that forgetting is safe. Those are different facts, and only one of them should pull the trigger.

When the threshold fires mid-derivation, the summary it produces is lossy in exactly the wrong place. The model has to reconstruct the partial work it just did — re-deriving the intermediate result, re-reading the file it had already parsed, re-establishing the state it had built up. You "saved" tokens by compacting, then spent more tokens climbing back to where you were. On a hard task, a clock-based trigger can make an agent both slower and less accurate while the dashboard reports a tidy reduction in context size.

A rubric instead of a counter#

SelfCompact's move is to hand the decision to the model. It pairs two inference-time pieces, neither of which requires fine-tuning:

The reframing is the whole idea. Compaction stops being a maintenance interrupt the platform schedules and becomes a judgment call the agent makes — because the agent is the only party that can see the structure of the work, not just its byte count. Forgetting is a decision about what is safe to forget, and that is a question about meaning, not memory pressure.

Cheaper and more accurate at once#

Here is the result that should make you look twice. In most systems, cost and quality trade off: you can spend more tokens to be more accurate, or fewer to be cheaper. Self-compaction reportedly improves both.

Against a no-summarization baseline, the paper reports gains of up to 18.1 points on math and 5 to 9 points on agentic search — while running at 30 to 70% lower cost per question than fixed-interval summarization. That held across six benchmarks and seven different models, with no fine-tuning and no external supervision.

The mechanism behind the free lunch is intuitive once you accept the structural-cost framing. A clock recompacts whether or not the agent needs it, paying for summaries nobody asked for and occasionally kneecapping a derivation. A model that compacts only at safe boundaries does it less often and better — fewer summaries, none of them landing mid-thought. The cost savings and the accuracy gains come from the same source: not compacting at the wrong time.

The catch worth naming#

None of this makes the threshold obsolete by fiat, and the honest version of the story has an edge case the rubric itself flags. "Suppress when stuck" is doing a lot of work. An agent that misjudges stuck for converging can compact away the breadcrumbs it needed; an agent that's too cautious can sit on a bloated context and blow the window anyway. The model's self-assessment is now load-bearing, and self-assessment is not a solved problem.

The pragmatic read: keep a hard threshold as a backstop — a ceiling the model is not allowed to cross — but let the model make the normal call below it. That mirrors the broader context-engineering consensus, where compression is one of four levers (write, select, compress, isolate) rather than a single panic button. The threshold becomes the seatbelt, not the steering wheel.

If you're building a long-horizon agent today, the cheap experiment is to stop treating compaction as plumbing. Expose it as a tool, give the model a two-line rubric, and keep your old token ceiling as a guardrail. The surprising finding of 2026 is that the agent, asked politely, is a better judge of when to forget than your counter ever was.