Every token a model generates, it generates against its whole past — and to avoid recomputing that past at every step, it keeps the keys and values of every prior token in the KV cache. For a short chat this is a rounding error. For a long context it is the dominant line on the memory bill: the vLLM team's PagedAttention paper measured a 13B model on a 40GB A100 spending about 65% of memory on static weights and roughly 30% on the dynamic KV state — and that 30% climbs with every extra token and every concurrent request. There are a few ways to shrink it. You can store each entry in fewer bits (quantization), share it across heads (MQA, GQA, MLA), or move the cold part off-GPU (offloading). Or you can throw some of it away. That last one is eviction, and three of the four methods everyone benchmarks are flavors of it.
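To put the memory bill above in concrete terms, here is a back-of-envelope sketch of how KV state scales. The function names are hypothetical, and the model shape (40 layers, hidden size 5120, 16-bit values, full multi-head attention) is an assumption chosen to resemble a 13B-class model, not a figure taken from any particular deployment.

```python
def kv_bytes_per_token(num_layers: int, hidden_size: int, bytes_per_value: int = 2) -> int:
    """Bytes of KV cache one token occupies: one key and one value vector per layer."""
    return 2 * num_layers * hidden_size * bytes_per_value


def kv_cache_gib(seq_len: int, batch_size: int, per_token: int) -> float:
    """Total KV cache in GiB for batch_size sequences of seq_len tokens each."""
    return seq_len * batch_size * per_token / 2**30


# Assumed shape for a 13B-class model (illustrative, see the note above).
per_token = kv_bytes_per_token(num_layers=40, hidden_size=5120)
print(per_token // 1024, "KiB per token")  # 800 KiB per token

for seq_len in (2_048, 32_768):
    total = kv_cache_gib(seq_len, batch_size=8, per_token=per_token)
    print(f"{seq_len} tokens x 8 requests: {total:.1f} GiB")
# 2048 tokens x 8 requests: 12.5 GiB
# 32768 tokens x 8 requests: 200.0 GiB
```

Under these assumptions, eight concurrent 32K-token requests would need several times the capacity of a single 40GB card for KV state alone, which is why every technique in this post exists.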
The three evictors
StreamingLLM (ICLR 2024) starts from a strange observation: the first few tokens of any sequence soak up a wildly disproportionate share of attention regardless of what they say. Call them attention sinks. A naive sliding window collapses the moment those sink tokens scroll out of view. StreamingLLM just pins them: keep ~4 sink tokens plus a rolling recent window. With that change, a model trained on a finite context window streams stably, with the paper reporting stability to roughly 4M tokens. The catch is right there in the design: everything between the sinks and the recent window is gone. It does not extend what the model can recall; it extends how long it can run.
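The retention rule is simple enough to sketch in a few lines. This is a minimal illustration of "sinks plus recent window", not the paper's implementation: `streaming_keep` and its defaults are hypothetical names and values, and it only computes which positions survive, leaving out how the real system handles positional encodings inside the cache.

```python
def streaming_keep(cache_len: int, num_sinks: int = 4, window: int = 8) -> list[int]:
    """Positions retained by a sinks-plus-recent-window policy."""
    if cache_len <= num_sinks + window:
        # Nothing to drop yet: the whole cache fits in the budget.
        return list(range(cache_len))
    sinks = list(range(num_sinks))                          # pinned first tokens
    recent = list(range(cache_len - window, cache_len))     # rolling tail
    return sinks + recent


print(streaming_keep(20))
# [0, 1, 2, 3, 12, 13, 14, 15, 16, 17, 18, 19]
```

Positions 4 through 11 are simply absent from the output, which is the "gone" in the paragraph above: the cache size is constant no matter how long the stream runs, and so is the amount the model can look back on.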
H2O (NeurIPS 2023) is smarter about what it keeps. It notices that a small set of "heavy hitter" tokens accounts for most of the attention mass, and maintains a fixed budget of recent tokens plus those heavy hitters, evicting the lowest accumulated-attention token at each step. With roughly a 20% cache budget, it reports throughput gains of up to 29x over the inference systems it compares against. But the score that decides a token's fate is its past attention, summed over queries that have already happened.
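A toy version of that bookkeeping makes the dependence on past queries visible. This is a sketch under stated assumptions, not H2O's code: `h2o_decode_step` is a hypothetical name, the cache is modeled as a dict from position to accumulated attention for a single head, and the attention weights are made up.

```python
from typing import Optional


def h2o_decode_step(acc_scores: dict[int, float], attn_row: dict[int, float],
                    new_pos: int, budget: int, num_recent: int) -> Optional[int]:
    """Accumulate this step's attention, admit the new token, evict if over budget.

    acc_scores maps cached position -> attention mass summed over past queries.
    Returns the evicted position, or None if the cache is still within budget.
    """
    if not 0 <= num_recent < budget:
        raise ValueError("need 0 <= num_recent < budget")
    for pos, weight in attn_row.items():
        if pos in acc_scores:
            acc_scores[pos] += weight        # score is built from past queries only
    acc_scores[new_pos] = 0.0                # new token has received no attention yet
    if len(acc_scores) <= budget:
        return None
    positions = sorted(acc_scores)
    protected = set(positions[len(positions) - num_recent:])   # recent window
    candidates = [p for p in positions if p not in protected]
    victim = min(candidates, key=lambda p: acc_scores[p])      # lowest heavy-hitter score
    del acc_scores[victim]
    return victim


acc = {0: 0.9, 1: 0.1, 2: 0.4, 3: 0.2}
evicted = h2o_decode_step(acc, {0: 0.5, 1: 0.05, 2: 0.3, 3: 0.15},
                          new_pos=4, budget=4, num_recent=2)
print(evicted, sorted(acc))  # 1 [0, 2, 3, 4]
```

Position 1 is deleted because nothing has attended to it so far. Nothing in the function can know whether the next query would have.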
SnapKV (NeurIPS 2024) compresses the prompt rather than the generation. At the end of prefill it uses a small observation window at the prompt's tail to vote, per attention head, on which earlier prompt positions matter, and discards the rest before decoding begins. It is excellent for the long-prompt, short-answer shape — feed it a giant document, ask one question — and it is, by construction, a one-time bet made before the model has written a single token of its answer.
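The voting step can be sketched for a single head as follows. This is a simplified illustration rather than SnapKV's implementation: `snapkv_select` is a hypothetical name, the attention weights are invented, and the sketch ranks positions by raw vote totals, leaving out the smoothing over neighboring positions that the paper applies before selection.

```python
def snapkv_select(window_attn: list[list[float]], prompt_len: int, keep: int) -> list[int]:
    """Prompt positions one head retains after prefill.

    window_attn[i][j] is the attention weight from the i-th observation-window
    query to prefix position j (a position before the observation window).
    """
    window = len(window_attn)
    prefix_len = prompt_len - window
    # Each observation-window query votes with its attention weights.
    votes = [sum(row[j] for row in window_attn) for j in range(prefix_len)]
    top = sorted(range(prefix_len), key=lambda j: votes[j], reverse=True)[:keep]
    # Keep the winners plus the observation window itself; drop everything else.
    return sorted(top) + list(range(prefix_len, prompt_len))


# Two observation queries voting over a six-token prefix (made-up weights).
window_attn = [
    [0.05, 0.40, 0.05, 0.05, 0.30, 0.05],
    [0.10, 0.35, 0.05, 0.05, 0.25, 0.10],
]
print(snapkv_select(window_attn, prompt_len=8, keep=2))  # [1, 4, 6, 7]
```

The function runs once, at the end of prefill. Positions 0, 2, 3 and 5 are discarded before the first answer token exists, so whatever the decode phase later turns out to need has to be among the survivors.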
Eviction is a bet that you already know which tokens you will never need again. A long-running agent is a machine for losing that bet.
Quest, and the fork that actually matters
The real division here isn't between the three eviction policies. It's between evicting and selecting. Quest (ICML 2024) refuses to delete anything. It keeps the full KV cache resident, splits it into pages, stores each page's min/max key vectors, and at every decode step uses the current query to estimate which pages could possibly matter — then loads only the top-K of them into attention. It reports up to a 7x self-attention speedup and a 2.2x end-to-end latency reduction with negligible accuracy loss on long-dependency tasks.
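The page-scoring idea can be sketched in plain Python. This is a minimal illustration of min/max page metadata and top-K selection, not Quest's kernels: the function names are hypothetical, the keys are toy 2-dimensional vectors, and a single head is assumed. The score for a page is an upper bound on the dot product between the query and any key stored in that page.

```python
def page_bounds(keys: list[list[float]], page_size: int) -> list[tuple[list[float], list[float]]]:
    """Per-page elementwise (min, max) over the key vectors in that page."""
    pages = []
    for start in range(0, len(keys), page_size):
        page = keys[start:start + page_size]
        mins = [min(col) for col in zip(*page)]
        maxs = [max(col) for col in zip(*page)]
        pages.append((mins, maxs))
    return pages


def page_score(query: list[float], mins: list[float], maxs: list[float]) -> float:
    """Upper bound on q . k for any key k in the page, taken channel by channel."""
    return sum(max(q * lo, q * hi) for q, lo, hi in zip(query, mins, maxs))


def select_pages(query: list[float], pages, top_k: int) -> list[int]:
    """Indices of the top_k pages by upper-bound score; nothing is deleted."""
    scores = [page_score(query, lo, hi) for lo, hi in pages]
    ranked = sorted(range(len(pages)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:top_k])


keys = [[1.0, 0.0], [0.8, 0.2],      # page 0
        [0.0, 1.0], [0.1, 0.9],      # page 1
        [-1.0, 0.0], [-0.9, -0.1]]   # page 2
pages = page_bounds(keys, page_size=2)

print(select_pages([0.0, 1.0], pages, top_k=1))   # [1]
print(select_pages([-1.0, 0.0], pages, top_k=1))  # [2]
```

The same cache yields a different page for each query, and the page that lost this round is still there for the next one. The metadata is only two vectors per page, so ranking is far cheaper than attending over every key.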
Quest saves bandwidth, not memory — the whole cache still sits in VRAM. So why bother, if eviction is strictly cheaper? Because of the one sentence that indicts all three evictors: in Quest's framing, a token's criticality highly depends on the query. H2O ranks tokens by the attention they have already received. SnapKV ranks them by a window fixed before generation starts. Both are guessing which tokens a future query will want, using only the queries that have already arrived. When the guess is wrong, the token isn't down-weighted — it's deleted. Selection keeps it around to be re-judged when the query that needs it finally shows up.
Why this is an agent problem, not just a serving problem
For a chatbot answering one question over a long document, eviction's recency bias is mostly fine; the relevant context is usually near the question. For a long-running agent, it is close to a worst case. The tokens an eviction policy discards first — the system prompt, the tool schemas, the original task spec, the decision made forty steps ago — are precisely the tokens the agent loops back to hundreds of turns later. The policy optimized for "what was attended to recently" silently amputates the instructions the agent will need to finish the job, and the failure doesn't look like an out-of-memory error. It looks like the agent quietly getting dumber the longer it runs — forgetting a constraint, re-deriving a fact, drifting off its own plan.
The 2026 literature has converged on this exact diagnosis. CompressKV argues that flat, all-head eviction "evicts critical tokens and degrades performance" and scores tokens by specialized retrieval heads instead. DefensiveKV reports bluntly that "unprotected eviction can destroy retrieval performance." IntentKV builds a cross-turn cache specifically for agentic, multi-turn inference. Three different papers, one shared lesson: the cheap win of deleting tokens has an expensive, invisible tail.
So the practical question isn't "which eviction policy." It's "evict at all?" Diagnose your bottleneck first: if you're truly memory-bound, quantize or offload before you delete, because those are recoverable and eviction is not. If you're bandwidth-bound on decode, selection buys you the speed without the amnesia. And if you're building something that revisits its own history — which is to say, an agent — treat permanent eviction as a last resort and test it where it actually fails: on multi-turn, full-recall work, not on summarization. The cheapest token is the one you didn't have to recompute. The most expensive one is the one you threw away and turned out to need.



