For a model that only answers one question at a time, the choice between a Transformer and a state-space model is mostly academic. For an agent — a process that holds a growing transcript of tool calls, retrieved documents, and its own reasoning across hours — it is a hardware bill. And the bill is not paid in benchmark points. It is paid in bytes of memory per token and in how fast the model can emit the next token when the context is already 100,000 tokens deep.

That is the lens that makes "Mamba vs Transformer" decidable. Read it that way and the headline result of the last two years is not the one most people expected.

The two cost curves

A Transformer attends over every previous token, so during decode it caches a key and value vector for each one. That KV cache grows linearly with sequence length — it is the same line item behind why long context degrades and behind every attention variant from MQA to MLA. The concrete number: AI21 reports Mixtral-8x7B needs roughly 32GB of KV cache at 256K tokens. That memory has to be re-read for every token generated, so throughput sags exactly when the context is longest.

A state-space model takes the opposite shape. Mamba (Gu & Dao, 2023) compresses the past into a fixed-size recurrent state and carries no KV cache at all. The state is the same size at token 1,000,000 as at token 10. The paper's claim follows directly: ~5x higher generation throughput than a same-size Transformer, and linear-time scaling to million-token sequences. Mamba-2 (Dao & Gu, 2024), built on a "state space duality" that formally links SSMs and attention, made the core layer 2-8x faster again while enlarging the internal state from 16 to 256 dimensions.

So on the only two axes an agent operator cares about — memory-at-context and decode throughput — the SSM wins cleanly. Which raises the obvious question: why does every frontier lab still ship attention?

What a constant-size state cannot do

Because a fixed state is a lossy summary, and some tasks need a lossless lookup. The failure is sharp and well-documented: pure SSMs are weak at exact in-context recall — retrieving a specific earlier token, copying a string verbatim, and multi-query associative recall (MQAR), the synthetic stress test for "find the value paired with this key." Mamba handles single-query recall but degrades on MQAR at limited model width; weaker SSMs fail outright.

The fix is almost embarrassingly cheap. In published experiments, adding a single self-attention layer to an eight-layer Mamba enabled perfect copying of length-50 strings and generalization to length-100 — a capability the pure stack simply did not have. Attention is the model's random-access memory; the SSM is its running compression. An agent that pastes a tool's JSON output and needs the exact field back later is precisely the recall workload SSMs are bad at, which is why "pure Mamba for agents" was always the wrong framing.

The frontier did not pick a side. It found the ratio — the smallest number of attention layers that buys back exact recall while the rest of the network stays cheap.

The hybrids that actually shipped

This is the non-obvious result. Pure Mamba did not win; hybrids did, and they converged on a strikingly similar recipe: keep roughly one attention layer for every eight to twelve, make everything else a Mamba-2 or linear-attention block.

Five labs, one idea: attention is a scarce, expensive ingredient you sprinkle, not the whole dish.

What it means for an agent builder

Strip the hype and the practical guidance is narrow but firm. If your agent operates at short context and modest batch, a tuned Transformer is fine — and your bigger long-context lever is often retrieval, the RAG-vs-long-context tradeoff, not the architecture. But if it runs long, holds 100k+ token transcripts, and you serve many sessions at once, a hybrid is the bet: memory stays roughly flat as the context grows, decode throughput holds, and the retained attention layers preserve the exact recall your tool outputs depend on.

Note what is not the deciding number. None of these models won on MMLU by a margin that would matter to anyone. They won on the curves — constant memory, sustained throughput — that only show up when the sequence is long and the run is real. For an agent, those are the only curves there are. The right question for 2026 is no longer "Mamba or Transformer." It is "how few attention layers can you get away with?"