For a model that only answers one question at a time, the choice between a Transformer and a state-space model is mostly academic. For an agent — a process that holds a growing transcript of tool calls, retrieved documents, and its own reasoning across hours — it is a hardware bill. And the bill is not paid in benchmark points. It is paid in bytes of memory per token and in how fast the model can emit the next token when the context is already 100,000 tokens deep.
That is the lens that makes "Mamba vs Transformer" decidable. Read it that way and the headline result of the last two years is not the one most people expected.
The two cost curves
A Transformer attends over every previous token, so during decode it caches a key and value vector for each one. That KV cache grows linearly with sequence length — it is the same line item behind why long context degrades and behind every attention variant from MQA to MLA. The concrete number: AI21 reports Mixtral-8x7B needs roughly 32GB of KV cache at 256K tokens. That memory has to be re-read for every token generated, so throughput sags exactly when the context is longest.
A state-space model takes the opposite shape. Mamba (Gu & Dao, 2023) compresses the past into a fixed-size recurrent state and carries no KV cache at all. The state is the same size at token 1,000,000 as at token 10. The paper's claim follows directly: ~5x higher generation throughput than a same-size Transformer, and linear-time scaling to million-token sequences. Mamba-2 (Dao & Gu, 2024), built on a "state space duality" that formally links SSMs and attention, made the core layer 2-8x faster again while enlarging the internal state from 16 to 256 dimensions.
So on the only two axes an agent operator cares about — memory-at-context and decode throughput — the SSM wins cleanly. Which raises the obvious question: why does every frontier lab still ship attention?
What a constant-size state cannot do
Because a fixed state is a lossy summary, and some tasks need a lossless lookup. The failure is sharp and well-documented: pure SSMs are weak at exact in-context recall — retrieving a specific earlier token, copying a string verbatim, and multi-query associative recall (MQAR), the synthetic stress test for "find the value paired with this key." Mamba handles single-query recall but degrades on MQAR at limited model width; weaker SSMs fail outright.
The fix is almost embarrassingly cheap. In published experiments, adding a single self-attention layer to an eight-layer Mamba enabled perfect copying of length-50 strings and generalization to length-100 — a capability the pure stack simply did not have. Attention is the model's random-access memory; the SSM is its running compression. An agent that pastes a tool's JSON output and needs the exact field back later is precisely the recall workload SSMs are bad at, which is why "pure Mamba for agents" was always the wrong framing.
The frontier did not pick a side. It found the ratio — the smallest number of attention layers that buys back exact recall while the rest of the network stays cheap.
The hybrids that actually shipped
This is the non-obvious result. Pure Mamba did not win; hybrids did, and they converged on a strikingly similar recipe: keep roughly one attention layer for every eight to twelve, make everything else a Mamba-2 or linear-attention block.
- NVIDIA Nemotron-H / Nemotron Nano 2 set self-attention to about 8% of layers — 4 of 52 in the 8B, 10 of 118 in the 56B — and report up to 3x inference throughput over same-size pure Transformers like Qwen-2.5 and Llama-3.1.
- IBM Granite 4.0 (October 2025) runs a roughly 9:1 Mamba-2-to-transformer mix with MoE, claiming 70%+ lower memory for long-context and multi-session inference and ~2x faster inference.
- AI21 Jamba uses a 1:7 attention-to-Mamba ratio; the payoff is the headline contrast — 256K tokens in a 4GB KV cache versus Mixtral's ~32GB, an 8x reduction, with 3x the throughput at long context and the model fitting on a single GPU.
- TII Falcon-H1 runs attention and SSM in parallel within each block and concatenates outputs, letting the attention/SSM head ratio be tuned independently. Qwen3-Next keeps ~25% standard attention with the rest Gated DeltaNet, reporting throughput beyond 10x of Qwen3-32B past 32K context.
Five labs, one idea: attention is a scarce, expensive ingredient you sprinkle, not the whole dish.
What it means for an agent builder
Strip the hype and the practical guidance is narrow but firm. If your agent operates at short context and modest batch, a tuned Transformer is fine — and your bigger long-context lever is often retrieval, the RAG-vs-long-context tradeoff, not the architecture. But if it runs long, holds 100k+ token transcripts, and you serve many sessions at once, a hybrid is the bet: memory stays roughly flat as the context grows, decode throughput holds, and the retained attention layers preserve the exact recall your tool outputs depend on.
Note what is not the deciding number. None of these models won on MMLU by a margin that would matter to anyone. They won on the curves — constant memory, sustained throughput — that only show up when the sequence is long and the run is real. For an agent, those are the only curves there are. The right question for 2026 is no longer "Mamba or Transformer." It is "how few attention layers can you get away with?"



