There is a rule of thumb in LLM serving that has held for two years: a longer prompt is more expensive than a shorter one. On a Mamba-hybrid model it can be exactly backwards. A 552-token prompt can serve at more than double the throughput of a 479-token one — same model, same hardware, same request shape — because of where a single cache boundary happens to fall. The number behind it is 528, and it's worth understanding before you put a hybrid model on a hot path.

Prefix caching assumes something Mamba doesn't provide

Prefix caching — reusing the compute for a prompt prefix that many requests share — is the highest-leverage trick in modern serving. It's the mechanism behind RadixAttention, and it's why a system prompt or a shared RAG preamble is nearly free after the first request. It works because a transformer leaves a key/value entry for every token. Two requests that share the first 400 tokens share 400 KV entries; the second request just points at them.

A state-space layer breaks that assumption. Mamba doesn't store per-token keys and values — it folds the entire prefix into one recurrent state vector and carries it forward. There is no per-token artifact to reuse. The only thing you can cache is a checkpoint of the state at some boundary, which is a fundamentally coarser unit than a KV block. So on a hybrid model — attention layers interleaved with Mamba layers — the cache manager has to reconcile two granularities: fine per-token KV for the attention layers, and coarse state checkpoints for the Mamba ones.
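
To make the two granularities concrete, here is a toy sketch. The scalar recurrence is only a stand-in for a state-space update, not Mamba's actual rule, and the names `attention_cache`, `scan`, and `BOUNDARY` are hypothetical. It shows that an attention-style cache has one reusable entry per token, while the recurrent path can only be resumed from a state saved at a chosen boundary.

```python
# Toy contrast between per-token KV entries and a single recurrent state.
# Assumption: the scalar recurrence below is illustrative, not Mamba's update.

def attention_cache(tokens):
    """One (key, value) pair per token; any prefix of the list is reusable."""
    return [(float(t), float(t) * 2.0) for t in tokens]

def scan(tokens, state=0.0, decay=0.9):
    """Fold tokens into one running state: h_t = decay * h_{t-1} + x_t."""
    for x in tokens:
        state = decay * state + float(x)
    return state

BOUNDARY = 528  # checkpoint position, chosen to match the block size discussed here
prompt = [i % 7 for i in range(600)]

# Attention side: 600 entries, so a 400-token shared prefix is just a slice.
kv_entries = attention_cache(prompt)
shared_prefix = kv_entries[:400]

# Recurrent side: the only reusable artifact is the state saved at BOUNDARY.
checkpoint = scan(prompt[:BOUNDARY])
resumed = scan(prompt[BOUNDARY:], state=checkpoint)
full = scan(prompt)

print(len(kv_entries), len(shared_prefix))
print(resumed == full)
```

Resuming from the checkpoint reproduces the full scan because it performs the same operations in the same order. What the recurrent side cannot do is resume from token 400: no state was saved there, so there is nothing to point at.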

The 528-token block, and the cliff it creates

vLLM reconciles them by making the coarse one win. On a Mamba-hybrid like Qwen3.5, it sets the attention block size to 528 tokens so the attention page is at least as large as the Mamba page and the two managers stay aligned. That's a reasonable engineering choice with a sharp edge: prefix caching only reuses fully completed blocks. A prompt that doesn't fill one 528-token block completes zero blocks, so it gets a ~0% cache hit rate and is recomputed from scratch.
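
The block arithmetic can be sketched in a few lines. This is a simplified model, not vLLM's code: the helper names are hypothetical, it assumes the entire prompt is shared with an earlier request, and the block size of 16 is included only as a contrasting small value.

```python
# Simplified model of "only fully completed blocks are reusable".
# Assumptions: helper names are hypothetical; the whole prompt is a shared prefix.

def reusable_tokens(shared_prefix_len: int, block_size: int = 528) -> int:
    """Tokens covered by fully completed blocks; the partial tail is recomputed."""
    return (shared_prefix_len // block_size) * block_size

def best_case_hit_rate(prompt_len: int, block_size: int = 528) -> float:
    """Upper bound on the cached fraction of a prompt that repeats exactly."""
    if prompt_len <= 0:
        raise ValueError("prompt_len must be positive")
    return reusable_tokens(prompt_len, block_size) / prompt_len

for length in (479, 552):
    print(length, reusable_tokens(length), round(best_case_hit_rate(length), 3))

# Same 479-token prompt with a small block size, for contrast.
print(reusable_tokens(479, block_size=16))
```

Under these assumptions the 479-token prompt completes no 528-token block and reuses nothing, while the 552-token prompt reuses its first 528 tokens. With a 16-token block the shorter prompt would reuse 464 of its 479 tokens, which is why the cliff is specific to the large block.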

The measured hit rates, from vLLM issue #40696 on Qwen3.5-4B, trace the cliff precisely:

The throughput consequence is not a gentle slope. The same report notes QPS dropping from ~200 to under 100 when the prompt shrank from ~560 to ~480 tokens — a shorter prompt running at half the rate, purely because it fell below the block boundary.

On a transformer, cost rises with prompt length. On a Mamba-hybrid, cost can drop sharply once the prompt grows past 528 tokens: the cache boundary, not the token count, sets the price.

Why this lands where hybrids were supposed to win

The cruel part is which workloads it hits. Long-context chat and document RAG sit comfortably above 528 tokens and cache fine. The prompts that fall off the cliff are the short, high-QPS ones — intent routing, classification, tool selection, guardrail checks — the exact latency-sensitive traffic where a lean hybrid model was supposed to be the cheap, fast choice. You adopt a hybrid to save money on a firehose of small requests, and the caching layer quietly hands you a 0% hit rate on all of them.

The fix is decoupling, not a bigger cache

Throwing more cache memory at this doesn't help: the problem is alignment, not capacity. The real fixes separate the attention block size from the Mamba state alignment so short prompts can cache again. vLLM is building a Hybrid KV Cache Manager with "all" and "align" prefix-caching modes; its tracking issue #26201 has been open since October 2025 and now spans Mamba1/2, ShortConv, LinearAttention, and GatedDeltaNet. SGLang took the offload route: v0.5.13 and v0.5.14 made HiCache the default for hybrid models through its UnifiedTree and added an int8 checkpoint pool for the Mamba radix cache (June 26, 2026), storing recurrent states compactly so more of them fit.

The practical takeaway until those land everywhere: if you're serving a Mamba-hybrid, profile your prompt-length distribution against the block size. A histogram that clusters just under 528 tokens is a throughput problem disguised as a model choice, and it won't show up in a long-context benchmark, only on your short-prompt traffic. If you also run a self-hosted engine comparison, this is a dimension the standard benchmarks don't measure: how each one caches the architecture you actually deployed.
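
A minimal sketch of that profiling step, assuming you can export prompt token counts from your request logs. The function name, the returned keys, and the sample lengths are all hypothetical, and the hit-rate figure is a best case that treats every prompt as fully shared with an earlier one.

```python
# Profile prompt lengths against a cache block size.
# Assumptions: names and sample data are illustrative; every prompt is treated
# as fully shared, so the hit rate reported here is an upper bound.

def profile_lengths(lengths, block_size=528):
    """Summarize how much of a prompt-length sample can complete a block."""
    if not lengths:
        raise ValueError("lengths must be non-empty")
    below = sum(1 for n in lengths if n < block_size)
    cacheable = sum((n // block_size) * block_size for n in lengths)
    total = sum(lengths)
    return {
        "requests": len(lengths),
        "frac_below_block": below / len(lengths),
        "cacheable_tokens": cacheable,
        "best_case_token_hit_rate": cacheable / total,
    }

sample_lengths = [120, 300, 479, 500, 552, 1200]  # made-up sample
report = profile_lengths(sample_lengths)
for key, value in report.items():
    print(key, value)
```

If `frac_below_block` is high on your short-prompt traffic, most of those requests cannot complete a single block, whatever the long-context numbers say.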