For everyone who serves a model rather than just prompts one, the attention mechanism is not an abstraction — it is a line item. Every architecture choice from Multi-Head Attention onward has been an argument about one scarce resource: the key-value cache, the running memory of every token the model has already seen. The four acronyms in the title are four answers to the same question, and the most recent one quietly changed what the question was.

The cache is the bill

Start with the formula, because the whole story lives inside it. The KV cache, in bytes, is roughly:

2 × layers × kv_heads × head_dim × tokens × batch × bytes_per_element

The leading 2 is for keys and values. Everything else multiplies. Sequence length and batch size are set by your workload; layers and head dimension are mostly set by the model's scale. That leaves kv_heads as the one term a model architect can move freely, with head_dim adjustable only indirectly. So when you want to serve longer context or larger batches without buying more accelerators, kv_heads is the knob within reach.
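As a quick sanity check, the formula can be written out as a small function. This is a minimal sketch: the function name and the example shape (32 layers, head_dim 128, 8192 tokens, a 16-bit cache) are illustrative assumptions, not figures from any particular model card.

```python
def kv_cache_bytes(layers, kv_heads, head_dim, tokens, batch=1, bytes_per_element=2):
    """KV cache size in bytes, term for term as in the formula above."""
    return 2 * layers * kv_heads * head_dim * tokens * batch * bytes_per_element

# Hypothetical shape: 32 layers, head_dim 128, 8192 tokens, batch 1, 16-bit cache.
full_heads = kv_cache_bytes(layers=32, kv_heads=32, head_dim=128, tokens=8192)
fewer_heads = kv_cache_bytes(layers=32, kv_heads=8, head_dim=128, tokens=8192)

print(full_heads / 1024**3)   # 4.0 (GiB)
print(fewer_heads / 1024**3)  # 1.0 (GiB)
```

Because every term multiplies, cutting kv_heads by four cuts the cache by exactly four, whatever the other terms are.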

This is why the cache, not the weights, is so often the thing that runs out first. During decode, each new token forces a re-read of every past token's K and V from memory — which is the deep reason LLM inference has two speeds, a compute-bound prefill followed by a memory-bound crawl. Multi-Head Attention (Vaswani et al., 2017) sets kv_heads equal to the query head count, so it is the most expensive possible setting of that knob. GPT-3 and the original Transformer used it. It is the baseline that everything since has tried to undercut.
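A back-of-the-envelope sketch makes the memory-bound crawl concrete. It assumes a hypothetical accelerator with 2 TB/s of memory bandwidth and counts only the KV cache read, ignoring weight reads and compute, so the result is a floor on step time rather than a prediction.

```python
def min_decode_step_seconds(cache_bytes, bandwidth_bytes_per_s):
    """Lower bound on one decode step: time to stream the whole KV cache once."""
    return cache_bytes / bandwidth_bytes_per_s

cache_bytes = 4 * 1024**3   # e.g. a 4 GiB cache
bandwidth = 2e12            # hypothetical: 2 TB/s memory bandwidth

step_seconds = min_decode_step_seconds(cache_bytes, bandwidth)
print(f"{step_seconds * 1e3:.2f} ms per decode step, at best")  # about 2.15 ms
```

The floor scales linearly with cache size, which is why shrinking the cache speeds up decode as well as freeing memory.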

Sharing: MQA and GQA

The first cut was blunt. In 2019 Noam Shazeer's "Fast Transformer Decoding: One Write-Head is All You Need" introduced Multi-Query Attention: keep all the query heads, but give them a single shared K and V head. Set kv_heads = 1 in the formula and the cache shrinks by the full head-count factor. PaLM, Falcon-7B, and StarCoder took the deal. The catch was right there in the paper: "minor quality degradation" is the author's phrase, and in practice MQA's collapse to one KV head is enough to show up on benchmarks and to destabilize some training runs.

Grouped-Query Attention (Ainslie et al., 2023) is the compromise that stuck. Instead of one shared KV head or one per query head, you pick an intermediate number — query heads are partitioned into groups, and each group shares a KV head. Llama-3-8B uses 32 query heads and 8 KV heads, so four queries share each KV pair and the cache is one-quarter the MHA size. The paper's other contribution was logistical and underrated: you can uptrain an existing MHA checkpoint into GQA using about 5% of the original pretraining compute, which is why GQA spread so fast. Llama 2 70B, all of Llama 3, Mistral, and Qwen2 ship it.
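The grouping itself is only a few lines. What follows is a minimal NumPy sketch of causal grouped-query attention, not any library's implementation. The function name and the toy head_dim of 16 are assumptions for illustration. Passing a single KV head gives MQA, and passing as many KV heads as query heads gives MHA.

```python
import numpy as np

def grouped_query_attention(q, k, v):
    """Causal attention. q: (n_q_heads, T, d); k, v: (n_kv_heads, T, d)."""
    n_q_heads, T, d = q.shape
    n_kv_heads = k.shape[0]
    assert n_q_heads % n_kv_heads == 0, "query heads must divide evenly into groups"
    group = n_q_heads // n_kv_heads

    # Each cached KV head is reused by `group` consecutive query heads.
    k = np.repeat(k, group, axis=0)                  # (n_q_heads, T, d)
    v = np.repeat(v, group, axis=0)

    scores = q @ k.transpose(0, 2, 1) / np.sqrt(d)   # (n_q_heads, T, T)
    future = np.triu(np.ones((T, T), dtype=bool), k=1)
    scores = np.where(future, -np.inf, scores)       # mask out future tokens

    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ v                               # (n_q_heads, T, d)

rng = np.random.default_rng(0)
q = rng.standard_normal((32, 5, 16))   # 32 query heads
k = rng.standard_normal((8, 5, 16))    # 8 KV heads: this is all that gets cached
v = rng.standard_normal((8, 5, 16))

out = grouped_query_attention(q, k, v)
out_mqa = grouped_query_attention(q, k[:1], v[:1])   # MQA: one shared KV head
print(out.shape, out_mqa.shape)
```

Only the un-repeated k and v need to live in the cache; the repeat happens at attention time, and optimized kernels typically avoid materializing it at all.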

MQA and GQA both pay for memory with the same currency — representational capacity. Fewer KV heads is, by construction, less of the model.

That sentence is the whole limitation. Sharing is lossy on purpose. GQA picks a point on the curve where the loss is small; it does not escape the curve.


Compression: MLA

Here is the non-obvious move. Multi-head Latent Attention, introduced in DeepSeek-V2 (2024) and carried into DeepSeek-V3, does not share KV heads at all. It compresses them. The attention input is projected down into a single low-rank latent vector with a small dimension d_c, and that latent — not the full per-head K and V — is what gets cached. At attention time, the cached latent is projected back up into distinct keys and values for every query head.
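The shape of that idea fits in a few lines of NumPy. This is a toy sketch of the compress-then-expand path only, with random matrices standing in for learned weights. The names W_down, W_up_k, and W_up_v and all the sizes are hypothetical, and the rotary-embedding handling is left out.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, n_heads, head_dim, d_c = 64, 4, 16, 8   # toy sizes, all hypothetical

W_down = rng.standard_normal((d_model, d_c))             # shared down-projection
W_up_k = rng.standard_normal((n_heads, d_c, head_dim))   # per-head key up-projection
W_up_v = rng.standard_normal((n_heads, d_c, head_dim))   # per-head value up-projection

h = rng.standard_normal((10, d_model))   # attention input for 10 tokens

# The latent is the only thing stored per token.
latent_cache = h @ W_down                # (10, d_c)

# At attention time, rebuild distinct keys and values for every head.
k = np.einsum("tc,hcd->htd", latent_cache, W_up_k)   # (n_heads, 10, head_dim)
v = np.einsum("tc,hcd->htd", latent_cache, W_up_v)

cached_per_token = latent_cache.shape[1]    # d_c = 8 numbers
mha_per_token = 2 * n_heads * head_dim      # 128 numbers for full per-head K and V
print(cached_per_token, mha_per_token)
```

Every head still gets its own keys and values, because each head has its own up-projection; what the heads share is the small latent they are rebuilt from.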

The difference is structural, not incremental. Sharing throws capacity away to save memory, so memory and quality are chained together: spend less and you get less. Compression breaks that chain. You cache a small d_c-dimensional vector, but you still reconstruct full, per-head K and V, so the memory saving is decoupled from the head count. DeepSeek reported MLA cutting the KV cache by 93.3% relative to their dense 67B model, while matching or beating MHA quality rather than trading against it. The concrete per-token numbers are stark: a later analysis put DeepSeek-V3's MLA at roughly 70 KB per token, against about 516 KB for Llama-3.1 405B and 327 KB for Qwen-2.5 72B, both GQA models.
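Those per-token figures can be reproduced from model shapes. The sketch below assumes a 16-bit cache and uses layer counts and dimensions as I understand them from the public model configs (126 layers and 8 KV heads of dimension 128 for Llama-3.1 405B; 80 layers and 8 KV heads of dimension 128 for Qwen-2.5 72B; 61 layers, a 512-dimensional latent, and a 64-dimensional shared rotary key for DeepSeek-V3). Treat these as assumptions to check against the configs themselves; the function names are mine.

```python
def gqa_bytes_per_token(layers, kv_heads, head_dim, bytes_per_element=2):
    """Per-token cache for MHA/GQA/MQA: keys and values for every KV head."""
    return 2 * layers * kv_heads * head_dim * bytes_per_element

def mla_bytes_per_token(layers, d_c, d_rope, bytes_per_element=2):
    """Per-token cache for MLA: one latent plus one shared rotary key per layer.
    No factor of 2, since K and V are both rebuilt from the same latent."""
    return layers * (d_c + d_rope) * bytes_per_element

llama_405b = gqa_bytes_per_token(layers=126, kv_heads=8, head_dim=128)
qwen_72b = gqa_bytes_per_token(layers=80, kv_heads=8, head_dim=128)
deepseek_v3 = mla_bytes_per_token(layers=61, d_c=512, d_rope=64)

print(llama_405b, qwen_72b, deepseek_v3)   # 516096 327680 70272
```

Under these assumptions the arithmetic lands on the quoted 516 KB, 327 KB, and 70 KB.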

There is one genuine wrinkle, and it is worth knowing because it explains an odd-looking design. Rotary position embeddings (RoPE) don't compose cleanly with the low-rank compression — the position-dependent rotation can't be absorbed into the up-projection matrices the way the rest of the math can. DeepSeek's fix is decoupled RoPE: a small set of extra query dimensions and a shared key that carry the rotary signal separately, sitting alongside the compressed latent. It is a patch, but a cheap one, and it is the price of admission for compressing K and V at all.
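A toy example shows both halves of that argument. Without rotation, the key up-projection can be folded into the query once, so attention scores are computed directly against the cached latent. With a position-dependent rotation in between, the matrix sitting between query and latent changes with the relative position, so no single folded matrix exists. This is a sketch with a single 2-D rotation standing in for RoPE (the real thing uses many frequency pairs), and all names and sizes are hypothetical.

```python
import numpy as np

def rot(theta):
    """2-D rotation matrix, a one-frequency stand-in for RoPE."""
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, -s], [s, c]])

rng = np.random.default_rng(0)
d_c, head_dim = 4, 2
W_up_k = rng.standard_normal((d_c, head_dim))   # key up-projection (hypothetical)
q = rng.standard_normal(head_dim)               # one query vector
c_kv = rng.standard_normal(d_c)                 # one cached latent

# No rotation: fold W_up_k into the query and score against the latent directly.
score_explicit = q @ (c_kv @ W_up_k)
q_absorbed = W_up_k @ q
score_absorbed = q_absorbed @ c_kv

# With rotation: score = q^T R(pos_q)^T R(pos_k) W_up_k^T c_kv.
def effective_matrix(pos_q, pos_k):
    return rot(pos_q).T @ rot(pos_k) @ W_up_k.T   # (head_dim, d_c)

m_near = effective_matrix(5, 4)    # key one position back
m_far = effective_matrix(5, 1)     # key four positions back
m_shift = effective_matrix(9, 8)   # also one position back

score_rotated = (rot(5) @ q) @ (rot(4) @ (c_kv @ W_up_k))
print(np.isclose(score_explicit, score_absorbed))   # the fold works without rotation
print(np.allclose(m_near, m_far))                   # but the matrix varies with distance
```

The effective matrix depends only on the relative offset, but it still differs from one offset to the next, which is why the rotary signal is carried on separate dimensions instead of through the latent.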

Where this leaves you

The practical reading is short. If you are choosing an open-weights model today, GQA is the safe, well-understood default and the reason a modern 8B serves at long context without melting. If you are watching where the frontier is moving, it is moving toward compression: MLA is the first widely deployed attention variant that does not ask you to trade quality for cache, and that is why it spread from V2 to V3 and into the architectures chasing them.

It also reframes the adjacent optimizations. MLA shrinks the shape of the cache; KV cache quantization shrinks the bytes per element in the same formula, and the two compose. And because MLA's gains depend on a custom cached representation, your serving stack matters more than usual — a runtime has to implement the latent path to realize the saving, not just load the weights. The acronym you pick is, in the end, a statement about which term of that one formula you've decided to attack.