For four years the most load-bearing assumption in language-model serving was that generation is sequential. A model emits one token, conditions on it, emits the next; the latency of a long answer is the length of the answer times the cost of a step, and you cannot start token five hundred until you have token four hundred and ninety-nine. Diffusion language models propose to break that assumption entirely — generate the whole sequence at once and refine it — and the obvious conclusion is that they must therefore be faster. The obvious conclusion was wrong, and the story of why is more interesting than the pitch.

Two ways to write a sentence

An autoregressive model (AR) — every GPT, Llama, Claude, and Qwen you have used — generates left to right under a causal mask: each token attends only to what came before it. That constraint is also a gift, and we will get to why.

A diffusion LLM (dLLM) works like image diffusion adapted to discrete tokens. It starts from a sequence that is entirely masked — placeholders all the way down — and over a series of denoising steps it unmasks tokens, predicting many positions at once, each conditioned on context from both sides via bidirectional attention. LLaDA, the 8B open model that made the approach credible in early 2025, is "a Transformer as the mask predictor" with no causal mask, trained from scratch and reported competitive with LLaMA3-8B on standard benchmarks. The mental image the vendors like: instead of writing a sentence word by word, you sketch the whole thing blurry and sharpen it until it reads.
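The unmasking loop described above can be sketched in a few lines. This is a toy, not LLaDA's actual sampler: `toy_predict` is a hypothetical stand-in for the bidirectional mask predictor, and the "commit the most confident positions" rule and the `tokens_per_step` parameter are assumptions chosen for illustration.

```python
from typing import Callable, List, Optional, Tuple

MASK = None  # placeholder for a position that has not been unmasked yet

# Hypothetical predictor type: given the current partial sequence, return a
# (token, confidence) guess for every position. A real dLLM would run a
# bidirectional Transformer here.
Predictor = Callable[[List[Optional[str]]], List[Tuple[str, float]]]


def diffusion_decode(predict: Predictor, length: int, tokens_per_step: int):
    """Start fully masked; each step, commit the most confident masked positions."""
    seq: List[Optional[str]] = [MASK] * length
    steps = 0
    while any(tok is MASK for tok in seq):
        guesses = predict(seq)  # one pass over the whole sequence
        masked = [i for i, tok in enumerate(seq) if tok is MASK]
        # Highest-confidence masked positions first; the rest stay masked.
        masked.sort(key=lambda i: guesses[i][1], reverse=True)
        for i in masked[:tokens_per_step]:
            seq[i] = guesses[i][0]
        steps += 1
    return seq, steps


def toy_predict(seq):
    """Stand-in predictor: knows the target, more confident next to revealed tokens."""
    target = ["the", "cat", "sat", "on", "the", "mat"]
    out = []
    for i, tok in enumerate(target):
        neighbours = [j for j in (i - 1, i + 1) if 0 <= j < len(seq)]
        revealed = sum(seq[j] is not MASK for j in neighbours)
        out.append((tok, 0.5 + 0.2 * revealed - 0.01 * i))
    return out


text, steps = diffusion_decode(toy_predict, length=6, tokens_per_step=2)
print(" ".join(text), "in", steps, "steps")
```

With six positions and two tokens committed per step, the loop finishes in three passes. Raising `tokens_per_step` to six finishes in one, which is the whole appeal.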

The promise, and the benchmark that refused to cooperate

If you generate every token in parallel, a response should take a fixed number of denoising steps regardless of length — call it constant-ish time instead of linear. So early dLLMs should have crushed AR on throughput.

They didn't. The uncomfortable, well-documented fact is that the first open diffusion models — LLaDA, Dream — were frequently slower than autoregressive models of similar quality. The D2F paper states it plainly: dLLMs "suffer from slower inference than autoregressive models due to incompatibility with standard KV cache and limited parallelization." The thing that was supposed to be their advantage came with a hidden bill.

The KV cache is the whole game, and diffusion can't pay

Here is the gift the causal mask gives AR, the one nobody mentions when they pitch diffusion. Because an AR token attends only to earlier tokens, once a token is produced its key and value vectors are frozen: nothing later can change them. So you compute them once and cache them. Every subsequent token reuses the entire cached history; each new step computes projections and feed-forward activations for one token rather than the whole sequence, and that token's attention simply reads the cached keys and values. The KV cache is not a nice optimization on top of AR decoding. It is what makes AR decoding cheap, and the entire modern serving stack (paged attention, continuous batching, speculative decoding) is scaffolding around it.
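A minimal NumPy sketch of why the cache is safe under a causal mask: a single attention head, with random matrices standing in for learned weights (all sizes and names here are illustrative assumptions). Recomputing everything and appending to a never-rewritten cache give the same outputs.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8                                    # hidden size (arbitrary for the sketch)
Wq, Wk, Wv = (rng.normal(size=(d, d)) / np.sqrt(d) for _ in range(3))
x = rng.normal(size=(5, d))              # embeddings of 5 tokens, in order


def softmax(scores):
    e = np.exp(scores - scores.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)


def causal_attention_full(x):
    """Recompute everything: each token attends to itself and earlier tokens."""
    q, k, v = x @ Wq, x @ Wk, x @ Wv
    scores = q @ k.T / np.sqrt(d)
    future = np.triu(np.ones_like(scores, dtype=bool), k=1)
    scores = np.where(future, -np.inf, scores)
    return softmax(scores) @ v


def causal_attention_cached(x):
    """One token at a time, appending its K/V to a cache that is never rewritten."""
    k_cache, v_cache, outputs = [], [], []
    for token in x:
        k_cache.append(token @ Wk)       # computed once, then frozen
        v_cache.append(token @ Wv)
        q = token @ Wq                   # only the new token needs a query
        scores = np.array(k_cache) @ q / np.sqrt(d)
        outputs.append(softmax(scores) @ np.array(v_cache))
    return np.array(outputs)


print(np.allclose(causal_attention_full(x), causal_attention_cached(x)))
```

The cached version never revisits an old key or value, and the check still passes. That is the property every cache-based serving trick leans on.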

Diffusion's bidirectional attention throws that gift away. When you unmask even a single token, it can change the representation at every other position, because everything attends to everything. There is nothing stable to cache. LLaDA's own paper concedes it uses vanilla multi-head attention "as LLaDA is incompatible with KV caching." So a vanilla dLLM does the brutal thing: it re-runs a full forward pass over the entire sequence at every denoising step. Its cost scales with sequence length times the number of steps. Parallel-in-principle, expensive-in-practice. You traded AR's "one cheap step per token" for "many full-sequence passes," and unless your step count is very low, you lose — and if you cut steps too aggressively to win the race, quality collapses, because you are committing too many tokens per pass with too little refinement.
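The "nothing stable to cache" claim can be checked on a toy attention layer. The sketch below (single head, random weights, made-up sizes, all of them assumptions) edits one input position, as happens when a mask is replaced by a real token, and reports which output rows move. In a multi-layer model those outputs feed the next layer's keys and values, so any row that moves means stale cache entries one layer up.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 8, 6                              # hidden size and sequence length (arbitrary)
Wq, Wk, Wv = (rng.normal(size=(d, d)) / np.sqrt(d) for _ in range(3))


def attention(x, causal):
    q, k, v = x @ Wq, x @ Wk, x @ Wv
    scores = q @ k.T / np.sqrt(d)
    if causal:
        future = np.triu(np.ones((len(x), len(x)), dtype=bool), k=1)
        scores = np.where(future, -np.inf, scores)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v


def changed_positions(causal, edited=3):
    """Edit one input position and list the output rows that change."""
    before = rng.normal(size=(n, d))
    after = before.copy()
    after[edited] = rng.normal(size=d)   # e.g. a mask embedding replaced by a token
    out_before = attention(before, causal)
    out_after = attention(after, causal)
    return [i for i in range(n) if not np.allclose(out_before[i], out_after[i])]


print("causal:       ", changed_positions(causal=True))
print("bidirectional:", changed_positions(causal=False))
```

Under the causal mask only the edited position and the ones after it should be reported; without the mask, every position should be. The earlier rows are exactly what an AR cache gets to keep and a vanilla dLLM has to recompute.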

The diffusion pitch was "stop being autoregressive." The fix that made diffusion fast was "be a little autoregressive again."

What actually made it fast: putting the sequence back

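The speedups that matter all came from reintroducing AR structure, not removing it. The recurring move is block-wise, semi-autoregressive decoding: generate blocks left to right so that finished blocks stay fixed and can be cached, and denoise in parallel only inside the current block.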

One tension survives every such fix: more tokens unmasked per step buys speed and spends quality. It is the permanent knob of diffusion decoding, the same way the throughput/latency knob is permanent for continuous batching in AR serving.
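The speed half of that knob can be put in rough numbers. The model below is deliberately crude and entirely an assumption: it counts token positions pushed through the network, ignores attention's quadratic term, and says nothing about quality or about how well each style uses the hardware. The function names and the example lengths are made up for illustration.

```python
import math


def ar_cached_work(prompt_len: int, gen_len: int) -> int:
    """Token positions processed by AR decoding with a KV cache."""
    return prompt_len + gen_len          # prefill once, then one new token per step


def dllm_uncached_work(prompt_len: int, gen_len: int, tokens_per_step: int) -> int:
    """Token positions processed by a cache-less dLLM: one full pass per step."""
    steps = math.ceil(gen_len / tokens_per_step)
    return steps * (prompt_len + gen_len)


prompt_len, gen_len = 256, 512
for k in (1, 4, 16, 64, 512):
    steps = math.ceil(gen_len / k)
    ratio = dllm_uncached_work(prompt_len, gen_len, k) / ar_cached_work(prompt_len, gen_len)
    print(f"tokens/step={k:>3}  steps={steps:>3}  work vs cached AR = {ratio:.0f}x")
```

In this accounting the extra work relative to cached AR is simply the number of denoising steps, so the only way down is to commit more tokens per pass. Work is not wall-clock time, since full-sequence passes parallelize well on a GPU, which is why a low enough step count can still win on latency.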

Where the commercial numbers come from

The headline throughput figures are real and they are also marketing, and you should hold both thoughts. Inception Labs' Mercury Coder reports on the order of 737–1,109 tok/s on H100s, with independent benchmarking (Artificial Analysis) clocking it at over 1,000 tok/s, roughly 5× faster than speed-optimized AR frontier models, and tying mid-tier AR models on Copilot Arena. Google's Gemini Diffusion, an "experimental research model" from I/O 2025, was reported at around 1,479 tok/s on average (a figure that excludes a sub-second startup delay). These are vendor or vendor-adjacent numbers on workloads the vendors chose: genuinely fast, but not peer-reviewed, quality-controlled throughput comparisons. Tellingly, both commercial wins are in code, a domain where structure is rigid and the parallel-refinement story works best.

Mid-2026, honestly

dLLMs stopped being a curiosity. For latency-critical, structured work — code generation, tight agent tool-use loops where time-to-completion dominates — the commercial diffusion models are fast and quality-competitive, and worth a real benchmark on your workload. As a drop-in for the serving stack you already run, not yet: by giving up the causal mask, diffusion gives up AR's clean KV-cache memory model, and production serving is still an open systems problem — late-2025 work is already fighting a dLLM-specific "memory footprint crisis." The lasting lesson isn't "diffusion beats autoregressive." It's that the KV cache was load-bearing all along, and any architecture that wants to be fast at scale has to earn its way back to something that caches.