Open any LLM serving dashboard and you will be handed a single number — tokens per second — as if generation were one smooth pipe. It isn't. Every request you send is two different machines wearing the same costume, and they want opposite things from the hardware. Almost everything interesting that has happened in inference serving over the last three years is the field slowly, reluctantly admitting that fact and acting on it.

Two phases, two bottlenecks

When a request arrives, the model first runs prefill: it reads your entire prompt in a single forward pass, builds the key/value cache for every token, and produces the first output token. Because all the prompt tokens go through at once, prefill is a sequence of big, dense matrix multiplications that saturate the GPU's compute units. It is compute-bound.

Then comes decode: the model generates the rest of the output one token at a time. Each step feeds the previous token back in and reads the growing KV cache plus the full set of model weights out of high-bandwidth memory, all to produce exactly one new token. The arithmetic per step is tiny; the data movement is enormous. Decode is memory-bandwidth-bound, and the GPU's compute units sit largely idle waiting on memory.

This is the whole story in one line: prefill is limited by how fast the chip can compute; decode is limited by how fast it can read memory. They are not two settings of one workload. They are two workloads.

Prefill and decode aren't fast and slow versions of the same thing. They're a compute job and a memory job, and a single GPU batch can be tuned for only one of them.
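To put rough numbers on the compute-versus-memory split, here is a back-of-the-envelope roofline sketch. It assumes a pass over the weights costs about 2 FLOPs per parameter per token and reads each parameter once, and it ignores activations and KV-cache traffic. The hardware figures are hypothetical round numbers, not the spec of any real GPU, and the function names are made up for illustration.

```python
def arithmetic_intensity(tokens_per_pass: int, bytes_per_param: float = 2.0) -> float:
    """FLOPs performed per byte of weights read in one forward pass.

    Assumes ~2 FLOPs per parameter per token and one read of every
    parameter per pass (a simplification: no activations, no KV cache).
    """
    return 2.0 * tokens_per_pass / bytes_per_param


def bottleneck(tokens_per_pass: int, peak_flops: float, mem_bandwidth: float,
               bytes_per_param: float = 2.0) -> str:
    """Classify a pass by comparing its intensity to the hardware ridge point."""
    ridge = peak_flops / mem_bandwidth  # FLOPs the chip can do per byte it reads
    intensity = arithmetic_intensity(tokens_per_pass, bytes_per_param)
    return "compute-bound" if intensity >= ridge else "memory-bound"


# Hypothetical accelerator: 1 PFLOP/s of fp16 compute, 3 TB/s of HBM bandwidth.
PEAK_FLOPS = 1.0e15
MEM_BANDWIDTH = 3.0e12

print("prefill, 2048-token prompt:", bottleneck(2048, PEAK_FLOPS, MEM_BANDWIDTH))
print("decode, 1 token per step:  ", bottleneck(1, PEAK_FLOPS, MEM_BANDWIDTH))
```

Under these assumptions the ridge point sits near 333 FLOPs per byte. A 2,048-token prefill lands far above it and a single-token decode step far below, which is the same split the prose describes.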

That split is why a lone throughput figure lies. The metric that matters for prefill is time to first token (TTFT) — it grows with prompt length. The metric for decode is time per output token (TPOT), the inter-token latency your user feels as the answer streams. You can tune a system to win one and lose the other, so any serving comparison that reports only "tokens/sec" is hiding the trade it actually made.
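A minimal sketch of measuring the two metrics separately, assuming you can record a timestamp for each streamed token. The helper names and the two example traces are invented for illustration; they are not the output of any real benchmark.

```python
from typing import Sequence


def ttft(request_sent: float, token_times: Sequence[float]) -> float:
    """Time to first token: delay between sending the request and token one."""
    return token_times[0] - request_sent


def tpot(token_times: Sequence[float]) -> float:
    """Time per output token: mean gap between consecutive streamed tokens."""
    if len(token_times) < 2:
        raise ValueError("need at least two tokens to measure TPOT")
    return (token_times[-1] - token_times[0]) / (len(token_times) - 1)


def tokens_per_second(request_sent: float, token_times: Sequence[float]) -> float:
    """The single headline number: tokens delivered over total wall time."""
    return len(token_times) / (token_times[-1] - request_sent)


# Two made-up requests, both sent at t=0 and both finishing 4 tokens at t=2.0.
request_a = [0.5, 1.0, 1.5, 2.0]  # quick first token, slow stream
request_b = [1.7, 1.8, 1.9, 2.0]  # slow first token, quick stream

for name, times in (("A", request_a), ("B", request_b)):
    print(name,
          "tokens/sec:", tokens_per_second(0.0, times),
          "TTFT:", round(ttft(0.0, times), 3),
          "TPOT:", round(tpot(times), 3))
```

Both traces report the same tokens per second, yet they would feel very different to a user: one starts fast and streams slowly, the other makes you wait and then streams quickly.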

Continuous batching: the first reckoning

The first big move wasn't about the split at all — it was about not wasting the GPU between requests. Old "static" batching waited for an entire batch to finish before starting the next, so a batch of ten finished at the speed of its slowest member while the others' slots sat empty.

Orca (OSDI 2022) introduced iteration-level scheduling, now universally called continuous batching, which admits new requests and retires completed ones at every decoding step. The GPU stays full as requests come and go. Paired with vLLM's PagedAttention (SOSP 2023), which stores the KV cache in non-contiguous pages to kill memory fragmentation, the gains were not incremental: Anyscale measured up to 23x the throughput of naive batching, and vLLM reported up to 24x the throughput of vanilla HuggingFace Transformers. This is the single largest throughput unlock in modern serving, and it is why vLLM and its descendants took over the serving layer.
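The difference between the two batching styles is easy to see in a toy simulation. The sketch below is not Orca's or vLLM's scheduler: it assumes every decoding step costs one time unit regardless of how many slots are occupied, ignores prefill entirely, and uses invented function names and request lengths.

```python
from collections import deque
from typing import Sequence


def static_batching_steps(output_lens: Sequence[int], slots: int) -> int:
    """Each batch runs until its longest request finishes, then the next starts."""
    steps = 0
    for i in range(0, len(output_lens), slots):
        steps += max(output_lens[i:i + slots])
    return steps


def continuous_batching_steps(output_lens: Sequence[int], slots: int) -> int:
    """Admit waiting requests into free slots at every decoding step."""
    waiting = deque(output_lens)  # tokens still to generate, per queued request
    active: list[int] = []        # tokens still to generate, per running request
    steps = 0
    while waiting or active:
        # Fill any free slots before this step runs.
        while waiting and len(active) < slots:
            active.append(waiting.popleft())
        steps += 1
        # Every active request emits one token; finished requests are retired.
        active = [left - 1 for left in active if left > 1]
    return steps


# Hypothetical workload: two long answers mixed in with six short ones.
lens = [100, 10, 10, 10, 100, 10, 10, 10]
print("static:    ", static_batching_steps(lens, slots=4))
print("continuous:", continuous_batching_steps(lens, slots=4))
```

On this made-up workload the static scheduler needs 200 steps and the continuous one 110, because short requests stop holding slots hostage to the longest member of their batch.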

But continuous batching created a subtler problem. Now prefills and decodes share one queue. When a long prompt shows up, its heavy compute-bound prefill stalls every in-flight decode behind it — your other users' tokens stop streaming while one newcomer's 8,000-token prompt grinds through. The throughput went up; the tail latency got jagged.

Chunked prefill: the half-measure

The first patch was chunked prefill: slice a long prompt into pieces and interleave those pieces with ongoing decode steps, so a giant prefill no longer monopolizes an iteration. vLLM, SGLang, and TensorRT-LLM all do this, and it genuinely smooths the spikes. But notice what it is — a scheduling trick to make two workloads share a GPU more politely. It hides the interference. It does not remove it, because the compute job and the memory job are still on the same silicon, still trading off TTFT against TPOT.
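One way to picture chunked prefill is as a per-iteration token budget: every in-flight decode spends one token of it, and whatever is left goes to the next slice of the pending prompt. The sketch below illustrates that idea only. It holds the number of decodes constant, and its names and numbers are hypothetical rather than taken from vLLM, SGLang, or TensorRT-LLM, each of which has its own policy and configuration.

```python
def plan_iterations(prompt_len: int, num_decodes: int,
                    token_budget: int) -> list[tuple[int, int]]:
    """Split one prefill across iterations that also carry ongoing decodes.

    Returns one (decode_tokens, prefill_tokens) pair per iteration.
    """
    if token_budget <= num_decodes:
        raise ValueError("budget leaves no room for any prefill tokens")
    chunk = token_budget - num_decodes  # budget left after one token per decode
    plan = []
    remaining = prompt_len
    while remaining > 0:
        take = min(chunk, remaining)
        plan.append((num_decodes, take))
        remaining -= take
    return plan


# An 8,000-token prompt arriving while 16 requests are mid-decode,
# with an assumed budget of 2,016 tokens per iteration.
plan = plan_iterations(prompt_len=8000, num_decodes=16, token_budget=2016)
for i, (decode_tokens, prefill_tokens) in enumerate(plan):
    print(f"iteration {i}: {decode_tokens} decode tokens + {prefill_tokens} prefill tokens")
```

The 16 decodes keep producing a token every iteration instead of freezing until the whole prompt is processed. Each of those iterations is still heavier than a decode-only step, which is the residual interference the paragraph above points at.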

Disaggregation: stop sharing the GPU

The honest fix is to stop pretending they belong together. Prefill/decode disaggregation runs the two phases on separate GPU pools and ships the KV cache from the prefill machines to the decode machines over the interconnect. Now each phase gets hardware, batch sizes, and parallelism tuned to its own bottleneck, and a long prefill can never stall a decode because they no longer occupy the same queue.
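The cost of shipping the KV cache can be estimated before committing to a topology. The sketch below uses the standard sizing for a transformer KV cache: one key and one value vector per layer, per KV head, per token. The model shape and the 50 GB/s link speed are hypothetical placeholders, not measurements of any particular model or interconnect.

```python
def kv_cache_bytes(num_tokens: int, num_layers: int, num_kv_heads: int,
                   head_dim: int, bytes_per_elem: int = 2) -> int:
    """Size of the KV cache for one sequence (the factor 2 is keys plus values)."""
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_elem * num_tokens


def transfer_seconds(num_bytes: int, link_bytes_per_sec: float) -> float:
    """Idealized transfer time: payload over bandwidth, ignoring latency and overlap."""
    return num_bytes / link_bytes_per_sec


# Hypothetical model: 32 layers, 8 KV heads of dimension 128, fp16 cache.
size = kv_cache_bytes(num_tokens=8000, num_layers=32, num_kv_heads=8, head_dim=128)
seconds = transfer_seconds(size, link_bytes_per_sec=50e9)  # assumed 50 GB/s link

print(f"KV cache for an 8,000-token prompt: {size / 1e9:.2f} GB")
print(f"Idealized transfer time: {seconds * 1e3:.1f} ms")
```

With these assumed numbers the cache is about 1 GB and crosses the link in roughly 20 ms. Whether that is acceptable depends on the TTFT budget, since the transfer sits between the end of prefill and the first decode step unless it is overlapped with other work.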

DistServe (OSDI 2024) made the case with numbers: by eliminating prefill–decode interference and co-optimizing each phase, it served up to 7.4x more requests or met up to 12.6x tighter SLOs than state-of-the-art colocated systems while keeping more than 90% of requests inside their latency target. Microsoft's Splitwise reached the same conclusion from the power-and-cost angle. Within two years the idea went from research to default: it now ships in vLLM, SGLang, TensorRT-LLM, LMDeploy, and NVIDIA's Dynamo, and runs in production at the scale of providers like DeepSeek.

The takeaway for anyone choosing a serving stack

You don't need disaggregation to serve a model. You need it when you serve many concurrent users under strict latency SLOs and your prompts are long enough to disrupt decodes. Its price is the KV-cache transfer and a more complex topology. Below that threshold, continuous batching plus chunked prefill on a single pool is simpler and usually enough, and other levers, from prefix caching to speculative decoding, attack the same latency budget without the topology cost.

But the mental model is the part to keep, because it outlives any one framework: never evaluate an inference setup on one number. Measure TTFT and TPOT separately, know which one your product actually lives or dies on, and remember that every serving optimization since 2022 has been a different answer to the same question — how do you keep a compute job and a memory job from ruining each other's day.