Every token an LLM emits is the product of two very different computations, and almost everything hard about serving models at scale comes from pretending they are one.
The first phase, prefill, reads your entire prompt and builds the key-value cache in a single forward pass. It is compute-bound (it saturates the GPU's matrix units), and it is what you wait on before the first token appears. The metric it owns is time-to-first-token (TTFT). The second phase, decode, generates output one token at a time, each step re-reading the model weights and that growing KV cache from memory. It is memory-bandwidth-bound, it leaves the compute units mostly idle, and it owns time-per-output-token (TPOT), the cadence of the stream you actually read. (If that split is new to you, our prefill vs decode primer is the prerequisite for this piece.)
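A back-of-the-envelope way to see the compute-bound versus bandwidth-bound split is arithmetic intensity: FLOPs performed per byte of weights read. The sketch below is a rough model, not a profiler. It assumes about 2 FLOPs per parameter per token, ignores KV-cache and activation traffic, and uses a hypothetical 70B-parameter model with an assumed hardware ridge point of 300 FLOPs per byte. All of those numbers are illustrative.

```python
def arithmetic_intensity(params: float, tokens_per_pass: int,
                         bytes_per_param: int = 2) -> float:
    """FLOPs per byte of weights read for one forward pass.

    Rough model: ~2 FLOPs per parameter per token, and the weights are
    read once per pass no matter how many tokens share that pass.
    """
    flops = 2.0 * params * tokens_per_pass
    bytes_read = params * bytes_per_param
    return flops / bytes_read


def bound_by(intensity: float, ridge_flops_per_byte: float) -> str:
    """Classify a pass against the hardware's compute/bandwidth ridge point."""
    return "compute" if intensity >= ridge_flops_per_byte else "memory bandwidth"


PARAMS = 70e9   # assumed model size, illustrative only
RIDGE = 300.0   # assumed accelerator ridge point in FLOPs per byte

prefill_ai = arithmetic_intensity(PARAMS, tokens_per_pass=4096)  # whole prompt at once
decode_ai = arithmetic_intensity(PARAMS, tokens_per_pass=1)      # one new token

print("prefill:", prefill_ai, "->", bound_by(prefill_ai, RIDGE))
print("decode: ", decode_ai, "->", bound_by(decode_ai, RIDGE))
```

Under this model, batching many decode sequences together raises `tokens_per_pass` and with it the intensity, which is why servers batch decode steps in the first place.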
Put both on the same GPU, batched together the way vLLM's default mode and every classic server do, and they fight. A single long prompt arriving mid-stream throws a heavy prefill into the batch, monopolizes the compute units, and every other user's token stream hitches while it runs. The DistServe team named this prefill-decoding interference, and they made the sharper observation underneath it: colocation doesn't just cause stalls, it couples your resource and parallelism decisions. Tune the batch for snappy TTFT and you starve decode throughput; tune it for decode and first-token latency balloons. One knob, two jobs, pulling in opposite directions.
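To make the hitch concrete, here is a toy latency model of a colocated batch. It is a sketch under made-up costs: `decode_step_ms` and `prefill_ms_per_token` are hypothetical constants, not measurements, and real step time is not a simple sum. The point is only the shape of the result: every streaming user sees one very long inter-token gap when the long prompt lands.

```python
def colocated_step_ms(num_decoding: int, prefill_tokens: int,
                      decode_step_ms: float = 25.0,
                      prefill_ms_per_token: float = 0.25) -> float:
    """Wall time of one batch step that mixes decode with an unchunked prefill.

    Both cost constants are illustrative assumptions.
    """
    decode_cost = decode_step_ms if num_decoding > 0 else 0.0
    return decode_cost + prefill_tokens * prefill_ms_per_token


# An 8000-token prompt arrives at step 3 while 16 users are streaming.
arrivals = {3: 8000}
gaps_ms = [colocated_step_ms(16, arrivals.get(step, 0)) for step in range(6)]
print(gaps_ms)  # the gap at index 3 is the stall every streaming user feels
```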
Split the pipeline, ship the cache
Disaggregated serving makes the obvious move once you accept the phases are different machines wearing one coat: give each its own GPUs. A pool of prefill workers does nothing but ingest prompts and produce KV caches. A separate pool of decode workers does nothing but generate. A request hits prefill first; its KV cache is then handed to a decode worker, which streams the answer.
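The request flow described above can be sketched in a few lines. This is an in-process toy, not any framework's API: `PrefillWorker`, `DecodeWorker`, `Router`, and `KVCache` are hypothetical names, the "cache" is just a token count standing in for real tensors, and round-robin pairing is an assumed placeholder for a real scheduling policy.

```python
from dataclasses import dataclass
from itertools import cycle


@dataclass
class KVCache:
    request_id: str
    num_tokens: int  # stand-in for the real per-layer key/value tensors


class PrefillWorker:
    def __init__(self, name: str):
        self.name = name

    def prefill(self, request_id: str, prompt_tokens: list) -> KVCache:
        # One pass over the whole prompt produces the cache.
        return KVCache(request_id, len(prompt_tokens))


class DecodeWorker:
    def __init__(self, name: str):
        self.name = name

    def decode(self, kv: KVCache, max_new_tokens: int) -> list:
        out = []
        for i in range(max_new_tokens):
            out.append(f"tok{i}")  # placeholder for a sampled token
            kv.num_tokens += 1     # the cache grows by one entry per step
        return out


class Router:
    """Pairs a prefill worker with a decode worker for each request."""

    def __init__(self, prefill_pool: list, decode_pool: list):
        self._prefill = cycle(prefill_pool)
        self._decode = cycle(decode_pool)

    def handle(self, request_id: str, prompt_tokens: list, max_new_tokens: int):
        p, d = next(self._prefill), next(self._decode)
        kv = p.prefill(request_id, prompt_tokens)
        # In a real system the KV transfer between GPUs happens here.
        return p.name, d.name, d.decode(kv, max_new_tokens)


router = Router([PrefillWorker("prefill-0"), PrefillWorker("prefill-1")],
                [DecodeWorker("decode-0")])
print(router.handle("req-1", ["a", "b", "c"], max_new_tokens=3))
```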
That handoff is the whole engineering problem. The KV cache for a long prompt can run to gigabytes, and it has to cross from one GPU's VRAM to another's before decoding can start. The 2026 production stacks solve it with a dedicated transport: NVIDIA Dynamo and llm-d use NIXL to copy KV tensors directly VRAM-to-VRAM over NVLink or InfiniBand. The detail that makes it viable is that the transfer is non-blocking, so the prefill GPU keeps serving other requests while the bytes move.
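A quick worked estimate shows where the gigabytes come from. The sketch assumes a Llama-3-70B-style shape (80 layers, 8 KV heads under grouped-query attention, head dimension 128, 16-bit cache entries) and a 32,000-token prompt. The link speeds are assumed round numbers, and the transfer time ignores protocol overhead and any overlap with compute.

```python
def kv_bytes_per_token(num_layers: int, num_kv_heads: int, head_dim: int,
                       bytes_per_elem: int = 2) -> int:
    """Bytes of KV cache one token occupies: a key and a value per layer."""
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_elem


def transfer_seconds(num_bytes: int, link_gb_per_s: float) -> float:
    """Idealized transfer time at a given link bandwidth in GB/s."""
    return num_bytes / (link_gb_per_s * 1e9)


# Assumed model shape and prompt length, for illustration only.
per_token = kv_bytes_per_token(num_layers=80, num_kv_heads=8, head_dim=128)
total_bytes = per_token * 32_000

print(per_token)                           # 327680 bytes per token
print(total_bytes / 1e9)                   # about 10.5 GB for the whole prompt
print(transfer_seconds(total_bytes, 50))   # assumed 50 GB/s link: about 0.21 s
print(transfer_seconds(total_bytes, 450))  # assumed 450 GB/s link: about 0.023 s
```

On these assumptions the handoff costs a fifth of a second on the slower link, which is why the fabric and the non-blocking transfer matter as much as the split itself.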
Disaggregation doesn't make inference cheaper. It converts an interference problem into a networking-and-scheduling problem — and that is only a trade worth making once you have the scale to win it.
The payoff, when it lands, is large. DistServe (OSDI 2024) reported serving 7.4× more requests, or holding a 12.6× tighter SLO, than state-of-the-art colocated systems while keeping more than 90% of requests inside their latency targets — precisely because it could size and parallelize prefill and decode separately and stop them from interfering. At the top end, Moonshot's Mooncake, the KV-cache-centric platform behind Kimi, runs Kimi K2 across 128 H200 GPUs with prefill-decode disaggregation, reporting roughly 224k tokens/sec of prefill and 288k tokens/sec of decode throughput. By mid-2026 disaggregation is no longer exotic: it's a first-class mode in Dynamo, llm-d, vLLM and SGLang, each with its own KV connector.
The line where it stops paying
Here is the part the architecture diagrams leave out. Disaggregation has a tax: the KV transfer, the extra interconnect, and a router that has to pair a prefill worker with a decode worker for every request. Below a certain scale, that tax exceeds the interference it removes.
The colocated camp didn't stand still. Chunked prefill — splitting a long prompt into slices and interleaving them with decode steps in the same continuous batch, the Sarathi-Serve approach — claws back most of the anti-interference benefit without moving a single byte between GPUs. On one node, with mixed traffic and no separate TTFT and TPOT contracts to honor, chunked-prefill colocation is simpler to run and often just plain faster. It's also the natural sibling of continuous batching, which you're probably already running.
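The interleaving can be sketched as a per-step token budget: decode sequences each claim one token first, and whatever budget is left goes to the next slice of a waiting prompt. This is a simplified illustration in the spirit of chunked prefill, not Sarathi-Serve's or any server's actual scheduler. The function name, the budget of 512, and the rule that decode sequences never finish are all assumptions made to keep the example short.

```python
from collections import deque


def chunked_prefill_schedule(num_decoding: int, prompts: list, token_budget: int) -> list:
    """Plan batch steps until every queued prompt has been prefilled.

    prompts: list of (request_id, prompt_length) pairs.
    Returns one (decode_tokens, [(request_id, chunk_length), ...]) per step.
    """
    pending = deque(prompts)
    steps = []
    while pending:
        if num_decoding >= token_budget:
            raise ValueError("token budget leaves no room for prefill")
        budget = token_budget - num_decoding  # decode tokens are served first
        chunks, finished = [], 0
        while pending and budget > 0:
            req_id, remaining = pending.popleft()
            take = min(remaining, budget)
            chunks.append((req_id, take))
            budget -= take
            if remaining > take:
                pending.appendleft((req_id, remaining - take))  # rest waits a step
            else:
                finished += 1
        steps.append((num_decoding, chunks))
        num_decoding += finished  # a fully prefilled prompt starts decoding
    return steps


# 16 users streaming, one 1000-token prompt arrives, 512 tokens per step.
for step in chunked_prefill_schedule(16, [("A", 1000)], token_budget=512):
    print(step)
```

Every step still carries all 16 decode tokens, so no stream skips a beat; the long prompt simply takes three steps to ingest instead of one.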
So the real decision isn't "disaggregated or not." It's a threshold question. You reach for disaggregation when three things are true at once: you have genuinely distinct and tight TTFT and TPOT SLOs (an interactive product where both first-token and stream smoothness are contractual); your traffic mixes long prompts with long generations so the phases really do collide; and you have enough GPUs and a fast enough fabric that running two specialized pools beats running one general one. Reasoning models and long-context agents — heavy prefills, long decodes, latency-sensitive — sit squarely in that zone, which is exactly why the hyperscalers built this first.
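Written as a checklist, the three conditions look like the sketch below. The field names and both thresholds (`min_collision_share`, `min_gpus`) are hypothetical placeholders chosen for illustration, not recommended values; the only thing taken from the argument above is that all three conditions must hold together.

```python
from dataclasses import dataclass


@dataclass
class Workload:
    separate_ttft_tpot_slos: bool         # both latencies are contractual
    long_prompt_long_output_share: float  # fraction of traffic where phases collide
    num_gpus: int
    fast_fabric: bool                     # e.g. NVLink or InfiniBand between pools


def should_disaggregate(w: Workload,
                        min_collision_share: float = 0.2,  # assumed threshold
                        min_gpus: int = 16) -> bool:       # assumed threshold
    """True only when all three conditions hold at once."""
    tight_slos = w.separate_ttft_tpot_slos
    phases_collide = w.long_prompt_long_output_share >= min_collision_share
    enough_scale = w.num_gpus >= min_gpus and w.fast_fabric
    return tight_slos and phases_collide and enough_scale


agent_fleet = Workload(True, 0.6, 128, True)
small_shop = Workload(False, 0.1, 4, False)
print(should_disaggregate(agent_fleet), should_disaggregate(small_shop))
```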
If you're serving a few GPUs of modest, bursty load, disaggregation is a beautiful answer to a question you don't have yet. Turn on chunked prefill, watch your TTFT and TPOT curves, and split the pipeline the day they start fighting in your traces — not before.



