Here is a fact that decides the economics of running a language model: two requests sent to the same GPU almost never take the same amount of time. One asks for a yes/no; the other wants a thousand-token essay. Whatever your serving system does about that mismatch is, to a first approximation, your entire throughput story. Continuous batching is the answer that won, and the most useful way to understand it is to watch the naive approach fail first.

Static batching: everyone waits for the slowest

The obvious way to use a GPU efficiently is to batch: collect several requests, run them through the model together, amortize the cost of loading the weights across all of them. This is static (or dynamic) batching, and it works beautifully when every item in the batch is the same size — which, for image classification, it is.

For autoregressive generation it is a disaster, because outputs are variable-length. You pad the batch to the longest sequence and run it to completion. The request that needed eight tokens finishes in a fraction of the time, then its slot on the GPU sits idle (still allocated, still counted against memory, producing nothing) until the thousand-token request in the next lane is done. New requests queued behind the batch can't start until the whole thing drains. You are paying for a full GPU and using a sliver of it. The greater the variance in output lengths, the worse the waste.
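To put a number on that waste, here is a minimal sketch that counts how many slot-steps in a static batch actually emit a token. The function name and the output lengths are made up for illustration, and the model ignores prefill and padding on the prompt side.

```python
def static_batch_utilization(output_lengths):
    """Fraction of slot-steps that emit a real token when a static batch
    runs until its longest sequence finishes."""
    steps = max(output_lengths)                # the batch runs until the longest is done
    total_slots = steps * len(output_lengths)  # slot-steps the GPU holds for the batch
    useful = sum(output_lengths)               # slot-steps that actually emit a token
    return useful / total_slots


# Hypothetical output lengths, in tokens.
lengths = [8, 40, 200, 1000]
print(f"{static_batch_utilization(lengths):.1%}")  # prints 31.2%
```

With one thousand-token request in a batch of four, more than two thirds of the decode slots produce nothing.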

Continuous batching: reschedule every token

The fix, introduced as iteration-level scheduling in the Orca paper (Yu et al., OSDI 2022), is to stop thinking in batches and start thinking in steps. A generation is just a loop that emits one token per forward pass. So make the scheduler run on that same loop. At every decode step, check which sequences finished, evict them immediately, and admit waiting requests into the freed slots. The batch is reassembled every single token. No sequence ever waits for another to finish; the GPU never holds an idle lane while work is queued.

Static batching schedules requests. Continuous batching schedules tokens. That one change in granularity is the difference between a starved GPU and a saturated one.
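The difference in granularity can be sketched as a toy simulation. This is not how Orca or any real engine is implemented: it assumes a fixed number of slots, one token per running sequence per step, and no prefill cost, and the function names and request lengths are hypothetical.

```python
from collections import deque


def static_steps(output_lengths, slots):
    """Decode steps when requests run in fixed batches of `slots`,
    each batch lasting until its longest member finishes."""
    total = 0
    for i in range(0, len(output_lengths), slots):
        total += max(output_lengths[i:i + slots])
    return total


def continuous_steps(output_lengths, slots):
    """Decode steps when finished sequences are evicted and waiting
    requests are admitted at every step."""
    waiting = deque(output_lengths)  # tokens each queued request still needs
    running = []                     # tokens remaining for in-flight sequences
    steps = 0
    while waiting or running:
        # Admit queued requests into any free slots.
        while waiting and len(running) < slots:
            running.append(waiting.popleft())
        # One forward pass: every running sequence emits one token.
        running = [r - 1 for r in running]
        steps += 1
        # Evict sequences that just finished.
        running = [r for r in running if r > 0]
    return steps


# Hypothetical workload: two long requests among short ones, four slots.
lengths = [1000, 8, 8, 8, 1000, 8, 8, 8]
print(static_steps(lengths, slots=4))      # prints 2000
print(continuous_steps(lengths, slots=4))  # prints 1008
```

In the static schedule the second long request cannot start until the first batch drains. In the continuous one it is admitted at step 8, as soon as the short requests free their slots.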

Orca paired this with selective batching, which batches only the operations where it's safe to (the big matrix multiplies) while handling attention per-sequence, and reported up to a 36.9× throughput improvement over NVIDIA FasterTransformer at the same latency on GPT-3 175B. Anyscale's widely cited benchmark measured up to 23× over static batching while reducing p50 latency, with the gap widening as output-length variance grew. vLLM then fused continuous batching with PagedAttention (paged KV-cache memory) and reported 2–4× over FasterTransformer and Orca in its SOSP 2023 paper. The numbers vary with the workload, but the direction never does: this is the single largest lever in LLM serving, and it's why vLLM, SGLang, and the other modern engines all do it. NVIDIA ships the identical mechanism under the name in-flight batching.

The catch nobody mentions in the throughput chart

Continuous batching is not free, and the reason is the thing that makes it work. When you admit a new request mid-stream, the first thing it must do is prefill — process its entire prompt to build a KV cache. Prefill is compute-bound: a big, dense burst of matrix math. The decodes already running in the batch are memory-bandwidth-bound: each step moves a lot of weights to generate one token, leaving the GPU's arithmetic units mostly idle. (This split is the whole subject of prefill vs decode.)
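A back-of-envelope roofline check makes the compute-bound versus memory-bound split concrete. The sketch below looks at a single square weight matrix, counts only weight traffic, and uses round hardware figures that are assumptions chosen for illustration (roughly those of a recent datacenter GPU), not measurements.

```python
def arithmetic_intensity(tokens, d_model, bytes_per_param=2):
    """FLOPs per byte of weight traffic for one d_model x d_model matmul
    applied to `tokens` tokens (activation traffic is ignored)."""
    flops = 2 * tokens * d_model * d_model          # one multiply and one add per weight per token
    weight_bytes = bytes_per_param * d_model * d_model
    return flops / weight_bytes


# Assumed hardware figures, for illustration only.
PEAK_FLOPS = 312e12      # FLOP/s
PEAK_BANDWIDTH = 2.0e12  # bytes/s
RIDGE = PEAK_FLOPS / PEAK_BANDWIDTH  # intensity above which compute is the limit


def bound(tokens, d_model=4096):
    """Classify a step that pushes `tokens` tokens through the weights."""
    if arithmetic_intensity(tokens, d_model) >= RIDGE:
        return "compute-bound"
    return "memory-bound"


print(bound(2048))  # a 2048-token prefill: compute-bound
print(bound(16))    # a decode step over 16 sequences: memory-bound
```

With 16-bit weights the intensity works out to one FLOP per byte per token in the step, so a prefill over thousands of prompt tokens sits far above the ridge while a decode step over a few dozen sequences sits far below it.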

Orca and vanilla vLLM resolve the collision the blunt way: they stall the decodes to run the prefill. The result is a latency hazard hiding inside the throughput win — every newly admitted request can spike the time-to-first-token of others and stutter their inter-token latency. Your average throughput looks superb while your tail latency quietly degrades, which is exactly the metric a chat UI lives or dies on.
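A toy model shows how the average can hide the stall. It assumes every admitted prompt is prefilled in full before decoding resumes, and the step time, prefill cost per token, and prompt length are hypothetical values picked so the arithmetic is easy to follow.

```python
def inter_token_gaps(num_steps, decode_ms, prefill_arrivals, prefill_ms_per_token):
    """Gap in ms before each token of an already-running sequence.
    `prefill_arrivals` maps a decode step index to the prompt length (in
    tokens) of a request admitted at that step; its prefill runs in full
    before the decode step does."""
    gaps = []
    for step in range(num_steps):
        stall = prefill_arrivals.get(step, 0) * prefill_ms_per_token
        gaps.append(decode_ms + stall)
    return gaps


# Hypothetical numbers: 20 ms decode steps, one 4000-token prompt admitted at step 40.
gaps = inter_token_gaps(
    num_steps=100,
    decode_ms=20,
    prefill_arrivals={40: 4000},
    prefill_ms_per_token=0.125,
)
print(sum(gaps) / len(gaps))  # mean gap: 25.0 ms
print(max(gaps))              # worst gap: 520.0 ms
```

The mean inter-token gap barely moves, but the one token that lands behind the prefill arrives half a second late, which is the stutter a user sees.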

The frontier: stop the two phases from colliding

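The last two years of serving research are essentially one long argument about that collision. Two answers have emerged. Chunked prefill splits a long prompt into smaller pieces and runs each piece in the same step as the ongoing decodes, so no single prefill stalls the batch for long. Disaggregation runs prefill and decode on separate GPUs, so the two phases never share a step at all.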

The arc is clean. Static batching wasted the GPU. Continuous batching saturated it but turned a utilization problem into a scheduling problem. Chunked prefill and disaggregation are the answers to the scheduling problem. If you're choosing a serving engine in 2026, "does it do continuous batching" is no longer the question — they all do. The real question is what it does about the prefill–decode collision that continuous batching created.