The question arrives in a planning meeting, and it is always phrased wrong. How many GPUs do we need to serve this model? Someone pulls up a spec sheet, points at the teraflops, divides by a guess, and writes a number on a slide. It is wrong in a predictable direction, because the spec sheet describes a quantity the workload barely uses.
Serving an LLM is not a FLOPs problem. It is a memory problem. Get that straight and the rest of capacity planning is napkin arithmetic.
Decode is memory-bound, and that changes everything
LLM inference has two phases on opposite sides of the roofline. Prefill processes the whole prompt in parallel: a big matrix multiply, genuinely compute-bound, where tensor cores earn their keep. Then comes decode: the model generates one token, appends it, and generates the next, autoregressively. Each step is small. To produce a single token, the GPU streams all of the model's weights out of HBM, plus the growing KV cache, and does comparatively little math with them.
That is the whole story. Decode is bottlenecked on memory bandwidth, not compute. Databricks says it plainly in their inference guide: text generation is "memory-bandwidth-bound," and what matters is how fast you move bytes, not how many FLOPs you issue. So of two GPUs with the same teraflops, the one with more and faster VRAM serves more users. The H100 SXM and H200 are the cleanest demonstration: identical compute silicon, identical 1,979 BF16 TFLOPS (with sparsity), yet the H200's 141 GB at 4.8 TB/s serves far more concurrent traffic than the H100's 80 GB at 3.35 TB/s. Same compute number. Different serving capacity.
Stop reading the FLOPs column. The capacity you can sell lives in the memory column, and it is set by the KV cache, not the tensor cores.
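A rough way to see the bandwidth ceiling is to divide memory bandwidth by the bytes each decode step has to stream. The sketch below is a napkin bound, not a benchmark: it assumes 70 GB of weights (a 70B-parameter model at about 1 byte per parameter), ignores KV cache reads and all kernel overhead, and uses a hypothetical helper name.

```python
def decode_steps_per_sec_upper_bound(bandwidth_bytes_per_sec: float,
                                     bytes_read_per_step: float) -> float:
    """Upper bound on decode steps/sec if every step streams these bytes once."""
    return bandwidth_bytes_per_sec / bytes_read_per_step


# Assumption for illustration: 70B parameters at ~1 byte/param (FP8/INT8).
WEIGHT_BYTES = 70e9

# Peak HBM bandwidth figures quoted in the text above.
H100_BANDWIDTH = 3.35e12  # bytes/sec
H200_BANDWIDTH = 4.8e12   # bytes/sec

h100_steps = decode_steps_per_sec_upper_bound(H100_BANDWIDTH, WEIGHT_BYTES)
h200_steps = decode_steps_per_sec_upper_bound(H200_BANDWIDTH, WEIGHT_BYTES)

print(f"H100: {h100_steps:.1f} steps/s")
print(f"H200: {h200_steps:.1f} steps/s")
print(f"ratio: {h200_steps / h100_steps:.2f}x")
```

Under these assumptions the bound is roughly 48 versus 69 decode steps per second, a 1.43x gap that comes entirely from bandwidth, since the compute figure is the same on both parts.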
The KV cache is your real budget
Every active request holds a KV cache: the keys and values for every token it has seen, kept in VRAM so the model doesn't recompute attention from scratch each step. This is the line item that caps concurrency. The per-token cost is a clean formula:
kv_bytes_per_token = 2 × num_layers × num_kv_heads × head_dim × bytes_per_element
The 2 is for storing both keys and values. num_kv_heads — not query heads — is what counts, which is why grouped-query attention (GQA) is such a capacity win: it shrinks this term directly.
Take a Llama-3-70B-class model: 80 layers, 8 KV heads (GQA), head dimension 128. In BF16 (2 bytes):
2 × 80 × 8 × 128 × 2 = 327,680 bytes per token, about 0.33 MB (320 KiB).
That sounds tiny until you multiply by context. A request running at 8,192 tokens of context holds 327,680 × 8,192 ≈ 2.68 GB of KV cache. A 128K-context request holds 327,680 × 131,072 ≈ 42.9 GB, roughly 30% of an H200's 141 GB, for a single user.
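The formula and the two context sizes above can be checked with a few lines of Python. This is a minimal sketch: the function name is illustrative, and the 64-head comparison is a hypothetical no-GQA variant added only to show how much the num_kv_heads term matters.

```python
def kv_bytes_per_token(num_layers: int, num_kv_heads: int,
                       head_dim: int, bytes_per_element: int) -> int:
    """KV cache bytes held per token: keys and values for every layer."""
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_element


# Llama-3-70B-class shape from the text: 80 layers, 8 KV heads, head_dim 128, BF16.
per_token = kv_bytes_per_token(num_layers=80, num_kv_heads=8,
                               head_dim=128, bytes_per_element=2)
print(f"per token: {per_token:,} bytes")

for context_tokens in (8_192, 131_072):
    per_request_gb = per_token * context_tokens / 1e9  # decimal GB
    print(f"{context_tokens:>7} tokens -> {per_request_gb:.2f} GB per request")

# Hypothetical comparison: the same shape with 64 KV heads (no GQA).
per_token_no_gqa = kv_bytes_per_token(80, 64, 128, 2)
print(f"without GQA: {per_token_no_gqa // per_token}x the bytes per token")
```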
A worked example: how many requests fit on one H200
Now the capacity calculation, end to end, on one H200 with 141 GB.
- Weights. 70B parameters. In BF16 that is 140 GB — it does not even fit. So you quantize. At FP8/INT8 (~1 byte/param) the weights are 70 GB.
- Usable VRAM. vLLM defaults gpu_memory_utilization to 0.9, reserving the rest for activations and overhead: 0.9 × 141 ≈ 127 GB.
- KV budget. 127 − 70 = 57 GB left for KV cache.
- Per-request footprint at 8K context. 2.68 GB, from above.
- Max concurrency. 57 / 2.68 ≈ 21 concurrent requests.
Twenty-one. On a 141 GB flagship — not because the GPU ran out of math, but because it ran out of memory to hold conversations. Want more? Three honest levers: quantize the KV cache itself (FP8 KV roughly halves the per-token bytes, pushing toward ~42 concurrent), shorten the context you provision for, or add GPUs. No FLOPs trick buys it back.
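The worked example can be reproduced as a short calculation. The sketch below assumes every request is provisioned for the full 8,192-token context and that FP8 KV exactly halves the per-token bytes; the function name is hypothetical.

```python
import math


def max_concurrency(vram_gb: float, utilization: float,
                    weights_gb: float, kv_gb_per_request: float) -> int:
    """Requests that fit in the VRAM left over after the weights are loaded."""
    kv_budget_gb = vram_gb * utilization - weights_gb
    if kv_budget_gb <= 0:
        return 0  # the weights alone exceed the usable memory
    return math.floor(kv_budget_gb / kv_gb_per_request)


KV_BYTES_PER_TOKEN_BF16 = 327_680  # from the per-token formula
CONTEXT_TOKENS = 8_192

kv_bf16_gb = KV_BYTES_PER_TOKEN_BF16 * CONTEXT_TOKENS / 1e9
kv_fp8_gb = kv_bf16_gb / 2  # assumption: FP8 KV halves the footprint

fp8_weights_bf16_kv = max_concurrency(141, 0.9, 70, kv_bf16_gb)
fp8_weights_fp8_kv = max_concurrency(141, 0.9, 70, kv_fp8_gb)
bf16_weights = max_concurrency(141, 0.9, 140, kv_bf16_gb)

print(f"FP8 weights, BF16 KV: {fp8_weights_bf16_kv}")
print(f"FP8 weights, FP8 KV:  {fp8_weights_fp8_kv}")
print(f"BF16 weights:         {bf16_weights}")
```

The three results are 21, 42, and 0: the last one is the BF16-weights case, where nothing is left for KV cache on a single card.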
Offline vs online: the SLO tax
The 21-request number is a memory ceiling. Whether you can actually run at it depends on what you're promising.
Offline / batch workloads — evals, bulk summarization, synthetic data — only care about total throughput. You pack the largest batch the KV cache allows and let latency float. Here you run near the memory ceiling.
Online serving is bounded by a latency SLO: TTFT for the first token, TPOT for each after. Bigger batches raise throughput but lengthen TPOT, since every request in the batch shares the same memory-bandwidth pipe each decode step. So you cap batch size below the memory limit to protect tail latency, and effective concurrency drops under 21. The memory math gives you a ceiling; the SLO tells you how far below it you actually live.
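One way to put a number on the SLO tax is a bandwidth-only floor on TPOT: each decode step has to stream the weights once plus every active request's KV cache. The sketch below is an idealized model, not a measurement. It assumes peak H200 bandwidth, 70 GB of weights, every request at the full 8K context, and a hypothetical 25 ms TPOT target; real TPOT will be higher, so the real cap will be lower.

```python
import math


def tpot_floor_seconds(batch: int, weights_bytes: float,
                       kv_bytes_per_request: float,
                       bandwidth_bytes_per_sec: float) -> float:
    """Lower bound on time per output token: bytes streamed / bandwidth."""
    return (weights_bytes + batch * kv_bytes_per_request) / bandwidth_bytes_per_sec


def max_batch_for_tpot(tpot_slo_seconds: float, weights_bytes: float,
                       kv_bytes_per_request: float,
                       bandwidth_bytes_per_sec: float,
                       memory_ceiling: int) -> int:
    """Largest batch that meets the TPOT floor, capped by the memory ceiling."""
    kv_bytes_allowed = tpot_slo_seconds * bandwidth_bytes_per_sec - weights_bytes
    if kv_bytes_allowed <= 0:
        return 0
    return min(memory_ceiling, math.floor(kv_bytes_allowed / kv_bytes_per_request))


WEIGHTS_BYTES = 70e9              # FP8 weights, as in the worked example
KV_PER_REQUEST = 327_680 * 8_192  # bytes at 8K context, BF16 KV
BANDWIDTH = 4.8e12                # H200 peak, bytes/sec
MEMORY_CEILING = 21               # from the memory math
TPOT_SLO = 0.025                  # hypothetical 25 ms target

tpot_at_ceiling = tpot_floor_seconds(MEMORY_CEILING, WEIGHTS_BYTES,
                                     KV_PER_REQUEST, BANDWIDTH)
batch_cap = max_batch_for_tpot(TPOT_SLO, WEIGHTS_BYTES, KV_PER_REQUEST,
                               BANDWIDTH, MEMORY_CEILING)

print(f"TPOT floor at batch {MEMORY_CEILING}: {tpot_at_ceiling * 1e3:.1f} ms")
print(f"batch cap under a {TPOT_SLO * 1e3:.0f} ms SLO: {batch_cap}")
```

Even in this best case the floor at batch 21 is about 26 ms, so a 25 ms target caps the batch at 18. The point of the sketch is the shape of the trade-off; the actual cap has to come from benchmarking.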
Continuous batching and PagedAttention raise the real number
The arithmetic above holds at scale only if the engine actually uses the KV budget; naive serving wastes most of it on fragmentation and padding. PagedAttention, from the vLLM paper, borrows OS virtual memory: it chops the KV cache into fixed-size blocks mapped through a page table, cutting waste to under 4% and letting requests share physical blocks. Continuous batching then swaps finished requests out and new ones in mid-flight, instead of waiting for the whole batch to drain. With the two together, the vLLM paper reports 2–4× the throughput of prior systems at the same latency. Prefix caching hits the other phase: when requests share a long system prompt, vLLM reuses the cached KV blocks instead of re-running prefill. None of these change the per-token formula. They change how little of your budget you waste, which is the same as raising effective concurrency.
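The block-table idea can be illustrated with a toy allocator. This is a sketch of the bookkeeping only, not vLLM's implementation: the class and method names are hypothetical, the 16-token block size is an assumption, and block sharing between requests is left out.

```python
import math

BLOCK_SIZE = 16  # tokens per KV block; an assumption for this sketch


class PagedKVAllocator:
    """Toy block table: maps each request to non-contiguous physical blocks."""

    def __init__(self, num_blocks: int):
        self.free_blocks = list(range(num_blocks))
        self.block_tables = {}  # request id -> list of physical block ids
        self.num_tokens = {}    # request id -> tokens stored so far

    def append_tokens(self, request_id: str, n: int) -> None:
        """Grow a request by n tokens, taking new blocks only when needed."""
        table = self.block_tables.get(request_id, [])
        total = self.num_tokens.get(request_id, 0) + n
        needed = math.ceil(total / BLOCK_SIZE) - len(table)
        if needed > len(self.free_blocks):
            raise MemoryError("out of KV blocks")
        table.extend(self.free_blocks.pop() for _ in range(needed))
        self.block_tables[request_id] = table
        self.num_tokens[request_id] = total

    def free(self, request_id: str) -> None:
        """Return a finished request's blocks to the pool immediately."""
        self.free_blocks.extend(self.block_tables.pop(request_id, []))
        self.num_tokens.pop(request_id, None)

    def wasted_slots(self) -> int:
        """Allocated token slots not yet holding a token (last block only)."""
        return sum(len(table) * BLOCK_SIZE - self.num_tokens[rid]
                   for rid, table in self.block_tables.items())


allocator = PagedKVAllocator(num_blocks=64)
allocator.append_tokens("a", 100)  # 7 blocks, 12 unused slots
allocator.append_tokens("b", 33)   # 3 blocks, 15 unused slots
print("wasted slots:", allocator.wasted_slots())

allocator.free("a")                # blocks are reusable right away
print("free blocks:", len(allocator.free_blocks))
```

Waste is bounded by one partially filled block per request, which is why it stays small compared with reserving a contiguous region sized for the maximum context.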
The recipe
- Weights: params × bytes_per_param at your quantization. If it doesn't fit with room to spare, quantize or shard.
- KV budget: (GPU_VRAM × utilization) − weights.
- Per-request KV: 2 × num_layers × num_kv_heads × head_dim × bytes × target_context.
- Max concurrency per GPU: KV budget ÷ per-request KV. Discount it for your latency SLO.
- Replica count: peak_demand_tokens_per_sec / sustained_tokens_per_sec_per_replica, measured, not guessed, at your batch size and context.
That last number you do not get from a spec sheet. You benchmark it on your model, your context distribution, your engine. The teraflops column never enters the calculation. It never did.
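The replica-count step is a division rounded up, since a fractional replica cannot be deployed. In the sketch below both throughput figures are hypothetical placeholders; the per-replica number in particular has to come from your own benchmark.

```python
import math


def replica_count(peak_demand_tokens_per_sec: float,
                  sustained_tokens_per_sec_per_replica: float) -> int:
    """Replicas needed to cover peak demand, rounded up to a whole replica."""
    if sustained_tokens_per_sec_per_replica <= 0:
        raise ValueError("per-replica throughput must be positive")
    return math.ceil(peak_demand_tokens_per_sec
                     / sustained_tokens_per_sec_per_replica)


# Hypothetical inputs: 12,000 tokens/sec at peak, 1,700 tokens/sec measured
# per replica at the target batch size and context.
replicas = replica_count(12_000, 1_700)
print("replicas:", replicas)
```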



