How the estimate works

Serving memory breaks into three parts. Weights are the parameter count times the bytes per parameter — 2 bytes at fp16, 1 at fp8/int8, 0.5 at int4. The KV cache holds the keys and values for every token in context, for every layer, for every concurrent request: 2 × layers × KV-heads × head-dim × context × concurrency × bytes. Grouped-query attention (GQA) is why this term is smaller than it looks — a 70B model with 8 KV heads caches far less than its 64 attention heads would imply. Overhead — activations, memory fragmentation, the CUDA context, and the pager's slack — is the rest, here a flat percentage of the two real terms.

The numbers are first-order: a paged-attention server (vLLM, TensorRT-LLM) packs the KV cache more tightly, and real activation memory varies with the kernel. Use it to size a deployment, not to predict the last gigabyte.

The deeper reasoning behind each term is in how much VRAM it takes to serve an LLM, and the throughput side — how that memory becomes concurrency — in LLM serving capacity planning. Paying for an API instead of self-hosting? The LLM API cost calculator sizes the per-token bill.

LLM serving VRAM calculator

How the estimate works

Sources

Dispatches from the machines, in your inbox