Most VRAM sizing goes wrong in the same place. Someone reads that a 70B model "needs about 140 GB," buys the GPUs to fit that number, deploys, watches it serve a demo beautifully, and then watches it crash the moment real users send long prompts in parallel. The weights fit. The server still ran out of memory. The thing nobody budgeted for is the KV cache.
Here is the honest breakdown of what actually consumes GPU memory when you serve an LLM, in the order that matters.
The weights: the part you can do on a napkin
Weight memory is the simple multiplication everyone already knows: number of parameters times bytes per parameter.
- fp16 / bf16 = 2 bytes per parameter
- int8 = 1 byte per parameter
- int4 ≈ 0.5 bytes per parameter
So a 7B model in fp16 is 7e9 × 2 = 14 GB. A 70B model is 70e9 × 2 = 140 GB. That is the entire trick behind the rule of thumb "fp16 GB ≈ 2× the parameter count in billions." Halve it for int8, quarter it for int4. The table at the top of this piece is just that arithmetic applied to the three sizes you actually deploy.
Hugging Face's own model memory estimator makes the same point with a caveat worth repeating: the figure it reports "tells you how much memory is needed to purely load the model in, not to perform inference." For inference, it says, "you can expect to add up to an additional 20%" (a finding it attributes to EleutherAI) for activations and runtime overhead. So your real floor is weights × 1.1 to 1.2.
That is where most guides stop. It is also where most production incidents begin.
The KV cache: the part that actually OOMs you
When a transformer generates text, it caches the key and value tensors for every token it has already seen, at every layer, so it does not recompute them. That cache is not part of the weights. It is allocated per request, and it grows with every token.
The size, for grouped-query attention, is:
2 × num_layers × num_kv_heads × head_dim × seq_len × batch_size × bytes_per_element
The 2 is for keys and values. The terms that bite are seq_len and batch_size — the two things you do not control at deploy time. Your users do.
Work it for Llama-3-70B (80 layers, 8 KV heads, head dimension 128) in bf16. A single request at 4,096 tokens needs about 1.25 GB of KV cache. Run a batch of 8 at that length and you are at roughly 10 GB — on top of the 140 GB of weights. Push context to 32K or stack 32 concurrent users and that buffer balloons past the weights themselves. This is not a corner case; it is what "production traffic" means.
The weights are a fixed cost you pay once. The KV cache is a variable cost you pay per token, per user, simultaneously — and it is the number that decides whether your server survives Tuesday.
This is precisely why vLLM exists. The PagedAttention paper (Kwon et al., SOSP 2023) opens by naming the problem directly: the KV cache "is huge and grows and shrinks dynamically," and existing systems waste it through "fragmentation and redundant duplication." Their fix — managing the cache as non-contiguous fixed-size blocks, borrowed straight from OS virtual memory — recovered enough wasted space to improve throughput 2-4× at the same latency. You are not choosing whether to deal with the KV cache. You are only choosing whether your serving engine deals with it well.
Headroom: the 10-20% you forget until it's gone
On top of weights and KV cache, the runtime needs working room: activations for the forward pass, CUDA context, the framework's own buffers, fragmentation slack. The conventional allowance is 10-20% of total, and the Hugging Face guidance above lands in the same range. Treat a GPU as "full" at roughly 85% of its nameplate VRAM, not 100%.
That nameplate matters. An NVIDIA H100 SXM ships with 80 GB of HBM3; the H100 NVL pairs two boards for 188 GB total. The A100 comes in 40 GB and 80 GB variants. So a 70B model at fp16 (140 GB weights, before cache and overhead) genuinely needs two 80 GB cards — which is why people reach for quantization or smaller cards and a tighter context budget. If you are picking the card itself, that trade-off is its own decision: see H100 vs H200 vs A100 vs L40S.
Quantization: cuts the weights, not the cache
Quantization is the most effective lever on weight memory. int8 halves it; int4 quarters it. A 70B model drops from 140 GB to about 35 GB at int4 — suddenly a single 40 GB GPU is in play. Modern 4-bit formats lose remarkably little measurable quality on most workloads; the format you pick (and where it runs) is the real question, covered in GGUF vs GPTQ vs AWQ.
But note what quantization does not touch. Most engines keep the KV cache in fp16 or bf16 by default, so shrinking the weights 4× does nothing for the buffer that scales with your traffic. KV-cache quantization is a separate, newer setting you have to opt into. Quantize the weights to fit the model; you still size the cache to survive the load.
The sizing recipe
- Weights = params × bytes-per-param (2 / 1 / 0.5 for fp16 / int8 / int4).
- KV cache = the GQA formula above, at your peak batch size and your max context — not the demo's.
- Overhead = add 10-20% on top, and stop at ~85% of nameplate VRAM.
Size for the cache, not just the weights, and you stop being surprised on Tuesday.



