How much VRAM do I need to run a 70B model?

At fp16 the weights alone are about 140 GB (70 billion params times 2 bytes), so you need two 80 GB GPUs (or an H100 NVL pair, ~188 GB) just to load it. Quantized to int4 the weights drop to about 35 GB and fit on a single 40 GB card — but you still must leave room for the KV cache and roughly 10-20% overhead on top.

Why does my LLM run out of memory at long context?

Because the KV cache grows linearly with sequence length and batch size, independent of the weights. Every token you process stores keys and values for every layer, so doubling context or doubling concurrent requests doubles that buffer. This is the exact problem PagedAttention was built to manage.

Does quantization reduce VRAM and at what quality cost?

Yes — it cuts weight memory roughly in proportion to the bit width, so int8 halves and int4 quarters the fp16 footprint. Modern 4-bit formats lose little measurable quality on most tasks, but quantization shrinks only the weights, not the KV cache, so it does not save you at high concurrency.

How Much VRAM Do You Need to Serve an LLM? A 2026 Sizing Guide

Most VRAM sizing goes wrong in the same place. Someone reads that a 70B model "needs about 140 GB," buys the GPUs to fit that number, deploys, watches it serve a demo beautifully, and then watches it crash the moment real users send long prompts in parallel. The weights fit. The server still ran out of memory. The thing nobody budgeted for is the KV cache.

Here is the honest breakdown of what actually consumes GPU memory when you serve an LLM, in the order that matters.

The weights: the part you can do on a napkin

Weight memory is the simple multiplication everyone already knows: number of parameters times bytes per parameter.

fp16 / bf16 = 2 bytes per parameter
int8 = 1 byte per parameter
int4 ≈ 0.5 bytes per parameter

So a 7B model in fp16 is 7e9 × 2 = 14 GB. A 70B model is 70e9 × 2 = 140 GB. That is the entire trick behind the rule of thumb "fp16 GB ≈ 2× the parameter count in billions." Halve it for int8, quarter it for int4. The table at the top of this piece is just that arithmetic applied to the three sizes you actually deploy.

Hugging Face's own model memory estimator makes the same point with a caveat worth repeating: the figure it reports "tells you how much memory is needed to purely load the model in, not to perform inference." For inference, it says, "you can expect to add up to an additional 20%" (a finding it attributes to EleutherAI) for activations and runtime overhead. So your real floor is weights × 1.1 to 1.2.

That is where most guides stop. It is also where most production incidents begin.

The KV cache: the part that actually OOMs you

When a transformer generates text, it caches the key and value tensors for every token it has already seen, at every layer, so it does not recompute them. That cache is not part of the weights. It is allocated per request, and it grows with every token.

The size, for grouped-query attention, is:

2 × num_layers × num_kv_heads × head_dim × seq_len × batch_size × bytes_per_element

The 2 is for keys and values. The terms that bite are seq_len and batch_size — the two things you do not control at deploy time. Your users do.

Work it for Llama-3-70B (80 layers, 8 KV heads, head dimension 128) in bf16. A single request at 4,096 tokens needs about 1.25 GB of KV cache. Run a batch of 8 at that length and you are at roughly 10 GB — on top of the 140 GB of weights. Push context to 32K or stack 32 concurrent users and that buffer balloons past the weights themselves. This is not a corner case; it is what "production traffic" means.

The weights are a fixed cost you pay once. The KV cache is a variable cost you pay per token, per user, simultaneously — and it is the number that decides whether your server survives Tuesday.

This is precisely why vLLM exists. The PagedAttention paper (Kwon et al., SOSP 2023) opens by naming the problem directly: the KV cache "is huge and grows and shrinks dynamically," and existing systems waste it through "fragmentation and redundant duplication." Their fix — managing the cache as non-contiguous fixed-size blocks, borrowed straight from OS virtual memory — recovered enough wasted space to improve throughput 2-4× at the same latency. You are not choosing whether to deal with the KV cache. You are only choosing whether your serving engine deals with it well.

Headroom: the 10-20% you forget until it's gone

On top of weights and KV cache, the runtime needs working room: activations for the forward pass, CUDA context, the framework's own buffers, fragmentation slack. The conventional allowance is 10-20% of total, and the Hugging Face guidance above lands in the same range. Treat a GPU as "full" at roughly 85% of its nameplate VRAM, not 100%.

That nameplate matters. An NVIDIA H100 SXM ships with 80 GB of HBM3; the H100 NVL pairs two boards for 188 GB total. The A100 comes in 40 GB and 80 GB variants. So a 70B model at fp16 (140 GB weights, before cache and overhead) genuinely needs two 80 GB cards — which is why people reach for quantization or smaller cards and a tighter context budget. If you are picking the card itself, that trade-off is its own decision: see H100 vs H200 vs A100 vs L40S.

Quantization: cuts the weights, not the cache

Quantization is the most effective lever on weight memory. int8 halves it; int4 quarters it. A 70B model drops from 140 GB to about 35 GB at int4 — suddenly a single 40 GB GPU is in play. Modern 4-bit formats lose remarkably little measurable quality on most workloads; the format you pick (and where it runs) is the real question, covered in GGUF vs GPTQ vs AWQ.

But note what quantization does not touch. Most engines keep the KV cache in fp16 or bf16 by default, so shrinking the weights 4× does nothing for the buffer that scales with your traffic. KV-cache quantization is a separate, newer setting you have to opt into. Quantize the weights to fit the model; you still size the cache to survive the load.

The sizing recipe

Weights = params × bytes-per-param (2 / 1 / 0.5 for fp16 / int8 / int4).
KV cache = the GQA formula above, at your peak batch size and your max context — not the demo's.
Overhead = add 10-20% on top, and stop at ~85% of nameplate VRAM.

Size for the cache, not just the weights, and you stop being surprised on Tuesday.

Model size	Weights @ fp16	Weights @ int8	Weights @ int4
7B	~14 GB	~7 GB	~3.5 GB
13B	~26 GB	~13 GB	~6.5 GB
70B	~140 GB	~70 GB	~35 GB

How Much VRAM Do You Need to Serve an LLM? A 2026 Sizing Guide

The weights: the part you can do on a napkin

The KV cache: the part that actually OOMs you

Headroom: the 10-20% you forget until it's gone

Quantization: cuts the weights, not the cache

The sizing recipe

Frequently asked

Dex Mareno

Dispatches from the machines, in your inbox

How Much VRAM Do You Need to Serve an LLM? A 2026 Sizing Guide

The weights: the part you can do on a napkin

The KV cache: the part that actually OOMs you

Headroom: the 10-20% you forget until it's gone

Quantization: cuts the weights, not the cache

The sizing recipe

Frequently asked

Dex Mareno

Continue reading

Mixture-of-Experts vs Dense Models for Agents: The VRAM Bill You Didn't Budget For

MCP vs Function Calling: When You Actually Need a Server

BentoML vs Ray Serve vs KServe: Choosing a Model-Serving Framework

Dispatches from the machines, in your inbox