Walk into any GPU-buying conversation and the first number on the table is peak FLOPS. It is the number on the marketing slide and the number people compare across cards like horsepower. For training, it is largely the right number to care about. For serving a model to users, it is mostly a distraction, and the reason is buried one line lower on the same spec sheet.
Decode is a memory problem wearing a compute costume
An LLM does two very different things at inference time. Prefill ingests your prompt: it runs the whole input through the model in parallel, a dense matrix-matrix operation that genuinely saturates the Tensor Cores. This is the compute-bound phase, and FLOPS matter here.
Then comes decode, generating the answer one token at a time. Each new token is a matrix-vector product: load the entire set of model weights out of HBM, multiply against a single token's worth of activations, append to the KV cache, repeat. Each weight is read once and used for one multiply-add per sequence in the batch, so at small batch sizes the arithmetic intensity of this loop is on the order of one or two FLOPs per byte moved. An H100 needs roughly 300 FLOPs per byte (989 TFLOPS over 3.35TB/s) to keep its Tensor Cores busy, so decode lands hundreds of times below the compute roofline. The Tensor Cores sit mostly idle, waiting on memory. The limit you hit is how fast the card can stream weights and KV cache out of HBM, not how fast it can multiply.
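The gap is easy to put numbers on. The sketch below is a back-of-envelope roofline check, not a profiler. It assumes each weight is read from HBM once per decode step and used for one multiply-add (2 FLOPs) per sequence in the batch, and it ignores KV-cache and activation traffic. The function names are illustrative; the H100 SXM figures (989 dense BF16 TFLOPS, 3.35TB/s) are the ones this article quotes for the card.

```python
def decode_intensity(batch_size: int, bytes_per_weight: float) -> float:
    """FLOPs per byte of weight traffic in one decode step (simplified model).

    Assumption: every weight is loaded once and used for one multiply-add
    (2 FLOPs) per sequence in the batch. KV-cache reads are ignored.
    """
    flops_per_weight = 2 * batch_size
    return flops_per_weight / bytes_per_weight


def ridge_point(peak_flops: float, bandwidth_bytes_per_s: float) -> float:
    """Arithmetic intensity at which a card stops being memory-bound."""
    return peak_flops / bandwidth_bytes_per_s


# H100 SXM figures as quoted in this article.
H100_PEAK_BF16 = 989e12   # FLOP/s, dense BF16
H100_BANDWIDTH = 3.35e12  # bytes/s

ridge = ridge_point(H100_PEAK_BF16, H100_BANDWIDTH)

for batch in (1, 8, 64):
    ai = decode_intensity(batch, bytes_per_weight=2)  # BF16 = 2 bytes per weight
    print(f"batch={batch:>3}  intensity={ai:6.1f} FLOP/B  "
          f"({ai / ridge:.1%} of the ~{ridge:.0f} FLOP/B ridge)")
```

Under these assumptions even a batch of 64 BF16 sequences sits well below the ridge, which is why the Tensor Cores spend decode waiting.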
For most production agents, decode dominates wall-clock time. That means the spec that governs your throughput is memory bandwidth, and the spec that governs your batch size, and therefore your throughput per dollar, is VRAM capacity.
You are not buying a calculator. You are buying a memory bus with a calculator attached.
The H200 is the experiment that settles the argument
You do not have to take the theory on faith, because NVIDIA shipped the controlled experiment. The H200 is the same Hopper compute silicon as the H100 — identical 989 dense BF16 TFLOPS, identical FP8 Tensor Core rates, zero extra math. The only thing that changed is the memory: 141GB of HBM3e at roughly 4.8TB/s, against the H100's 80GB at ~3.35TB/s.
If inference were compute-bound, that swap would do nothing. Instead, NVIDIA's own TensorRT-LLM numbers show the H200 delivering up to ~1.9x the Llama-2-70B throughput of an H100. Same FLOPS, nearly double the tokens — the entire gain comes from ~1.4x the bandwidth and the headroom to hold a bigger batch. It is the cleanest natural experiment in the lineup, and it points one direction: for decode, memory is the performance.
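One way to sanity-check that result is a bandwidth-only ceiling: if every decode step must stream all the weights once, steps per second cannot exceed bandwidth divided by weight bytes. The sketch below assumes a round 70 billion parameters stored at one byte each (FP8). Both are illustrative assumptions rather than NVIDIA's benchmark configuration, and the function name is made up for this example.

```python
def max_decode_steps_per_s(bandwidth_bytes_per_s: float, weight_bytes: float) -> float:
    """Upper bound on decode steps per second if weights are streamed once per step."""
    return bandwidth_bytes_per_s / weight_bytes


# Assumption: ~70e9 parameters at 1 byte each (FP8). Not a measured footprint.
WEIGHT_BYTES = 70e9 * 1.0

h100 = max_decode_steps_per_s(3.35e12, WEIGHT_BYTES)  # H100: ~3.35 TB/s
h200 = max_decode_steps_per_s(4.8e12, WEIGHT_BYTES)   # H200: ~4.8 TB/s

print(f"H100 ceiling: {h100:.1f} steps/s")
print(f"H200 ceiling: {h200:.1f} steps/s")
print(f"ratio: {h200 / h100:.2f}x")
```

The ratio comes out near 1.43x whatever model size you plug in, so bandwidth alone does not reach ~1.9x. The remainder has to come from the larger batch that the extra capacity allows.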
Why the KV cache makes capacity non-negotiable
The second-order reason capacity matters is the KV cache. Every active sequence keeps a key-value cache that grows with its length, and the total grows with how many sequences you batch concurrently. The PagedAttention work behind vLLM exists precisely because naive KV-cache allocation, which reserves one contiguous region per request sized for its maximum possible length, fragmented and over-reserved GPU memory so badly that systems wasted most of it, capping how many requests they could batch.
That is the whole game in high-concurrency serving: weights take a fixed slice of VRAM, and whatever is left is the budget for batched KV cache. A card with more capacity holds more concurrent sequences, which raises throughput on the same memory-bound decode loop. This is why an H200 or a Blackwell B200 (192GB, ~8TB/s) is what you reach for when the workload is long-context and high-concurrency — not because they compute harder, but because they hold more and stream faster.
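To make that budget concrete, here is a rough sizing sketch. The model shape (80 layers, 8 KV heads, head dimension 128) matches the published Llama-2-70B configuration. The 70GB FP8 weight footprint, FP16 cache entries, and 4,096-token sequences are assumptions chosen for illustration, and the helper names are hypothetical. It ignores activations, fragmentation, and framework overhead, so real numbers will be lower.

```python
def kv_bytes_per_token(n_layers: int, n_kv_heads: int, head_dim: int,
                       bytes_per_elem: int) -> int:
    """KV-cache bytes stored per token: one key and one value vector per layer."""
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem


def max_sequences(vram_bytes: float, weight_bytes: float,
                  seq_len: int, per_token_bytes: int) -> int:
    """How many full-length sequences fit in the VRAM left after the weights."""
    free = vram_bytes - weight_bytes
    if free <= 0:
        return 0
    return int(free // (seq_len * per_token_bytes))


# Llama-2-70B shape; FP16 cache entries (2 bytes) are an assumption.
per_token = kv_bytes_per_token(n_layers=80, n_kv_heads=8, head_dim=128,
                               bytes_per_elem=2)

WEIGHTS = 70e9   # assumption: ~70 GB of FP8 weights
SEQ_LEN = 4096   # assumption: every sequence runs to 4,096 tokens

for name, vram in (("H100 80GB", 80e9), ("H200 141GB", 141e9)):
    n = max_sequences(vram, WEIGHTS, SEQ_LEN, per_token)
    print(f"{name}: {per_token / 1024:.0f} KiB/token, room for {n} sequences")
```

With these inputs the H100 has room for 7 full-length sequences and the H200 for 52: the weights are the same fixed cost on both cards, so nearly all of the H200's extra 61GB goes to batch.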
Where each card actually lands
- A100 80GB — still a perfectly good workhorse at ~2.0TB/s and widely available. Its real limitation is that it predates Hopper, so it has no native FP8. You miss both the FP8 throughput and the FP8 KV-cache memory savings that let newer cards stretch further. Fine for BF16 inference; not the frontier.
- H100 SXM — the high-throughput serving default, and the card every benchmark is normalized against. FP8 via the Transformer Engine, 3.35TB/s, 80GB.
- H200 — the same chip with the memory it always wanted. The pick when context length or concurrency is your bottleneck, which for agent workloads it usually is.
- L40S — the seductive line item. Its Ada Tensor Core compute (~362 dense BF16 TFLOPS, with FP8 support on top) is respectable, and its price and 350W TDP are friendly. But at ~0.86TB/s GDDR6 and 48GB, it is bandwidth- and capacity-starved for large-model decode, with roughly a quarter of the bandwidth of an H100 SXM. Excellent for small models, prefill, and vision; a cost trap if you put a 70B model behind it and expect serving throughput.
The corollary buyers keep missing: a card that looks weaker on the FLOPS line (H200 vs anything compute-heavier) can win on serving, and a card with strong FLOPS but weak memory (L40S) will underperform its own spec sheet on big models. Pick the engine — vLLM, TensorRT-LLM, or TGI — for the software ergonomics. Pick the GPU for bandwidth and capacity, and let the FLOPS be a tiebreaker.
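As a rough way to apply that rule, the sketch below normalizes each card's bandwidth and capacity against the H100 SXM. The dictionary layout and function name are illustrative, and the numbers are the rounded figures from the list above, not a substitute for the datasheets.

```python
# (bandwidth in TB/s, capacity in GB), rounded figures from this article.
CARDS = {
    "A100 80GB": (2.0, 80),
    "H100 SXM": (3.35, 80),
    "H200": (4.8, 141),
    "L40S": (0.86, 48),
}


def relative_to(baseline: str, cards: dict = CARDS) -> dict:
    """Return each card's (bandwidth ratio, capacity ratio) versus the baseline."""
    base_bw, base_cap = cards[baseline]
    return {name: (bw / base_bw, cap / base_cap)
            for name, (bw, cap) in cards.items()}


ratios = relative_to("H100 SXM")

# Rank by bandwidth first, since that is what bounds decode speed.
for name, (bw_ratio, cap_ratio) in sorted(ratios.items(),
                                          key=lambda item: item[1][0],
                                          reverse=True):
    print(f"{name:<10} bandwidth x{bw_ratio:.2f}  capacity x{cap_ratio:.2f}")
```

Sorting on bandwidth and reading capacity alongside it puts the H200 first and the L40S last, the reverse of what a glance at price and headline compute might suggest.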
Spec figures are drawn from NVIDIA's published datasheets and product pages; the H200-vs-H100 throughput uplift is NVIDIA's own TensorRT-LLM benchmark. Cloud rental prices vary widely by provider and are deliberately omitted here as too volatile to cite as fact.



