A 70B model at FP8 weighs about 70GB. Drop it onto an H100 and it fits — barely, with roughly 10GB left over for the KV cache and overhead. Drop the same model onto a B200 and it occupies a little over a third of the card. That single difference — not FLOPS, not clock speed — is most of the story of why Blackwell is faster at inference, and it's the part the headline number hides.
The headline number is real but lazy: the B200 is widely quoted at 5-6x the inference throughput of an H100. Spheron's testing puts a single B200 at ~17,500 tokens/sec on Llama-2-70B against ~3,000 for an H100. The problem is that "5-6x" is two unrelated upgrades wearing one jersey, and treating it as a single dial is how teams overpay.
The two upgrades inside one number
Blackwell improved inference along two axes at once.
- Memory. The B200 carries 192GB of HBM3e at ~8TB/s. The H100 has 80GB of HBM3 at 3.35TB/s; the H200 sits between them at 141GB and 4.8TB/s (Civo, GMI Cloud). More capacity means bigger models and longer context fit without sharding; more bandwidth means the chip can feed its compute units faster.
- Compute. Blackwell adds native FP4 / NVFP4, a 4-bit format that roughly doubles compute density over FP8 — about 9,000 TFLOPS dense FP4 versus ~4,500 FP8 on the B200, per Civo. FP4 also halves the memory footprint of weights, the same way moving down the quantization ladder always trades precision for room.
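To make the capacity side concrete, here is a minimal sketch of the fit arithmetic. It uses the HBM sizes quoted above and assumes a flat bytes-per-parameter rule (2 for FP16, 1 for FP8, 0.5 for FP4). The function names are illustrative, and the rule ignores runtime overhead and the scale factors that block-scaled 4-bit formats store alongside the weights, so read the results as rough headroom rather than a guarantee.

```python
# HBM capacities are the figures quoted in the text above.
HBM_GB = {"H100": 80, "H200": 141, "B200": 192}

# Assumption: flat bytes per parameter, ignoring quantization scale factors.
BYTES_PER_PARAM = {"fp16": 2.0, "fp8": 1.0, "fp4": 0.5}


def weight_gb(params_billion: float, fmt: str) -> float:
    """Approximate weight footprint in GB (1 GB = 1e9 bytes)."""
    return params_billion * BYTES_PER_PARAM[fmt]


def headroom_gb(params_billion: float, fmt: str, card: str) -> float:
    """HBM left for KV cache and overhead; a negative value means the weights do not fit."""
    return HBM_GB[card] - weight_gb(params_billion, fmt)


if __name__ == "__main__":
    for card in HBM_GB:
        for fmt in ("fp8", "fp4"):
            free = headroom_gb(70, fmt, card)
            print(f"{card} {fmt}: {free:6.1f} GB free")
```

For a 70B model at FP8 this gives about 10GB free on an H100, 71GB on an H200, and 122GB on a B200. That headroom is what the KV cache has to live in.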
These two upgrades pay off in completely different workloads. And which one is doing the work decides what you should buy.
Memory-bound vs compute-bound is the whole question
Modern LLM serving is mostly memory-bound, not compute-bound. The Tensor Cores can chew through tokens faster than HBM can deliver weights and KV cache, so throughput is gated by bandwidth, not math. This is clearest in the decode phase, where every generated token requires streaming the full set of weights plus the growing KV cache from memory. Prefill, the bulk parallel pass over the prompt, is the compute-heavy half.
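A back-of-envelope way to see this: if each decode step has to stream the weights (and KV cache) from HBM once, then bandwidth divided by bytes read caps the number of steps per second. The sketch below assumes that simplified model, the 70GB FP8 footprint from the opening, and the bandwidth figures quoted above. The function name is illustrative, and real serving stacks overlap, batch, and cache in ways this ignores.

```python
# Peak HBM bandwidth in TB/s, as quoted in the text above.
BANDWIDTH_TB_S = {"H100": 3.35, "H200": 4.8, "B200": 8.0}


def decode_ceiling_steps_per_s(
    bandwidth_tb_s: float, weights_gb: float, kv_gb: float = 0.0
) -> float:
    """Upper bound on decode steps/sec if every step streams weights + KV cache once.

    Assumption: purely bandwidth-limited, with no compute cost and no overlap.
    """
    bytes_per_step = (weights_gb + kv_gb) * 1e9
    return bandwidth_tb_s * 1e12 / bytes_per_step


if __name__ == "__main__":
    for card, bw in BANDWIDTH_TB_S.items():
        ceiling = decode_ceiling_steps_per_s(bw, weights_gb=70)
        print(f"{card}: at most ~{ceiling:.0f} decode steps/sec")
```

Under these assumptions the H100 tops out near 48 decode steps per second, the H200 near 69, and the B200 near 114, whatever the FLOPS rating says. Batching raises tokens per step, since many sequences share one pass over the weights, but each sequence brings its own KV cache to read.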
The cleanest proof is the H200. It uses the same Hopper die as the H100, with identical FP8 compute: the same 3,958 TFLOPS (the with-sparsity datasheet figure), the same Tensor Cores (Spheron). The only thing that changed is memory: 141GB at 4.8TB/s instead of 80GB at 3.35TB/s. And yet the H200 delivers a real 1.4-1.6x in production-realistic serving, and up to 1.9x in NVIDIA's own Llama-2-70B test (GMI Cloud). Zero extra FLOPS, up to 1.9x throughput. That gap is the memory wall, made visible.
The H200 has the same compute as the H100 and serves up to 1.9x the tokens. If a generation can be that much faster while touching nothing but memory, you are not compute-bound.
Long context makes the wall taller. CloudRift's independent benchmark at 8K input / 8K output found the H100 loses 64% of its throughput versus short context, while the H200 loses only 47% (CloudRift) — exactly the signature of a KV-cache-heavy, bandwidth-starved workload, where the card with more HBM holds up better. If you've ever watched TTFT and inter-token latency degrade as sessions grow, this is the hardware reason.
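To get a feel for why long context bites, here is a rough KV-cache sizing sketch. The default shapes are an assumption: Llama-2-70B-style dimensions (80 layers, 8 KV heads under grouped-query attention, head dimension 128) with 16-bit cache entries. The model used in the benchmark is not stated here, so treat the output as an illustration of scale, not a reconstruction of that test.

```python
def kv_cache_gb(
    seq_len: int,
    batch: int = 1,
    n_layers: int = 80,       # assumption: Llama-2-70B-style depth
    n_kv_heads: int = 8,      # assumption: grouped-query attention
    head_dim: int = 128,
    bytes_per_elem: int = 2,  # assumption: FP16/BF16 cache entries
) -> float:
    """Approximate KV-cache size in GB (1 GB = 1e9 bytes)."""
    # Factor of 2 covers keys and values, stored per layer and per KV head.
    per_token_bytes = 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem
    return per_token_bytes * seq_len * batch / 1e9


def max_sequences(headroom_gb: float, seq_len: int) -> int:
    """How many full-length sequences fit in the HBM left after weights."""
    return int(headroom_gb // kv_cache_gb(seq_len))


if __name__ == "__main__":
    tokens = 8192 + 8192  # 8K input plus 8K output, taking 8K as 8192
    print(f"KV cache per sequence: {kv_cache_gb(tokens):.2f} GB")
    for card, free_gb in {"H100": 10, "H200": 71, "B200": 122}.items():
        print(f"{card}: {max_sequences(free_gb, tokens)} sequences")
```

With these shapes a single 16K-token sequence needs about 5.4GB of cache. Next to 70GB of FP8 weights, that leaves room for one such sequence on an H100, around 13 on an H200, and around 22 on a B200, which is the capacity half of the story. Every one of those cached bytes also has to be read back on each decode step, which is the bandwidth half.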
Where the B200 actually wins biggest
Put those together and the B200's advantage is largest precisely where the job is bandwidth-bound: large models, long context, and big-batch decode. Its 8TB/s and 192GB attack the exact bottleneck the H200 already showed is binding.
The FP4 half is more conditional. The doubled math rate only converts to throughput when you're compute-bound and can quantize without unacceptable quality loss, which is workload- and model-dependent, not free. NVIDIA's MLPerf Inference v5.0 numbers lean hard on FP4: GB200 delivered up to ~3.4x per-GPU versus an H200 system on Llama-3.1-405B, with per-GPU throughput of roughly 200 vs 70 tokens/sec (NVIDIA). Those rounded per-GPU figures work out closer to 2.9x, so treat 3.4x as the top of the range rather than the typical ratio. Read the result as what it is: an offline MLPerf run using FP4, on one of the largest models, in a tuned TensorRT-LLM-class stack. That is the best case, not the median case. Production-realistic serving lands lower. When sources cite "5-6x," they are blending the FP4 ceiling with the memory floor.
A more grounded production figure: on GPT-OSS-120B, Clarifai measured a single B200 sustaining ~7,236 tokens/sec at high concurrency, and concluded one B200 can replace roughly two H100s with lower latency and less complexity. That "replaces two H100s" is the number that should drive a purchase, not the 5-6x.
The per-dollar answer depends on utilization, not specs
Indicative cloud pricing runs about $2.00/GPU-hr for an H100, $2.60 for an H200, $4.00 for a B200 (GMI Cloud). So a B200 has to clear roughly two H100s on real throughput to win on cost — which it does for high-concurrency, large-model, long-context decode, and may not for a small model at low batch where you can't keep 192GB and FP4 busy. Peak specs are an upper bound; the per-dollar winner is set by utilization.
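The per-dollar arithmetic is simple enough to sketch. The example below plugs in the indicative hourly prices above and, purely for illustration, the Llama-2-70B throughput figures quoted earlier (~3,000 tokens/sec for an H100, ~17,500 for a B200). Those come from different sources, so pairing them is an assumption; substitute your own measured throughput and utilization. The function names are hypothetical.

```python
def usd_per_million_tokens(
    price_per_hr: float, tokens_per_sec: float, utilization: float = 1.0
) -> float:
    """Cost per million tokens served, given the fraction of each hour the GPU is busy."""
    if not 0 < utilization <= 1:
        raise ValueError("utilization must be in (0, 1]")
    tokens_per_hr = tokens_per_sec * utilization * 3600
    return price_per_hr / tokens_per_hr * 1e6


def breakeven_speedup(price_fast: float, price_slow: float) -> float:
    """Throughput multiple the pricier card must sustain to match cost per token."""
    return price_fast / price_slow


if __name__ == "__main__":
    h100 = usd_per_million_tokens(2.00, 3_000)
    b200_busy = usd_per_million_tokens(4.00, 17_500)
    b200_idle = usd_per_million_tokens(4.00, 17_500, utilization=0.3)
    print(f"H100, fully utilized: ${h100:.3f} per million tokens")
    print(f"B200, fully utilized: ${b200_busy:.3f} per million tokens")
    print(f"B200, 30% utilized:   ${b200_idle:.3f} per million tokens")
    print(f"Break-even speedup:   {breakeven_speedup(4.00, 2.00):.1f}x")
```

On these inputs a saturated B200 comes in at roughly a third of the H100's cost per token, but the same card at 30% utilization costs more per token than a fully busy H100. That is the sense in which utilization, not the spec sheet, sets the winner.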
| If your workload is… | Bottleneck | Sensible pick |
|---|---|---|
| Small model, modest batch, fits in 80GB | Often compute-bound | H100 — cheapest per token if utilized |
| 70B+, long context, high concurrency | Memory-bound (KV cache) | H200, or B200 if you can fill it |
| Largest models, big-batch, FP4-tolerant | Compute + memory | B200 — best case for the 3-4x |
The decision rule is short. Size the memory you actually need for weights plus KV cache at your context and batch; decide whether your hot path is prefill (compute) or decode (memory); then check whether the faster card stays busy enough to beat cheaper cards on tokens-per-dollar — the same arithmetic that governs self-hosting versus an API. For the older Hopper-and-below tradeoffs underneath all this, the H100/H200/A100/L40S breakdown still holds.
Blackwell didn't make one thing 6x faster. It widened the memory pipe and added a 4-bit gear. Buy the one your bottleneck is actually starving on — and price it on the tokens you'll really serve, not the tokens the datasheet promises.



