The MI300X-versus-H100 question almost always gets asked the wrong way. People line up the FLOPS, find them close enough, and conclude it's a wash with NVIDIA's software as the tiebreaker. That framing imports an assumption from training — that inference is compute-bound — and it's wrong about the part of the job that actually runs up your bill.

LLM inference has two phases. Prefill ingests the prompt in parallel and is compute-bound; it's the brief part. Decode generates tokens one at a time, and for each token it must stream the entire model's weights plus a growing KV cache out of memory while doing comparatively little arithmetic. Decode is memory-bandwidth-bound, and decode is most of what a serving fleet does. So the spec that governs throughput isn't peak TFLOPS — it's how much memory you have and how fast you can read it.

On that axis the two chips aren't close. The AMD MI300X carries 192 GB of HBM3 at 5.3 TB/s. The NVIDIA H100 carries 80 GB at 3.35 TB/s. That's 2.4x the capacity and roughly 1.6x the bandwidth, on the dimension that decides decode.
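The bandwidth argument reduces to one division, sketched below. This is a rough ceiling, not a benchmark: it assumes every decode step reads all weights and the whole KV cache exactly once, and it ignores compute, kernel efficiency, and interconnect. The 70 GB weight footprint and 5 GB KV cache are hypothetical workload numbers chosen so the model fits on both chips; the capacity and bandwidth figures are the ones quoted above.

```python
# Back-of-the-envelope decode ceiling: each decode step has to read all
# weights plus the KV cache once, so step rate <= bandwidth / bytes read.

GB = 1e9
TB = 1e12

# Capacity and bandwidth figures as quoted in the text.
GPUS = {
    "MI300X": {"capacity_gb": 192, "bandwidth_tbs": 5.3},
    "H100": {"capacity_gb": 80, "bandwidth_tbs": 3.35},
}


def decode_steps_per_sec(weight_gb: float, kv_gb: float, bandwidth_tbs: float) -> float:
    """Upper bound on decode steps per second for one GPU (memory traffic only)."""
    bytes_per_step = (weight_gb + kv_gb) * GB
    return bandwidth_tbs * TB / bytes_per_step


if __name__ == "__main__":
    # Hypothetical workload: 70 GB of weights (e.g. a 70B model at 8 bits)
    # plus 5 GB of KV cache. Both numbers are assumptions for illustration.
    weight_gb, kv_gb = 70.0, 5.0
    for name, spec in GPUS.items():
        ceiling = decode_steps_per_sec(weight_gb, kv_gb, spec["bandwidth_tbs"])
        print(f"{name}: at most {ceiling:.1f} decode steps/s")
```

Each step emits one token per sequence in the batch, so the tokens-per-second ceiling is this figure times the batch size. The ratio between the two chips is just the bandwidth ratio, which is the point: nothing about peak TFLOPS enters the calculation.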

What capacity actually buys

More memory isn't a vanity number; it changes the topology of how you serve. A 70B model in FP16 needs ~140 GB just for weights. That doesn't fit on one 80 GB H100, so you split it across two with tensor parallelism. On a single MI300X, it fits. A 405B model follows the same pattern at larger scale: at FP8 its ~405 GB of weights wants at least six H100s against three MI300X, and quantized to 4 bits (~203 GB) it wants three or four H100s against two MI300X. (Treat the exact GPU counts as illustrative, since they shift with quantization and KV budget, but the direction is firm.)
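The GPU counts can be sanity-checked with weights-only arithmetic. The sketch below is a floor, not a deployment plan: it ignores KV cache, activations, and runtime overhead, and real tensor-parallel degrees are commonly rounded up to a power of two. The function name and the precision table are illustrative choices, not taken from any serving framework.

```python
import math

# Per-GPU memory in GB, as quoted in the text.
H100_GB = 80
MI300X_GB = 192

# Bytes per parameter for a few common precisions (illustrative table).
BYTES_PER_PARAM = {"fp16": 2.0, "fp8": 1.0, "int4": 0.5}


def min_gpus_for_weights(params_billion: float, precision: str, gpu_gb: float) -> int:
    """Smallest GPU count whose combined memory holds the weights alone."""
    # 1e9 params * bytes/param / 1e9 bytes per GB = GB of weights.
    weight_gb = params_billion * BYTES_PER_PARAM[precision]
    return math.ceil(weight_gb / gpu_gb)


if __name__ == "__main__":
    for params, precision in [(70, "fp16"), (405, "fp8"), (405, "int4")]:
        h100 = min_gpus_for_weights(params, precision, H100_GB)
        mi300x = min_gpus_for_weights(params, precision, MI300X_GB)
        print(f"{params}B {precision}: H100 x{h100}, MI300X x{mi300x}")
```

Because this counts weights only, any real deployment needs at least this many GPUs and usually more once the KV cache is budgeted in.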

Every time you avoid a shard, you avoid the cross-GPU synchronization that tensor parallelism demands on every layer. That all-reduce traffic is pure overhead, and it scales the wrong way. Fewer GPUs holding the same model means less of your decode time spent waiting on the interconnect — and it means the headroom you'd have spent on weights can go to a longer KV cache or a bigger batch, both of which raise tokens-per-second. (This is the same lever continuous batching pulls in software — more concurrent sequences in flight — except here you're buying the room in hardware.) The 192 GB shows up twice: once letting the model fit, once letting the batch grow.
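To put a number on the second appearance of that capacity, here is a minimal sketch of how many tokens of KV cache fit in whatever memory the weights leave free. The model shape (80 layers, 8 KV heads, head dimension 128, FP16 cache) is an assumption modeled on a Llama-2-70B-style architecture with grouped-query attention; other models will differ. The estimate also ignores activations, fragmentation, and framework overhead, so real budgets are smaller.

```python
def kv_bytes_per_token(layers: int, kv_heads: int, head_dim: int,
                       bytes_per_elem: int = 2) -> int:
    """KV cache bytes stored per token; the factor 2 covers keys and values."""
    return 2 * layers * kv_heads * head_dim * bytes_per_elem


def kv_token_budget(total_gb: float, weight_gb: float, per_token_bytes: int) -> int:
    """Tokens of KV cache that fit in the memory left after loading weights."""
    free_bytes = (total_gb - weight_gb) * 1e9
    return int(free_bytes // per_token_bytes)


if __name__ == "__main__":
    # Assumed Llama-2-70B-style shape with an FP16 KV cache.
    per_token = kv_bytes_per_token(layers=80, kv_heads=8, head_dim=128)
    weight_gb = 140  # 70B parameters at FP16, as in the text

    one_mi300x = kv_token_budget(192, weight_gb, per_token)
    two_h100 = kv_token_budget(2 * 80, weight_gb, per_token)

    print(f"KV bytes per token: {per_token}")
    print(f"1x MI300X: room for about {one_mi300x} cached tokens")
    print(f"2x H100:   room for about {two_h100} cached tokens")
```

The token budget is shared across every sequence in flight, so dividing it by the context length gives a rough cap on batch size. Under these assumptions the single MI300X has more room for cache than the pair of H100s, and it gets there without any all-reduce.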

Peak FLOPS is a training argument. For the decode loop that dominates inference, the GPU with more, faster memory is usually holding the better hand.

The tax nobody could ignore

So why hasn't everyone switched? Because for a long time the silicon wasn't the bottleneck — the software was. SemiAnalysis's December 2024 teardown was brutal and fair: the MI300X was "not usable out of the box," trailed the H100 and H200 by more than 2.5x in real throughput on then-current public builds, and the gap traced to immature ROCm kernels and thin internal testing, not to the hardware. A fixed attention bug took months to reach AMD's public PyTorch builds. The chip was strong; the stack wasn't ready to spend it.

That is the part of the story that has actually moved. Through 2025 and into 2026, AMD shipped AITER kernels and FP8 GEMM through hipBLASLt, and — more importantly — the open serving engines stopped treating ROCm as a port and started treating it as a target. vLLM now calls ROCm first-class; its AITER attention backend reports 2.7–4.4x over the legacy ROCm attention path. SGLang ships official ROCm images with day-zero support for models like DeepSeek-R1. The tax isn't zero — peak numbers still want knob-twiddling — but it's a fraction of what it was when the CUDA-moat narrative set.

The benchmarks track the thaw. SemiAnalysis's own follow-up found that for most scenarios the MI300X still wasn't beating an H200 on performance or perf-per-dollar, except for the largest models, Llama-3 405B and DeepSeek-V3 671B, where it beat the H100 on both. And on MLPerf Inference v5.0, a 32x MI300X cluster turned in 103,182 tokens/sec offline on Llama-2-70B, roughly 24% over the prior 32x H100 result, with clean scaling from 8 to 32 GPUs.

So which one

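The decision splits cleanly by what you're serving. For the largest models, where capacity lets you drop shards, the MI300X has the benchmark results behind it; for most other scenarios, the same benchmarks still favor staying on NVIDIA.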

And calibrate the comparison: if memory is your reason for looking at AMD, the honest NVIDIA rival is the H200 (141 GB, 4.8 TB/s), not the 80 GB H100. It closes roughly half of the capacity gap (141 GB against 192 GB) and most of the bandwidth gap (4.8 TB/s against 5.3 TB/s) without leaving CUDA, which is exactly the trade the whole question turns on. With Blackwell already shipping, the memory race is the one to watch; the FLOPS race was never the one that mattered for the decode loop.