Every serving stack eventually arrives at the same fork: the model is trained, the GPUs are rented, and someone has to decide how many bits each number gets. The menu reads FP8, INT8, INT4 — three rows that look like three settings on one quality dial, from "barely lossy" to "aggressively small." That framing is the mistake. These formats are not three points on one axis. They pay down different bottlenecks, and if you pick by accuracy leaderboard alone you will optimize the wrong one.
What each format is actually quantizing
Start with the detail the marketing hides: what gets converted.
When people say INT4, they almost always mean weight-only quantization — the W4A16 recipe behind AWQ and GPTQ. The weights drop to about half a byte each; the activations flowing through the network stay 16-bit. (This is the same weight-only logic as the GGUF/GPTQ/AWQ formats local runners ship.) FP8 and INT8, by contrast, are usually W8A8 — both the weights and the activations are 8-bit. That single difference decides everything downstream.
INT4 makes the model smaller. FP8 makes the model's math faster. Those are not the same purchase, and most teams buy one while thinking they bought the other.
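To make "half a byte each, 16-bit activations" concrete, here is a minimal pure-Python sketch of the weight-only idea. It assumes plain symmetric round-to-nearest over one small group. That is a simplification for illustration, not the AWQ or GPTQ algorithm, and every function name and sample value below is hypothetical.

```python
# Weight-only (W4A16-style) sketch: symmetric 4-bit codes, two per byte.
# Assumption: simple absmax scaling per group; real AWQ/GPTQ are more elaborate.

def quantize_group(weights):
    """Map a group of floats to signed 4-bit codes in [-8, 7] plus one scale."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 7.0 if max_abs > 0 else 1.0
    codes = [max(-8, min(7, round(w / scale))) for w in weights]
    return codes, scale

def pack_nibbles(codes):
    """Store two 4-bit codes per byte: this is where 'half a byte each' comes from."""
    assert len(codes) % 2 == 0
    out = bytearray()
    for lo, hi in zip(codes[0::2], codes[1::2]):
        out.append((lo & 0xF) | ((hi & 0xF) << 4))
    return bytes(out)

def unpack_nibbles(packed):
    """Recover the signed codes, sign-extending each 4-bit field."""
    codes = []
    for byte in packed:
        for nib in (byte & 0xF, byte >> 4):
            codes.append(nib - 16 if nib >= 8 else nib)
    return codes

def dequantize_group(packed, scale):
    return [c * scale for c in unpack_nibbles(packed)]

def dot_w4a16(packed, scale, activations):
    """Weights are expanded back to floats first; the multiply-adds are not 4-bit."""
    weights = dequantize_group(packed, scale)
    return sum(w * a for w, a in zip(weights, activations))

# Illustrative values only.
weights = [0.12, -0.40, 0.07, 0.33, -0.21, 0.02, 0.28, -0.05]
activations = [1.0, 0.5, -2.0, 0.25, 1.5, -1.0, 0.75, 2.0]

codes, scale = quantize_group(weights)
packed = pack_nibbles(codes)

print(len(weights), "weights stored in", len(packed), "bytes")
print("exact dot product:", sum(w * a for w, a in zip(weights, activations)))
print("w4a16 dot product:", dot_w4a16(packed, scale, activations))
```

The storage shrinks, but `dot_w4a16` still does its arithmetic on full-width floats after unpacking. That is the sense in which INT4 weight-only buys size rather than faster math.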
The two bottlenecks, and which format touches which
LLM inference has two phases with opposite characters. Prefill — chewing through the prompt — is compute-bound: the GPU's tensor cores are the limit. Decode — emitting tokens one at a time — is memory-bandwidth-bound: the chip spends most of its time streaming weights out of VRAM, not multiplying.
Now map the formats onto that:
- INT4 weight-only shrinks the bytes you stream, so it speeds up decode and lets a 70B model fit on hardware it otherwise wouldn't. But the multiplies still run in FP16 after the weights are dequantized on the fly — so on prefill, and on any GPU without native 4-bit tensor cores, it buys you little compute speedup. Reported gains run ~2.5–2.7x over BF16, but that number lives almost entirely in decode-heavy, memory-bound workloads.
- FP8 W8A8 runs on the native FP8 tensor cores in Hopper and Blackwell, which offer roughly twice the peak matrix throughput of FP16, and it halves the weight bytes as well. That accelerates prefill and decode alike: typically 1.4–1.7x throughput at a quality cost under half a point on MMLU-Pro.
- INT8 W8A8 is the fallback for silicon with no FP8 tensor cores, such as NVIDIA's Ampere generation: the 8-bit matmuls still run on integer units, which nearly every accelerator has.
So FP8 and INT4 aren't really competitors. FP8 is the faster-math play; INT4 is the fit-it-and-feed-decode play. INT8 is what you reach for when FP8 hardware isn't there.
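The mapping above can be sketched as a back-of-the-envelope roofline estimate. The bandwidth and peak-throughput figures below are placeholders for a hypothetical accelerator, not any real datasheet. The model also ignores KV-cache traffic, scale metadata, attention FLOPs, and dequantization cost, so treat its ratios as idealized upper bounds rather than predictions.

```python
# Idealized roofline sketch: decode is bytes / bandwidth, prefill is FLOPs / peak.
# All hardware numbers are hypothetical; substitute your own.

PARAMS = 70e9  # the 70B example from the text

FORMATS = {
    "BF16":       {"weight_bytes": 2.0, "math_bits": 16},
    "INT4 W4A16": {"weight_bytes": 0.5, "math_bits": 16},  # dequantized before the multiply
    "FP8 W8A8":   {"weight_bytes": 1.0, "math_bits": 8},
}

MEM_BANDWIDTH = 3.0e12                  # bytes per second (assumed)
PEAK_FLOPS = {16: 1.0e15, 8: 2.0e15}    # dense FLOP/s (assumed: 8-bit doubles peak)

def weight_gb(fmt):
    """Weight footprint in gigabytes, ignoring scales and zero-points."""
    return PARAMS * FORMATS[fmt]["weight_bytes"] / 1e9

def decode_ms_per_token(fmt):
    """Batch-1 decode: every weight is streamed once per generated token."""
    return PARAMS * FORMATS[fmt]["weight_bytes"] / MEM_BANDWIDTH * 1e3

def prefill_ms(fmt, prompt_tokens):
    """Prefill: about 2 FLOPs per parameter per prompt token, bound by peak math."""
    flops = 2 * PARAMS * prompt_tokens
    return flops / PEAK_FLOPS[FORMATS[fmt]["math_bits"]] * 1e3

for fmt in FORMATS:
    print(f"{fmt:11s} weights={weight_gb(fmt):6.1f} GB  "
          f"decode={decode_ms_per_token(fmt):6.2f} ms/token  "
          f"prefill(2048)={prefill_ms(fmt, 2048):7.1f} ms")
```

In this toy model INT4 wins decode because it moves the fewest bytes, FP8 is the only row whose prefill improves, and INT4's prefill time is identical to BF16's. Measured speedups are smaller than these ideal ratios because of the overheads the sketch leaves out.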
The exponent is the whole story
Here is the part worth keeping. Why does FP8 quantize activations more gracefully than INT8, and NVFP4 beat plain INT4? Because floating-point formats spend bits on an exponent. INT8 lays 256 values down at even spacing; FP8 places them exponentially, with fine resolution near zero and a long reach toward the extremes. Transformer activations are full of outliers — a few values orders of magnitude larger than the rest — and a format with wide dynamic range swallows them where an evenly-spaced integer grid clips or coarsens them. That's the mechanism, not a vibe: it's why FP8 lands ~0.3–0.5 points off FP16 while INT8 gives up ~0.5–0.9, and why NVIDIA's NVFP4 — a 4-bit float with a two-level FP8/FP32 scaling scheme — recovers accuracy that naive INT4 leaves on the floor.
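A small experiment makes the spacing argument tangible. The sketch below enumerates the E4M3 flavor of FP8 (4 exponent bits, 3 mantissa bits, bias 7, largest finite value 448) next to a symmetric INT8 grid, then rounds a handful of made-up activations that include a single outlier. Per-tensor absmax scaling is assumed, which is the harshest case for the integer grid. The activation values and helper names are illustrative only.

```python
import bisect

def e4m3_values():
    """Enumerate every finite value of FP8 E4M3 (bias 7, one NaN code per sign)."""
    vals = set()
    for exp in range(16):
        for man in range(8):
            if exp == 15 and man == 7:
                continue  # reserved for NaN in this layout
            if exp == 0:
                v = (man / 8) * 2.0 ** -6             # subnormals
            else:
                v = (1 + man / 8) * 2.0 ** (exp - 7)  # normals
            vals.add(v)
            vals.add(-v)
    return sorted(vals)

def nearest(grid, x):
    """Round x to the closest value in a sorted grid."""
    i = bisect.bisect_left(grid, x)
    candidates = grid[max(0, i - 1): i + 1]
    return min(candidates, key=lambda g: abs(g - x))

def fake_quant(values, grid):
    """Scale so the largest magnitude hits the grid maximum, round, scale back."""
    scale = max(abs(v) for v in values) / grid[-1]
    return [nearest(grid, v / scale) * scale for v in values]

INT8_GRID = [float(i) for i in range(-127, 128)]  # evenly spaced
FP8_GRID = e4m3_values()                          # exponentially spaced

# Hypothetical activations: mostly small, plus one outlier.
acts = [0.02, -0.05, 0.3, -0.7, 1.1, 2.5, 100.0]

rel_errors = {}
for name, grid in (("INT8", INT8_GRID), ("FP8 E4M3", FP8_GRID)):
    deq = fake_quant(acts, grid)
    rel_errors[name] = [abs(d - a) / abs(a) for a, d in zip(acts, deq)]
    print(name, [round(r, 3) for r in rel_errors[name]])
```

With these inputs the outlier stretches the integer grid so far that the three smallest values round to zero, while the E4M3 grid keeps every value within a few percent. Finer-grained scaling (per channel or per token) narrows that gap in practice, which is part of why INT8 remains usable.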
The counterintuitive footnote
If floating point is so good for transformers, why did anyone ever favor integers? Because in dedicated silicon at a fixed accuracy, INT8 is the more area- and power-efficient choice — that was the headline of a careful 2023 Qualcomm study comparing FP8 and INT8 for inference. FP8 didn't win the deployment war on intrinsic efficiency. It won because NVIDIA put FP8 tensor cores in Hopper, and because FP8 is the format you can post-train-quantize to without elaborate per-channel calibration. The hardware made the format, not the other way around — which is exactly why your own hardware, not a benchmark table, should make your choice.
The decision
- On Hopper or Blackwell, default to FP8. Best quality-per-speed, least fuss, accelerates both phases.
- When VRAM is the binding constraint (squeezing a big model onto few cards, or maximizing memory-bound decode), go 4-bit: INT4 weight-only in general, or NVFP4 if you're on Blackwell and want to quantize activations too.
- On hardware without FP8 tensor cores, use INT8 as your accelerated 8-bit option.
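The three rules above can be written down as a tiny helper. This is only a sketch: the function name and flags are hypothetical, and the precedence (a VRAM constraint overrides the FP8 default) is one reading of the list rather than something it states outright.

```python
def pick_format(has_fp8_tensor_cores, vram_bound,
                is_blackwell=False, quantize_activations=False):
    """Return a quantization format for a deployment, following the rules above.

    Assumption: when VRAM is the binding constraint, 4-bit wins even on
    FP8-capable hardware.
    """
    if vram_bound:
        if is_blackwell and quantize_activations:
            return "NVFP4"
        return "INT4 weight-only (W4A16)"
    if has_fp8_tensor_cores:
        return "FP8 (W8A8)"
    return "INT8 (W8A8)"

# Example: FP8-capable node with VRAM to spare, then an older card that is short on memory.
print(pick_format(has_fp8_tensor_cores=True, vram_bound=False))
print(pick_format(has_fp8_tensor_cores=False, vram_bound=True))
```

Note what the signature leaves out: there is no accuracy argument. The inputs are the bottleneck and the silicon.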
Pick the format that matches your bottleneck and your silicon. The accuracy delta is the last question, not the first.



