Every serving stack eventually arrives at the same fork: the model is trained, the GPUs are rented, and someone has to decide how many bits each number gets. The menu reads FP8, INT8, INT4 — three rows that look like three settings on one quality dial, from "barely lossy" to "aggressively small." That framing is the mistake. These formats are not three points on one axis. They pay down different bottlenecks, and if you pick by accuracy leaderboard alone you will optimize the wrong one.

What each format is actually quantizing

Start with the detail the marketing hides: what gets converted.

When people say INT4, they almost always mean weight-only quantization — the W4A16 recipe behind AWQ and GPTQ. The weights drop to about half a byte each; the activations flowing through the network stay 16-bit. (This is the same weight-only logic as the GGUF/GPTQ/AWQ formats local runners ship.) FP8 and INT8, by contrast, are usually W8A8 — both the weights and the activations are 8-bit. That single difference decides everything downstream.

INT4 makes the model smaller. FP8 makes the model's math faster. Those are not the same purchase, and most teams buy one while thinking they bought the other.

The two bottlenecks, and which format touches which

LLM inference has two phases with opposite characters. Prefill — chewing through the prompt — is compute-bound: the GPU's tensor cores are the limit. Decode — emitting tokens one at a time — is memory-bandwidth-bound: the chip spends most of its time streaming weights out of VRAM, not multiplying.

Now map the formats onto that:

So FP8 and INT4 aren't really competitors. FP8 is the faster-math play; INT4 is the fit-it-and-feed-decode play. INT8 is what you reach for when FP8 hardware isn't there.

The exponent is the whole story

Here is the part worth keeping. Why does FP8 quantize activations more gracefully than INT8, and NVFP4 beat plain INT4? Because floating-point formats spend bits on an exponent. INT8 lays 256 values down at even spacing; FP8 places them exponentially, with fine resolution near zero and a long reach toward the extremes. Transformer activations are full of outliers — a few values orders of magnitude larger than the rest — and a format with wide dynamic range swallows them where an evenly-spaced integer grid clips or coarsens them. That's the mechanism, not a vibe: it's why FP8 lands ~0.3–0.5 points off FP16 while INT8 gives up ~0.5–0.9, and why NVIDIA's NVFP4 — a 4-bit float with a two-level FP8/FP32 scaling scheme — recovers accuracy that naive INT4 leaves on the floor.

The counterintuitive footnote

If floating point is so good for transformers, why did anyone ever favor integers? Because in dedicated silicon at a fixed accuracy, INT8 is the more area- and power-efficient choice — that was the headline of a careful 2023 Qualcomm study comparing FP8 and INT8 for inference. FP8 didn't win the deployment war on intrinsic efficiency. It won because NVIDIA put FP8 tensor cores in Hopper, and because FP8 is the format you can post-train-quantize to without elaborate per-channel calibration. The hardware made the format, not the other way around — which is exactly why your own hardware, not a benchmark table, should make your choice.

The decision

Pick the format that matches your bottleneck and your silicon. The accuracy delta is the last question, not the first.