There are exactly two ways to store a number in four bits and still call it a float, and the AI industry has, characteristically, shipped both at once. NVFP4 and MXFP4 are not rival encodings of the value — they encode each weight identically. What they disagree about is the receipt: the little scaling factor stapled to each block of numbers. That receipt is the entire ballgame, and almost nobody leads with it.

So let's lead with it.

The part that's the same: E2M1#

Both formats store each individual weight as E2M1 — one sign bit, two exponent bits, one mantissa bit. Four bits, total. That gives you a grid of representable magnitudes running from a smallest subnormal of ±0.5 up to a largest normal of ±6.0, per the OCP Microscaling spec. It is a comically small number of distinct values. On its own, E2M1 is useless for representing a weight matrix whose entries span several orders of magnitude.

The reason it works anyway is the same reason INT4 works: you don't ask one number to cover the whole range. You chop the tensor into small contiguous blocks, and each block shares a scale factor. Multiply the tiny E2M1 codes by the block's scale and you recover real magnitudes. This is micro-scaling, and it's where NVFP4 and MXFP4 part ways.

The part that's different: the block scale#

MXFP4 — the "MX" is OCP's Microscaling — uses a block of 32 elements sharing a single scale in E8M0 format: eight exponent bits, zero mantissa. That is a power-of-two scale and nothing but. It can say "multiply this block by 2^k," and it cannot say anything in between. Cost: 8 bits of scale amortized over 32 values, so about 4.25 bits per weight.

NVFP4 makes two changes. First, it halves the block to 16 elements. Second, and more important, it stores the per-block scale as a real FP8 E4M3 float — four exponent bits, three mantissa bits — so the scale itself can land between powers of two. Then it adds a second, global FP32 scale per tensor to set the overall range. Two E4M3 scale bytes per 16 values works out to roughly 4.5 bits per weight, per NVIDIA's inference write-up.

Smaller blocks see less of the tensor, so one outlier poisons fewer neighbors — and a scale with a mantissa can actually fit the block it's scaling instead of rounding to the nearest power of two.

That sentence is the whole article. A power-of-two scale (MXFP4) frequently has to round up to cover a block's largest value, which throws away precision on everything smaller in the block. A mantissa-bearing scale (NVFP4) hugs the real maximum. Do that across a 16-wide window instead of a 32-wide one and outliers stay quarantined. The extra quarter-bit of metadata is not free, but it is cheap relative to what it buys.

Why a float beats an integer here#

If you've shipped FP8 vs INT8 vs INT4 quantization, the instinct is to ask why not just use INT4 and skip the exotic float. The answer is dynamic range. INT4 lays down a uniform grid — sixteen evenly spaced steps. Weight distributions are not uniform; they're peaked near zero with a long tail. A uniform grid spends most of its sixteen codes in regions where there's little to represent and starves the tail. E2M1's two exponent bits give it a non-uniform, roughly log-spaced grid: fine resolution near zero, coarse resolution out where the big values live — which is exactly the shape of a weight histogram. Same four bits, better-placed codes. That's the entire pitch for FP4 over INT4, and it's why the integer-quant tooling you know from GGUF vs GPTQ vs AWQ doesn't transfer one-to-one.

The silicon makes it real#

None of this matters at inference speed unless the hardware does FP4 natively, and that's the Blackwell story. Its 5th-generation Tensor Cores and second-gen Transformer Engine execute 4-bit floating-point matmuls in silicon, with the micro-tensor scaling baked in. A single B200 delivers on the order of 20 PFLOPS of FP4; a GB200 NVL72 rack pushes roughly 720 PFLOPS, per NVIDIA's Blackwell architecture page. On Hopper and older, there is no native FP4 path — you're either upcasting on the fly (how gpt-oss runs MXFP4 weights on an 80GB H100) or back on INT4. If you're sizing a deployment, that hardware floor changes the math in how much VRAM to serve an LLM: four-bit weights roughly halve the parameter footprint versus FP8.

Does the accuracy actually hold?#

This is where NVFP4 earns the extra quarter-bit. In NVIDIA's NVFP4 pretraining study, a 12B hybrid Mamba-Transformer was trained on 10 trillion tokens entirely in NVFP4, with validation loss tracking the FP8 baseline to within about 1% through the stable phase. Downstream, the gaps are noise-level: MMLU 76.57 vs 77.36, MMLU-Pro 62.58 vs 62.62, with NVFP4 occasionally ahead (AGIEval English CoT 70.31 vs 67.01). It leans on Random Hadamard Transforms, 2D block scaling, and stochastic rounding to get there — this is not a free lunch you get by flipping a dtype. MXFP4, with its coarser power-of-two scale, is the format that tends to lose the points NVFP4 holds; in the same family of experiments NVFP4 reaches comparable loss with fewer tokens.


The case for MXFP4 anyway#

And yet MXFP4 is not the loser here, because accuracy isn't the only axis. MXFP4 is the open, vendor-neutral standard — ratified by the Open Compute Project with AMD, Arm, Intel, Meta, Microsoft, NVIDIA and Qualcomm all in the room. It runs on AMD's MI-series too. And it's already in the wild in a way NVFP4 isn't: OpenAI's gpt-oss-120b ships pre-quantized in MXFP4, MoE weights and all, which is what lets a 117B-parameter model land on a single 80GB GPU. When the weights arrive in MXFP4, you serve MXFP4. The format choice was made upstream.

Software, briefly#

On the NVFP4 side the toolchain is mature enough to use today. NVIDIA TensorRT Model Optimizer and vLLM's llm-compressor both produce NVFP4 checkpoints (W4A4 — four-bit weights and activations), export to a unified Hugging Face checkpoint, and load directly into vLLM, TensorRT-LLM, and SGLang. NVIDIA publishes pre-quantized NVFP4 checkpoints (e.g. nvidia/Llama-3.1-8B-Instruct-NVFP4). For MXFP4, you're mostly consuming what ships rather than producing it.

The verdict#

If you're on Blackwell and quantizing your own weights for production, NVFP4 is the default. It's the accuracy-preserving choice, the tooling is real, and the silicon was designed for it. Pick MXFP4 when the model already ships in it, when you're targeting non-NVIDIA accelerators, or when you want a format that won't strand you on one vendor's roadmap. The two formats aren't really competing for the same job — one optimizes for how good four bits can be on the hardware that made them, the other for how widely four bits can travel. Read the receipt on the block, and the choice makes itself.