Read the name of NVIDIA's new flagship open model and you already know the pitch: Nemotron 3 Ultra, 550B-A55B. Five hundred fifty billion parameters total, fifty-five billion active on any given token. That is a 10:1 total-to-active ratio, and by mid-2026 it is not, on its own, interesting. DeepSeek and Kimi have shipped models sparser than that. If the story were "big pool of experts, small slice per token," Nemotron 3 would be a footnote in a crowded quarter.
The backbone is a hybrid Mamba-Transformer, which is efficient but by now familiar. The story is one layer down, in how those experts are addressed. NVIDIA calls it Latent MoE, and it is the rare architectural idea that changes an economic constant rather than a benchmark number.
The tax nobody advertises
Standard Mixture-of-Experts models sell a comforting arithmetic: you only pay for the experts you activate. A 550B model that lights up 55B per token should cost about what a dense 55B model costs. In FLOPs, roughly true. On real hardware, not quite: the binding constraint on MoE inference is usually memory bandwidth, not compute.
Every expert is a distinct block of weights sitting in HBM. To activate it, you stream it onto the compute units. Add more experts and you add more distinct weight blocks to move, more routing to arbitrate, more scatter-gather across the interconnect. The dense-model comparison quietly assumes those weights are free to fetch. They are not. This is why "just add more experts" — the obvious way to buy capacity — runs into a wall that has nothing to do with the FLOP budget you were reasoning about.
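A rough way to see the wall: each token activates only its own few experts, but different tokens in a batch pick different ones, so the number of distinct weight blocks that must be streamed grows with the batch. The sketch below assumes uniform, independent routing (real routers are neither), and the expert counts are hypothetical values chosen only for illustration, not Nemotron's configuration.

```python
def expected_distinct_experts(num_experts: int, top_k: int, batch_tokens: int) -> float:
    """Expected number of distinct experts touched by a batch of tokens.

    Assumption (illustrative only): every token picks top_k of num_experts
    uniformly at random, independently of the other tokens.
    """
    if not 0 < top_k <= num_experts:
        raise ValueError("top_k must be between 1 and num_experts")
    if batch_tokens < 0:
        raise ValueError("batch_tokens must be non-negative")
    # Chance that one given expert is picked by no token in the batch.
    p_untouched = (1.0 - top_k / num_experts) ** batch_tokens
    return num_experts * (1.0 - p_untouched)


# Hypothetical layer: 128 experts, 8 active per token.
for batch in (1, 8, 64):
    touched = expected_distinct_experts(num_experts=128, top_k=8, batch_tokens=batch)
    print(f"batch={batch:3d}  active per token=8  distinct experts streamed ~ {touched:.1f}")
```

Under these assumptions the per-token FLOPs never change, yet a modest batch already pulls most of the layer's experts through memory.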
A standard MoE pays its tax once per activated expert. Latent MoE pays it once, in the shared projection, and then the experts are cheap.
What Latent MoE actually moves
Latent MoE's move is to stop doing expert work in the full-width hidden state. Instead, the token is first projected into a smaller, shared latent representation; routing and the experts themselves operate there; the result is projected back up to token space. The experts live in the compressed room, not the full one.
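The data flow described above can be sketched in a few lines of dependency-free Python. This is a minimal illustration of the project-down, route, run-experts, project-up pattern, not NVIDIA's implementation: the class name `LatentMoE`, the ReLU activation, the softmax-over-top-k gating, and every dimension are assumptions made for the example.

```python
import math
import random


def matvec(weights, vector):
    """Multiply a row-major matrix (list of rows) by a vector."""
    return [sum(w * v for w, v in zip(row, vector)) for row in weights]


def rand_matrix(rows, cols, rng):
    """Small random matrix, scaled so activations stay well-behaved."""
    scale = 1.0 / math.sqrt(cols)
    return [[rng.gauss(0.0, scale) for _ in range(cols)] for _ in range(rows)]


class LatentMoE:
    """Toy MoE layer whose router and experts live in a narrow latent space."""

    def __init__(self, d_model, d_latent, d_ff, num_experts, top_k, seed=0):
        rng = random.Random(seed)
        self.top_k = top_k
        self.down = rand_matrix(d_latent, d_model, rng)       # shared: token -> latent
        self.up = rand_matrix(d_model, d_latent, rng)         # shared: latent -> token
        self.router = rand_matrix(num_experts, d_latent, rng)
        # Each expert is a two-matrix FFN sized by d_latent, not d_model.
        self.experts = [
            (rand_matrix(d_ff, d_latent, rng), rand_matrix(d_latent, d_ff, rng))
            for _ in range(num_experts)
        ]

    def forward(self, x):
        z = matvec(self.down, x)                # 1. project down once
        logits = matvec(self.router, z)         # 2. route in latent space
        chosen = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)
        chosen = chosen[: self.top_k]
        peak = max(logits[i] for i in chosen)
        exps = [math.exp(logits[i] - peak) for i in chosen]
        gates = [e / sum(exps) for e in exps]   # softmax over the chosen experts
        mixed = [0.0] * len(z)
        for gate, idx in zip(gates, chosen):    # 3. run only the chosen experts
            w_in, w_out = self.experts[idx]
            hidden = [max(0.0, h) for h in matvec(w_in, z)]
            out = matvec(w_out, hidden)
            mixed = [acc + gate * o for acc, o in zip(mixed, out)]
        return matvec(self.up, mixed), chosen   # 4. project back up once


layer = LatentMoE(d_model=16, d_latent=4, d_ff=8, num_experts=8, top_k=2)
output, chosen = layer.forward([0.1] * 16)
print(f"experts used: {chosen}, output width: {len(output)}")
```

The only full-width matrices are `down` and `up`; everything indexed by expert is sized by `d_latent`, which is the property the rest of the argument depends on.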
The consequence is the whole point. The expensive, bandwidth-bound projection between token space and latent space is paid once, shared across all experts, rather than re-paid for every expert you touch. Inside the latent space, an individual expert is small and cheap to address. So the quantity that used to scale painfully — number of experts — is now largely decoupled from the quantity you actually pay for. NVIDIA's reports put the trade concretely: roughly 4x more experts at the same inference cost. You are not buying more experts by spending more bandwidth; you are buying them by spending the fixed cost of the latent projection you were already paying.
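To make the trade concrete, here is a back-of-envelope count of the weights a single token touches in one layer. All dimensions are hypothetical, and the 4x compression ratio is picked deliberately to echo the reported figure rather than taken from NVIDIA's configuration. The sketch also assumes two-matrix expert FFNs, an unchanged expert hidden width, and ignores biases and the router.

```python
def expert_ffn_params(d_in: int, d_ff: int) -> int:
    """Weights in one two-matrix expert FFN, d_in -> d_ff -> d_in (biases ignored)."""
    return 2 * d_in * d_ff


def standard_moe_weights(d_model: int, d_ff: int, active_experts: int) -> int:
    """Weights touched per token when experts operate at full hidden width."""
    return active_experts * expert_ffn_params(d_model, d_ff)


def latent_moe_weights(d_model: int, d_latent: int, d_ff: int, active_experts: int) -> int:
    """Weights touched per token when experts operate in a shared latent space."""
    shared_projections = 2 * d_model * d_latent  # down + up, paid once per token
    return shared_projections + active_experts * expert_ffn_params(d_latent, d_ff)


# Hypothetical sizes: the latent space is 4x narrower than the hidden state.
D_MODEL, D_LATENT, D_FF = 4096, 1024, 8192

standard = standard_moe_weights(D_MODEL, D_FF, active_experts=8)
latent = latent_moe_weights(D_MODEL, D_LATENT, D_FF, active_experts=32)

print(f"standard MoE,  8 active experts: {standard:,} weights per token")
print(f"latent MoE,   32 active experts: {latent:,} weights per token")
print(f"latent / standard: {latent / standard:.4f}")
```

With these made-up numbers, four times as many active experts cost under 2% more weight traffic, and the entire difference is the shared projection.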
That is a different kind of claim than "we scored higher." It says the price of capacity, in this architecture, is lower — and capacity is exactly what a Mixture-of-Experts is for.
Why an agent desk cares
The benchmarks are strong and mostly beside the point. Ultra posts 48 on the Artificial Analysis Intelligence Index, the highest of any US open-weight model at release (ahead of Gemma 4 31B at 39 and GPT-OSS 120B at 33), with MMLU above 90%, a ProfBench score of 56.0 that ties the trillion-parameter Kimi-K2.6, and an IOI 2025 score of 570.0, in the neighborhood of a top-three human result in competitive programming. Good numbers. Not the reason to read this.
The reason is the number NVIDIA buries in the efficiency section: about 30% lower per-task token cost on long-running workloads, at 300+ tokens/sec, with a 5x NVFP4 speedup on Blackwell. Agentic systems are the workload where that compounds. An agent is not one clever answer; it is a long trajectory — dozens of turns, tool calls, re-reads of a growing context — and per-task cost is multiplied across every step and every run. A 30% reduction there is not a rounding error; it's the difference between a workflow that pencils out at scale and one that doesn't.
And Latent MoE is what makes that 30% structural rather than promotional. The model is small where it counts (55B active, latent-space experts) and large where it helps (550B of specialized knowledge), pretrained in NVFP4 so the Blackwell speedup is native, not bolted on.
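The compounding is easy to put into numbers. The sketch below totals the tokens one agent task processes when every turn re-reads a growing context, then applies the roughly 30% per-task reduction quoted above. The turn counts, context sizes, task volume, and price are hypothetical placeholders, and the model assumes no prompt caching and a single blended price for input and output tokens.

```python
def task_tokens(turns: int, initial_context: int, added_per_turn: int, output_per_turn: int) -> int:
    """Total tokens processed across one agent task.

    Assumption: each turn re-reads the whole context so far, generates
    output_per_turn tokens, and then appends that output plus
    added_per_turn tokens of tool results to the context.
    """
    total = 0
    context = initial_context
    for _ in range(turns):
        total += context + output_per_turn
        context += added_per_turn + output_per_turn
    return total


def daily_cost(tokens_per_task: int, tasks_per_day: int, price_per_million_tokens: float) -> float:
    """Daily spend for a given per-task token count and a blended token price."""
    return tokens_per_task * tasks_per_day * price_per_million_tokens / 1_000_000


# Hypothetical workload: 40-turn tasks, 5,000 tasks a day, $0.50 per million tokens.
tokens = task_tokens(turns=40, initial_context=4_000, added_per_turn=1_500, output_per_turn=300)
baseline = daily_cost(tokens, tasks_per_day=5_000, price_per_million_tokens=0.50)
reduced = baseline * (1 - 0.30)  # the ~30% per-task reduction quoted in the text

print(f"tokens per task: {tokens:,}")
print(f"baseline daily cost: ${baseline:,.2f}")
print(f"with 30% lower per-task cost: ${reduced:,.2f}")
```

Because the context is re-read on every turn, per-task tokens grow roughly with the square of the turn count, which is why a percentage saving on long trajectories is worth far more in absolute terms than the same saving on single-shot queries.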
The part that isn't in the checkpoint
Here is the detail that should reframe how you read the release. NVIDIA didn't only publish Ultra's weights on Hugging Face; it published the training recipe and datasets. Combine that with an architecture co-designed with NVFP4 on Blackwell, and the strategy comes into focus. The weights are the giveaway. The moat is the hardware-aware design and the pipeline to reproduce it: an MoE variant whose economics only fully cash out on NVIDIA's own silicon.
So the useful way to evaluate Nemotron 3 is not "is it the smartest open model." It is: does an architecture that decouples expert count from inference cost change what you can afford to run? For a team pointing thousands of daily agent calls at a model it can specialize and self-host, that question is the whole product — and Latent MoE is the first answer in a while that isn't just a bigger number.



