There is a question that decides the cost of a self-hosted agent before a single token is generated, and most teams answer it by accident: dense or mixture-of-experts? The two architectures fail in opposite directions, and the trap is that an MoE model looks strictly better on the spec sheet — bigger, smarter, cheaper to compute — right up until the VRAM invoice arrives.
One model, two parameter counts
A dense transformer runs every token through every weight. A mixture-of-experts model breaks that assumption: it replaces each dense feed-forward layer with a bank of parallel "experts" and adds a router that sends each token to only a few top-scoring experts (one in the Switch Transformer, two in Mixtral). The Switch Transformer made the canonical version of this point: you can scale the parameter count by orders of magnitude while holding per-token compute roughly constant, because hard routing means you never execute the experts a token didn't select.
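To make the routing step concrete, here is a minimal pure-Python sketch of top-k routing for a single token. It is an illustration under stated assumptions, not any particular model's implementation: the names `top_k_route` and `moe_layer` are hypothetical, the experts are toy functions, and the gate weights are a softmax over only the selected experts, which is one common choice among several.

```python
import math

def top_k_route(router_logits, k=2):
    """Return [(expert_index, gate_weight), ...] for the k highest-scoring experts."""
    # Rank experts by router score and keep the top k.
    ranked = sorted(range(len(router_logits)),
                    key=lambda i: router_logits[i], reverse=True)
    chosen = ranked[:k]
    # Softmax over the chosen experts only, so the gate weights sum to 1.
    # (Assumption: some routers instead softmax over all experts first.)
    peak = max(router_logits[i] for i in chosen)
    exps = [math.exp(router_logits[i] - peak) for i in chosen]
    total = sum(exps)
    return [(i, e / total) for i, e in zip(chosen, exps)]

def moe_layer(token, experts, router_logits, k=2):
    """Run one token through its selected experts and mix their outputs."""
    routes = top_k_route(router_logits, k)
    out = [0.0] * len(token)
    for idx, gate in routes:
        expert_out = experts[idx](token)  # unselected experts never execute
        out = [o + gate * y for o, y in zip(out, expert_out)]
    return out, [idx for idx, _ in routes]

# Toy usage: four hypothetical experts that each scale the input differently.
toy_experts = [lambda x, s=s: [s * v for v in x] for s in range(4)]
mixed, used = moe_layer([1.0, 2.0], toy_experts, [0.0, 3.0, 1.0, 2.0], k=2)
print(used)  # [1, 3]
```

Only two of the four experts run for this token, but all four have to exist in `toy_experts` because a different set of logits would pick a different pair.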
The consequence is that an MoE model has two sizes, and you have to track both:
- Total parameters — every weight in the model. All of it must be loaded into memory.
- Active parameters — the subset a single token actually flows through. This is what determines compute.
Mixtral 8×7B is 47B total but only ~13B active per token. DeepSeek-V3 is 671B total and 37B active. The number you brag about and the number you compute with are not the same number, and, crucially, it is the bigger one that you have to fit in VRAM.
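The gap between the two sizes is easy to put a number on. The sketch below uses only the parameter counts quoted above, in billions; the `MODELS` table and the `active_fraction` helper are illustrative names, not part of any library.

```python
# Parameter counts in billions, as quoted in the text above.
MODELS = {
    "Mixtral 8x7B": {"total_b": 47, "active_b": 13},
    "DeepSeek-V3": {"total_b": 671, "active_b": 37},
}

def active_fraction(total_b, active_b):
    """Share of the model's weights that a single token actually touches."""
    return active_b / total_b

for name, size in MODELS.items():
    frac = active_fraction(size["total_b"], size["active_b"])
    print(f"{name}: {frac:.1%} of weights active per token")
```

Roughly a quarter of Mixtral's weights are active for any given token, and only about a twentieth of DeepSeek-V3's. The sparser the model, the wider the gap between what you compute with and what you must keep loaded.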
The win is real: compute tracks the small number
Start with the good news, because it is genuinely good. Only the active experts run, so an MoE's inference compute tracks its active size. DeepSeek-V3, at 671 billion total parameters, costs about what a dense 37B model costs to run per token, while scoring like something vastly larger. Mixtral made the same trade legible a year earlier: it matched Llama-2 70B on most benchmarks with roughly one-fifth the active parameters, and therefore roughly one-fifth the inference compute per token. That is a better quality-per-FLOP curve than comparable dense models offer, and it is why the frontier open-weight releases are almost all sparse now.
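A back-of-the-envelope check on the compute claim, assuming the common rule of thumb of about 2 FLOPs per active weight per generated token (one multiply and one add). The rule ignores attention over the context and other overheads, so treat the output as a rough ratio rather than a measurement; `forward_flops_per_token` is a hypothetical helper.

```python
def forward_flops_per_token(active_params):
    """Approximate forward-pass FLOPs per token: ~2 per active weight (rule of thumb)."""
    return 2 * active_params

mixtral = forward_flops_per_token(13e9)       # ~13B active
llama2_70b = forward_flops_per_token(70e9)    # dense, so all 70B are active
deepseek_v3 = forward_flops_per_token(37e9)   # ~37B active
dense_671b = forward_flops_per_token(671e9)   # hypothetical dense model of the same total size

print(f"Mixtral vs Llama-2 70B: {mixtral / llama2_70b:.2f}x the compute")
print(f"DeepSeek-V3 vs dense 671B: {deepseek_v3 / dense_671b:.3f}x the compute")
```

The first ratio comes out a little under 0.2, which is where the "one-fifth" figure comes from.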
If you are running a high-throughput inference platform, this is the whole game. You keep the weights resident, you batch many concurrent requests across the expert bank, and you get frontier quality at a fraction of the FLOPs.
The trap is the other number
Here is what the spec sheet doesn't lead with. Which experts a token needs is decided at runtime, per token — so every expert has to be loaded and ready, all the time. You cannot keep only the "active" ones in memory, because the next token will route somewhere else. The full parameter count, idle experts and all, sits resident in VRAM.
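For DeepSeek-V3 the weights alone come to about 1,340GB in FP16 (671B parameters × 2 bytes) and about 335GB at 4 bits per weight. With an allowance for runtime overhead, estimates rise to roughly 1,500GB and 386GB respectively: a multi-GPU rack, not a single accelerator. A dense 37B model, with the same active compute, needs about 74GB of weights in FP16 and under 20GB at 4-bit, which fits on a single card.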
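The weights-only floor behind figures like these is simple arithmetic: parameters times bits per weight, divided by eight. The sketch below computes that lower bound. It deliberately leaves out KV cache, activations, and quantization metadata, so real deployments need more than it reports, and `weight_memory_gb` is a hypothetical helper name.

```python
def weight_memory_gb(total_params, bits_per_weight):
    """Weights-only memory in GB (1 GB = 1e9 bytes). Excludes KV cache and overhead."""
    return total_params * bits_per_weight / 8 / 1e9

for label, params in [("DeepSeek-V3 (671B total)", 671e9),
                      ("dense 37B", 37e9)]:
    fp16 = weight_memory_gb(params, 16)
    int4 = weight_memory_gb(params, 4)
    print(f"{label}: {fp16:,.1f} GB at FP16, {int4:,.1f} GB at 4-bit")
```

The memory function takes the total parameter count as its input. The active count never appears in it.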
An MoE computes like a small model and remembers like a giant one. You pay for the compute you use and the memory you don't.
It's a utilization decision wearing an architecture costume
This is the inversion that catches agent builders. The usual self-hosting intuition — "smaller is cheaper to serve" — quietly assumes dense models, where one size governs both compute and memory. MoE splits those, and the economics flip depending on how busy you keep the weights.
Spread that enormous resident memory across thousands of requests per second and the per-request memory cost rounds to nothing while the compute savings dominate: MoE wins, decisively. Pin the same weights for a single agent that makes occasional, bursty calls, and you are renting a multi-GPU box to keep mostly-idle experts warm for traffic that never fills them. Now the dense model in the MoE's active-parameter range (same compute, a fraction of the hardware) is the cheaper machine, and it isn't close.
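A toy amortization model shows how sharply the memory premium depends on traffic. Every number in it is a made-up placeholder, not a quoted price: the hourly rates for the two boxes are hypothetical, and the model assumes both machines can absorb the load, which a single card would not at the high end.

```python
def memory_premium_per_request(moe_box_per_hour, dense_box_per_hour, requests_per_hour):
    """Extra cost per request of keeping the larger MoE box resident."""
    return (moe_box_per_hour - dense_box_per_hour) / requests_per_hour

# Hypothetical placeholder rates in dollars per hour, for illustration only.
MOE_BOX = 32.0    # multi-GPU machine holding every expert
DENSE_BOX = 2.0   # single card holding a dense model of similar active size

for requests_per_hour in (10, 1_000, 1_000_000):
    premium = memory_premium_per_request(MOE_BOX, DENSE_BOX, requests_per_hour)
    print(f"{requests_per_hour:>9,} req/h -> ${premium:.5f} extra per request")
```

At ten requests an hour the idle experts cost dollars per call. At a million an hour the same hardware premium is a small fraction of a cent, and the compute savings are what you notice.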
So the real question isn't "is MoE better than dense?" It's "will this agent keep the weights hot?" A platform serving many agents should reach for MoE and let throughput amortize the memory. A solo self-hosted agent, or anything with spiky low-volume traffic, is usually better off dense — or renting the MoE from someone who is running it hot, so you pay per token instead of per idle GPU-hour. The architecture you can afford is a function of your utilization, not your benchmark envy.
Parameter counts and the Mixtral-vs-Llama-2 comparison are taken from each model's published paper; VRAM estimates are standard FP16/4-bit calculations from the model's total parameter count and vary with serving stack, context length, and KV-cache budget. No live pricing is quoted.



