A trillion-parameter model sounds like a serving nightmare until you notice how little of it runs at once. Kimi K2 carries a trillion parameters but fires only 32 billion of them per token. DeepSeek-V3 and R1 are 671B total and activate 37B. Qwen3-235B-A22B is exactly what its name says: 235B on disk, 22B awake per token. These are mixture-of-experts models — the feed-forward block is split into many experts, a small router picks a handful per token, and the rest sit dark.
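To make the sparsity concrete, here is a small arithmetic sketch using only the parameter counts quoted above. The helper name `active_fraction` is made up for this illustration.

```python
# Total and active parameter counts in billions, as quoted in the text above.
MODELS = {
    "Kimi K2": (1000, 32),
    "DeepSeek-V3/R1": (671, 37),
    "Qwen3-235B-A22B": (235, 22),
}

def active_fraction(total_b: float, active_b: float) -> float:
    """Share of the model's parameters that run for a single token."""
    return active_b / total_b

for name, (total_b, active_b) in MODELS.items():
    share = active_fraction(total_b, active_b)
    print(f"{name}: {share:.1%} of weights active per token")
```

In every case, well under a tenth of the weights do any work for a given token.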

That sparsity is the whole reason these models are affordable to run. But it also breaks the serving playbook. The usual way to spread a model across GPUs — tensor parallelism, slicing every weight matrix into shards — fights the structure of an MoE. If only 8 of 256 experts fire for a given token, slicing all 256 across every GPU means each device is mostly holding and synchronizing weights that won't be touched. You pay full communication cost for a fraction of the compute.

The layout that matches the sparsity

Expert parallelism inverts the deal. Instead of splitting each expert across GPUs, you keep each expert whole and put different experts on different GPUs. NVIDIA's TensorRT-LLM docs put the distinction cleanly: under TP, every GPU holds a partial slice of all experts and receives every token's hidden state; under EP, every GPU holds the full weights of a subset of experts and receives only the tokens routed to those experts. The router decides where each token needs to go, and the system delivers it there.
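The TP-versus-EP distinction can be sketched by counting how many token hidden states each GPU has to receive. This is a toy model, not any framework's code: the sizes, the contiguous placement in `owner_of`, and the uniform random router are all assumptions chosen for illustration.

```python
import random

NUM_EXPERTS = 16   # illustrative, far smaller than a real model
NUM_GPUS = 4
TOP_K = 2
NUM_TOKENS = 1000

def owner_of(expert: int) -> int:
    """Contiguous EP placement: experts 0-3 on GPU 0, 4-7 on GPU 1, and so on."""
    return expert // (NUM_EXPERTS // NUM_GPUS)

def tokens_received(routes: list[list[int]]) -> tuple[list[int], list[int]]:
    """Return per-GPU token counts under TP and under EP."""
    # TP: every GPU holds a slice of every expert, so it sees every token.
    tp = [len(routes)] * NUM_GPUS
    # EP: a GPU only sees tokens routed to at least one expert it owns.
    ep = [0] * NUM_GPUS
    for experts in routes:
        for gpu in {owner_of(e) for e in experts}:
            ep[gpu] += 1
    return tp, ep

rng = random.Random(0)  # stand-in for a learned router
routes = [rng.sample(range(NUM_EXPERTS), TOP_K) for _ in range(NUM_TOKENS)]
tp, ep = tokens_received(routes)
print("TP tokens per GPU:", tp)
print("EP tokens per GPU:", ep)
```

Under EP each GPU receives only a subset of the batch, and that subset shrinks as the experts are spread over more GPUs.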

That delivery is the catch. An MoE layer under EP needs two all-to-all collectives. First a dispatch: every GPU ships its tokens to whichever GPUs own the experts those tokens picked. Then the experts run. Then a combine: the outputs get shipped back to the GPUs the tokens came from. This happens on every MoE layer, for every token, in both directions.
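The dispatch, compute, combine sequence can be simulated in plain Python. This is a minimal sketch with lists standing in for GPUs. `EXPERT_OWNER`, `expert_fn`, and `moe_layer` are hypothetical names, and the combine step simply sums expert outputs, where a real layer would weight them by the router's scores.

```python
from collections import defaultdict

EXPERT_OWNER = {0: 0, 1: 0, 2: 1, 3: 1}  # expert id -> GPU holding it (assumed layout)

def expert_fn(expert: int, x: float) -> float:
    """Stand-in for an expert FFN: a distinct scaling per expert."""
    return x * (expert + 1)

def moe_layer(tokens_per_gpu, routes_per_gpu):
    # Dispatch (first all-to-all): ship each token to the GPUs that own its experts.
    inbox = defaultdict(list)
    for src, tokens in enumerate(tokens_per_gpu):
        for slot, x in enumerate(tokens):
            for expert in routes_per_gpu[src][slot]:
                inbox[EXPERT_OWNER[expert]].append((src, slot, expert, x))

    # Expert compute: each GPU runs only the experts it owns.
    outbox = defaultdict(list)
    for gpu, items in inbox.items():
        for src, slot, expert, x in items:
            outbox[src].append((slot, expert_fn(expert, x)))

    # Combine (second all-to-all): results return to the source GPU and are summed.
    outputs = [[0.0] * len(tokens) for tokens in tokens_per_gpu]
    for src, items in outbox.items():
        for slot, y in items:
            outputs[src][slot] += y
    return outputs

tokens = [[1.0, 2.0], [3.0]]              # GPU 0 holds two tokens, GPU 1 holds one
routes = [[[0, 2], [1, 3]], [[0, 1]]]     # experts chosen for each token
print(moe_layer(tokens, routes))          # [[4.0, 12.0], [9.0]]
```

Note that the token on GPU 1 picked experts 0 and 1, both of which live on GPU 0, so its hidden state crosses the wire twice: once out, once back.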

The expert math is cheap. Moving the tokens to the right GPU and back is what costs you. Expert parallelism is a networking problem wearing a model-architecture costume.

Inside a single server, where GPUs talk over NVLink, all-to-all is fast. The moment experts span multiple nodes and that traffic crosses RDMA or Ethernet, the all-to-all becomes the dominant cost, because the inter-node path is roughly an order of magnitude slower per byte than intra-node NVLink. This is why DeepSeek wrote and open-sourced DeepEP, a communication library of all-to-all kernels tuned specifically for MoE dispatch and combine. It ships separate high-throughput kernels for prefill and low-latency kernels for decode, plus a hook-based mechanism that overlaps the communication with computation so the GPUs aren't stalling on the wire.
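A back-of-envelope sketch shows why the wire matters. Everything numeric here is an assumption for illustration: the batch size, the 2-byte activations, the hidden size of 7168, and especially the two bandwidth figures, which are placeholders picked only to reflect a roughly 10x gap and are not measurements of any hardware.

```python
def dispatch_bytes(num_tokens: int, top_k: int, hidden_size: int, bytes_per_value: int) -> int:
    """Upper bound on one-way dispatch traffic for a single MoE layer.

    Assumes every selected expert lives on a different remote GPU, so each
    token's hidden state is sent top_k times.
    """
    return num_tokens * top_k * hidden_size * bytes_per_value

def transfer_ms(num_bytes: int, gigabytes_per_s: float) -> float:
    """Time to move num_bytes at a given bandwidth, ignoring latency and overlap."""
    return num_bytes / (gigabytes_per_s * 1e9) * 1e3

# Placeholder workload: 4096 tokens, top-8 routing, hidden size 7168, 2-byte values.
payload = dispatch_bytes(num_tokens=4096, top_k=8, hidden_size=7168, bytes_per_value=2)

# Placeholder bandwidths (GB/s), chosen only to illustrate a 10x difference.
for label, bandwidth in (("fast intra-node link", 400.0), ("slow inter-node link", 40.0)):
    print(f"{label}: {payload / 1e6:.0f} MB in {transfer_ms(payload, bandwidth):.2f} ms")
```

The combine step moves a similar volume in the other direction, and the whole exchange repeats on every MoE layer, which is why overlapping it with compute pays off.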

The hot-expert problem

There's a second failure mode that's subtler and, in practice, nastier. Routing is learned, and learned routing is not uniform. Some experts are popular; others rarely fire. Spread experts evenly across GPUs and you get the opposite of an even load — the GPU holding a hot expert drowns while its neighbors idle, and because every layer ends in an all-to-all barrier, everyone waits for the slowest GPU.
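The effect of skewed routing is easy to reproduce in a toy simulation. The sketch below assumes top-1 routing, a contiguous expert placement, and made-up popularity weights. The ratio it reports (busiest GPU over the average) is a rough proxy for how much longer the barrier takes than it would under a perfectly even load.

```python
import random

def gpu_loads(expert_weights, num_gpus, num_tokens, seed=0):
    """Tokens landing on each GPU when experts are spread evenly and contiguously."""
    rng = random.Random(seed)
    experts = list(range(len(expert_weights)))
    per_gpu = len(experts) // num_gpus
    loads = [0] * num_gpus
    for _ in range(num_tokens):
        # Top-1 routing for simplicity; weights model learned popularity.
        expert = rng.choices(experts, weights=expert_weights, k=1)[0]
        loads[expert // per_gpu] += 1
    return loads

def imbalance(loads):
    """Busiest GPU relative to the mean; 1.0 means perfectly even."""
    return max(loads) / (sum(loads) / len(loads))

uniform = gpu_loads([1] * 8, num_gpus=4, num_tokens=10_000)
skewed = gpu_loads([10, 1, 1, 1, 1, 1, 1, 1], num_gpus=4, num_tokens=10_000)
print("uniform routing:", uniform, f"imbalance {imbalance(uniform):.2f}")
print("one hot expert: ", skewed, f"imbalance {imbalance(skewed):.2f}")
```

With one expert ten times as popular as the rest, the GPU that holds it ends up with well over twice the average load while the other three wait at the barrier.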

DeepSeek's answer, also open-sourced, is EPLB — an Expert Parallelism Load Balancer. It estimates each expert's load and computes a placement plan that replicates the hot experts onto extra GPUs so their traffic is shared, packing the duplicates to level per-GPU utilization. It even uses different strategies for the two phases: a hierarchical pack for the smaller expert-parallel width of prefill, a global replication for the very wide decode. Redundant experts are not waste here; they're how you stop one popular expert from setting the clock for the whole cluster.
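The replicate-then-pack idea can be illustrated with a simplified greedy heuristic. To be clear, this is not EPLB's actual algorithm or API: `replicate` and `pack` are hypothetical helpers, the loads are invented, and the sketch ignores node topology and allows two replicas of one expert to share a GPU.

```python
def replicate(loads, num_redundant):
    """Give each extra slot to the expert with the highest load per replica."""
    replicas = [1] * len(loads)
    for _ in range(num_redundant):
        hottest = max(range(len(loads)), key=lambda e: loads[e] / replicas[e])
        replicas[hottest] += 1
    return replicas

def pack(loads, replicas, num_gpus):
    """Place replicas heaviest-first on the least-loaded GPU that has a free slot."""
    items = []
    for expert, count in enumerate(replicas):
        items.extend([(loads[expert] / count, expert)] * count)
    items.sort(reverse=True)
    slots = len(items) // num_gpus  # assumes replicas divide evenly across GPUs
    gpu_load = [0.0] * num_gpus
    placement = [[] for _ in range(num_gpus)]
    for load, expert in items:
        open_gpus = [g for g in range(num_gpus) if len(placement[g]) < slots]
        target = min(open_gpus, key=lambda g: gpu_load[g])
        placement[target].append(expert)
        gpu_load[target] += load
    return placement, gpu_load

loads = [90, 30, 20, 20, 10, 10]          # invented per-expert token counts
replicas = replicate(loads, num_redundant=2)
placement, gpu_load = pack(loads, replicas, num_gpus=4)
print("replicas per expert:", replicas)   # [3, 1, 1, 1, 1, 1]
print("experts on each GPU:", placement)
print("load on each GPU:   ", gpu_load)   # [50.0, 50.0, 40.0, 40.0]
```

Without replication, whichever GPU held expert 0 would carry at least 90 units of load. Splitting it three ways brings the busiest GPU down to 50.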

What this looks like in production

DeepSeek published the actual shape of their serving system, and it's the clearest map available. Prefill runs the routed experts at EP32 across four nodes; decode runs them at a striking EP144 across eighteen. Crucially, they don't expert-parallelize everything: attention is dense, not sparse, so they run it data-parallel (each engine owns its own request shard and KV cache) while only the experts go expert-parallel. That pairing, EP for the FFN and DP for attention, is now the standard large-MoE layout, and vLLM and SGLang both implement it for the same reason: tensor-parallelizing MLA (multi-head latent attention) duplicates the KV cache on every rank and shrinks your batch, which is exactly what you don't want. On that system DeepSeek reports roughly 73.7k input tokens/sec and 14.8k output tokens/sec per H800 node.
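The EP widths and node counts above imply some simple arithmetic, sketched here. The `Phase` class and `padding_for_even_split` helper are hypothetical, the sketch assumes one EP rank per GPU, and the 256 routed experts figure is borrowed from the "8 of 256" example earlier in this article.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Phase:
    name: str
    ep_width: int   # number of expert-parallel ranks (assumed one per GPU)
    nodes: int

    @property
    def gpus_per_node(self) -> int:
        return self.ep_width // self.nodes

def padding_for_even_split(num_experts: int, ep_width: int) -> int:
    """Extra expert slots needed so every rank holds the same number of experts."""
    return -num_experts % ep_width

NUM_ROUTED_EXPERTS = 256  # from the "8 of 256" example, assumed here

for phase in (Phase("prefill", 32, 4), Phase("decode", 144, 18)):
    extra = padding_for_even_split(NUM_ROUTED_EXPERTS, phase.ep_width)
    per_gpu = (NUM_ROUTED_EXPERTS + extra) // phase.ep_width
    print(f"{phase.name}: {phase.gpus_per_node} GPUs/node, "
          f"{extra} extra slots for an even split, {per_gpu} experts per GPU")
```

Both phases work out to eight GPUs per node. At EP144, 256 experts do not divide evenly, so an even layout needs extra slots, and redundant copies of hot experts are a natural way to fill them.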

The open ecosystem has caught up fast. The SGLang team reproduced DeepSeek's full system — PD disaggregation, wide EP, DeepEP, EPLB, two-batch overlap — on 96 H100s and reported per-node throughput in the same ballpark. vLLM ships Wide-EP (expert parallel plus DP attention plus dual-batch overlap to hide the all-to-all), and NVIDIA's Dynamo pushes wide EP onto the GB200 NVL72, whose 72-GPU NVLink domain is, not coincidentally, a way to keep more of that all-to-all on the fast wire.

When it's worth it, and when it isn't

Here's the part the architecture diagrams don't tell you. Wide expert parallelism only lowers your cost per token at high concurrency. The whole point is to spread experts across many GPUs — but a scattered expert is only earning its keep if tokens are constantly arriving for it. Run a wide-EP deployment at low traffic and most of your experts sit idle each step while you still pay the all-to-all tax to talk to them. That's why DeepSeek reserves its widest layout (EP144) for decode, where aggregated concurrency across many users is highest, and uses a narrower EP32 for prefill.
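The concurrency argument can be put in numbers with a short probability sketch. It assumes routing is uniform and independent across tokens, which the hot-expert discussion above says is not true in practice, so treat it as an optimistic bound. `idle_fraction` is a name invented for this example.

```python
def idle_fraction(num_experts: int, top_k: int, tokens_in_step: int) -> float:
    """Expected share of experts that receive no token in one step.

    A single token misses a given expert with probability 1 - top_k / num_experts;
    with independent tokens, the expert stays idle only if every token misses it.
    """
    return (1 - top_k / num_experts) ** tokens_in_step

# 256 experts with top-8 routing, as in the example earlier in the article.
for tokens in (1, 8, 32, 128, 512):
    share = idle_fraction(256, 8, tokens)
    print(f"{tokens:4d} tokens in flight: {share:.1%} of experts idle")
```

With a handful of tokens in flight most experts do nothing in a given step. It takes hundreds of concurrent tokens before nearly every expert has work, and real skewed routing makes the low-traffic case worse.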

So expert parallelism is not a latency trick or a way to fit a model on fewer GPUs — for the latter, see how much VRAM a model actually needs, and for the prefill/decode split that EP rides on top of, the two speeds of inference. EP is a throughput weapon for high-volume serving of large sparse models. If you're running DeepSeek or Kimi K2 for a busy product, it's how the economics close. If you're serving a few requests a second, it's overhead pretending to be scale. The sparsity that makes these models cheap to think with is the same sparsity that makes them expensive to serve unless you keep them busy — and keeping them busy is the entire game.