There is a question that shows up in every infra Slack, every procurement doc, every "help me pick" thread, and it is the wrong question: NVIDIA Dynamo vs vLLM — which should we use?

It's wrong the way "should I buy a car or an engine" is wrong. You don't choose between them. One goes inside the other.

vLLM is an inference engine. So are SGLang and TensorRT-LLM. An engine takes a model and a pool of GPUs and serves requests out of a single replica: it owns the attention kernels, the continuous batching, the prefix cache, and the tensor parallelism that splits one model across the cards in a box. It is the thing that actually runs the math. With 83.7k GitHub stars, vLLM is the engine the field defaulted to, and if you've read our "vLLM vs TensorRT-LLM vs TGI" or "vLLM vs SGLang vs Ollama" pieces, you already know how to pick one.

NVIDIA Dynamo and llm-d are not engines. They are orchestrators. They sit a layer up, above a fleet of those engines, and decide which replica gets which request, how many prefill workers and how many decode workers to run, and where the KV cache lives. Dynamo runs vLLM underneath it. So does llm-d. Asking "Dynamo or vLLM" is asking which floor of the building you'd like to live on.

An engine serves one replica. An orchestrator decides what a thousand replicas do. The category error is treating those as competitors instead of layers.
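
To make "which replica gets which request" concrete, here is a minimal sketch of cache-aware routing. It is not how Dynamo or llm-d implement it; the Replica class, its fields, and the scoring rule are all hypothetical, chosen only to show the shape of the decision: prefer the replica that already holds the longest matching prompt prefix, and break ties by load.

```python
from dataclasses import dataclass, field


@dataclass
class Replica:
    # Hypothetical view of one engine replica as the router sees it.
    name: str
    active_requests: int = 0
    # Token prefixes this replica is assumed to hold in its prefix cache.
    cached_prefixes: list = field(default_factory=list)


def shared_prefix_len(a, b):
    """Number of leading tokens that two token sequences have in common."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def pick_replica(replicas, prompt_tokens):
    """Pick the replica with the longest cached prefix; ties go to the least loaded."""
    def score(replica):
        best_overlap = max(
            (shared_prefix_len(p, prompt_tokens) for p in replica.cached_prefixes),
            default=0,
        )
        return (best_overlap, -replica.active_requests)

    return max(replicas, key=score)


fleet = [
    Replica("a", active_requests=5, cached_prefixes=[(1, 2, 3)]),
    Replica("b", active_requests=0),
]
print(pick_replica(fleet, (1, 2, 3, 4)).name)  # cache hit outweighs load
print(pick_replica(fleet, (9, 9)).name)        # no cache hit, so least loaded wins
```

A single engine never has to make this choice, because it only ever sees its own cache. The decision exists only once there is a fleet to choose from.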

The technique that makes orchestration worth it

The reason this layer exists at all is one architectural move: disaggregated serving.

LLM inference has two phases that look nothing alike. Prefill reads your whole prompt at once: compute-bound, bursty, hungry for raw FLOPs. Decode generates tokens one at a time: memory-bandwidth-bound, latency-sensitive, and it crawls along holding the KV cache for the life of the request. Run both on the same GPU and they fight. A long prefill stalls everyone's decode; a decode-heavy batch leaves the compute that prefill wants sitting idle.
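
A toy model makes the fight concrete. It assumes a hypothetical scheduler that runs each arriving prefill to completion before the next decode step, and it ignores mitigations such as chunked prefill as well as the cache-transfer cost that splitting the phases would add. The function name and every number are illustrative, not measurements.

```python
def token_gaps(decode_step_ms, num_tokens, prefill_arrivals, colocated):
    """Return the time between consecutive output tokens for one running request.

    prefill_arrivals maps a decode step index to the duration (ms) of a prefill
    for some other request that shows up just before that step.
    """
    gaps = []
    for step in range(num_tokens):
        gap = decode_step_ms
        if colocated:
            # Same GPU: the decode step waits for the whole prefill to finish.
            gap += prefill_arrivals.get(step, 0.0)
        gaps.append(gap)
    return gaps


arrivals = {3: 400.0}  # one long prompt lands before the fourth token
shared = token_gaps(20.0, 8, arrivals, colocated=True)
split = token_gaps(20.0, 8, arrivals, colocated=False)

print(max(shared))  # 420.0: one token stalls behind the prefill
print(max(split))   # 20.0: decode never sees the prefill
```

The average gap barely moves, but the worst one jumps by the full prefill time, and the worst one is what a latency SLO measures.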

Disaggregation splits them onto separate GPU pools. Prefill workers chew prompts; decode workers stream tokens; each pool scales on its own demand curve. The catch is that the KV cache computed during prefill has to physically move to the decode worker, fast, or the whole idea collapses under transfer latency.

That's what NIXL — the NVIDIA Inference Xfer Library — is for. It shuttles KV-cache tensors from prefill GPUs to decode GPUs over RDMA, InfiniBand, or NVMe at wire speed. Here is the detail people miss in the rivalry framing: both Dynamo and llm-d use NIXL. It's shared plumbing. The orchestrators differ in scheduling and surface, not in how the cache moves.
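
It helps to put rough numbers on what has to move. The sketch below uses the standard size of a KV cache (keys plus values, per layer, per KV head, per token) and divides by a link bandwidth. The model shape (32 layers, 8 KV heads, head dimension 128, 2-byte values) and the 50 GB/s link are assumptions picked for illustration, not figures for any deployment discussed here, and the estimate ignores protocol overhead.

```python
def kv_cache_bytes(num_tokens, num_layers, num_kv_heads, head_dim, bytes_per_value=2):
    """Bytes of KV cache for one sequence: the factor 2 covers keys and values."""
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_value * num_tokens


def transfer_seconds(num_bytes, link_gigabytes_per_s):
    """Best-case time to move num_bytes over a link, ignoring any overhead."""
    return num_bytes / (link_gigabytes_per_s * 1e9)


# Assumed shape: 32 layers, 8 KV heads, head_dim 128, 16-bit values, 8192-token prompt.
size = kv_cache_bytes(num_tokens=8192, num_layers=32, num_kv_heads=8, head_dim=128)

print(f"{size / 2**30:.2f} GiB")                       # 1.00 GiB
print(f"{transfer_seconds(size, 50.0) * 1e3:.1f} ms")  # 21.5 ms
```

Under these assumptions a single long prompt hands off about a gibibyte of cache. On a fast fabric that takes tens of milliseconds; on a link ten times slower it costs more than the stall that disaggregation was meant to remove.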

Dynamo: the NVIDIA-stack orchestrator

NVIDIA Dynamo went GA at 1.0 on March 16, 2026 at GTC, pitched as the "operating system" for AI factories. The important and slightly counterintuitive fact: Dynamo is open source (Apache 2.0, mostly Rust with Python and Go) and backend-agnostic — it orchestrates vLLM, SGLang, and TensorRT-LLM. NVIDIA built the orchestration layer and let you bring whichever engine you like underneath. At 7.3k stars it's young, but adoption is moving fast.

What Dynamo adds above the engine:

The headline number you will see quoted everywhere deserves a label. NVIDIA reports up to 30x more requests served on the open DeepSeek-R1 model running on GB200 NVL72, via disaggregated serving — and roughly 2x on Llama on Hopper. That is NVIDIA's own benchmark, on NVIDIA's newest hardware, on a model chosen to flatter the architecture. It is a real result and a vendor result. The 30x is the ceiling of an ideal case, not a number you should put in a capacity plan before you've run your own traffic.

llm-d: the Kubernetes-native orchestrator

llm-d (Apache 2.0, 3.4k stars) is the Red Hat-led answer, announced at Red Hat Summit in May 2025 with CoreWeave, Google Cloud, IBM Research, and — note this — NVIDIA among the founding contributors. It solves the same problem from the other end. Where Dynamo is the NVIDIA stack's orchestrator, llm-d is Kubernetes-native and vendor-neutral: vLLM-based serving, an Inference Gateway for KV-cache-aware routing, and the same disaggregated prefill/decode model, also riding NIXL for cache transport. Red Hat's own figure is 70% higher tokens/sec from P/D disaggregation versus a flat vLLM deployment — again, a vendor number, again worth reproducing before you trust it.

The philosophical split is the same one that runs through all infrastructure: bet on one vendor's integrated stack, or bet on the open, portable, slightly-more-assembly-required layer. Dynamo is the deepest integration with NVIDIA silicon. llm-d is the choice if your fleet already lives on Kubernetes and you don't want your inference layer married to one accelerator. That NVIDIA contributes to both, and that both depend on NIXL, tells you the war is over the control plane, not the cache.


When you need none of this

The most honest thing in this piece: most teams reading it should serve a model with vllm serve and walk away.

If you run one model on one GPU, or a single node, at modest QPS, a single vLLM instance with continuous batching and prefix caching will keep your hardware busy and still meet your SLO. Disaggregation, NIXL, KV-aware routing: all of it is overhead you pay to coordinate across nodes. Below that scale it's pure cost: more moving parts, more failure modes, more 3 a.m. pages, for throughput you weren't going to use. (Sizing the single-box case is its own question; see "how much VRAM to serve an LLM.")
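
For scale, here is roughly all the client code the single-box case needs. vllm serve exposes an OpenAI-compatible HTTP API; the sketch assumes the server is on its default local address and that you pass the same model name you served. The helper names are hypothetical.

```python
import json
import urllib.request


def build_completion_request(prompt, model, base_url="http://localhost:8000", max_tokens=64):
    """Build a POST to the OpenAI-compatible completions endpoint."""
    body = json.dumps(
        {"model": model, "prompt": prompt, "max_tokens": max_tokens}
    ).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/v1/completions",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def complete(prompt, model, **kwargs):
    """Send the request and return the generated text (needs a running server)."""
    request = build_completion_request(prompt, model, **kwargs)
    with urllib.request.urlopen(request) as response:
        return json.load(response)["choices"][0]["text"]


# Example, with a server already running and "your-model-name" replaced
# by the model you actually served:
# print(complete("The capital of France is", model="your-model-name"))
```

There is no router, no worker pools, and no cache transport anywhere in that picture, and at this scale nothing is missing.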

Reach for an orchestrator when one replica can no longer hold your traffic and you're running pools of GPUs across machines. Not before. The orchestration layer earns its keep at the fleet, and nowhere smaller.

So the decision tree is shorter than the marketing implies. Single node, single model, normal load: vLLM, full stop. Multi-node fleet where prefill and decode want to scale apart: now you're choosing an orchestrator — and that choice is Dynamo (NVIDIA stack, backend-agnostic) versus llm-d (Kubernetes-native, vendor-neutral), not orchestrator versus engine. The two layers were never on the same shelf.
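
Written out as code, the tree really is that short. This is only a paraphrase of the paragraph above, with hypothetical parameter names, and it deliberately ignores everything a real evaluation would weigh (hardware, team skills, existing tooling).

```python
def recommend(multi_node_fleet, wants_vendor_neutral_kubernetes=False):
    """Return which layers to run, following the article's decision tree."""
    if not multi_node_fleet:
        # Single node, single model, normal load: the engine alone is enough.
        return "engine only (vLLM)"
    if wants_vendor_neutral_kubernetes:
        return "llm-d orchestrating an engine"
    return "Dynamo orchestrating an engine"


print(recommend(multi_node_fleet=False))
print(recommend(multi_node_fleet=True, wants_vendor_neutral_kubernetes=True))
print(recommend(multi_node_fleet=True))
```

Notice that no branch returns an orchestrator instead of an engine. Every fleet answer still has an engine underneath it.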