For two years the hard problem in LLM serving was inside the cluster. A single H100 can't hold a 70B model at useful throughput, so the stack learned to spread one model across many GPUs and route requests intelligently between the shards. It got genuinely good at it. That is the whole story of KServe, llm-d, and Dynamo, and it's a story that ends at the cluster boundary.

The problem is that in 2026, the cluster boundary is not where your GPUs are.

The stack we standardized on

The consensus architecture is a clean division of labor. KServe is the control plane: it extends Kubernetes with an InferenceService — and, newer, a purpose-built LLMInferenceService CRD — to own lifecycle, autoscaling, and canary rollouts. llm-d is the intelligence layer on top: a Kubernetes-native distributed scheduler that does prefix-cache-aware routing through the Envoy AI Gateway and the Gateway API Inference Extension, so requests that share a prompt prefix land on the replica that already has the KV cache warm. NVIDIA Dynamo plays the datacenter-scale card, orchestrating disaggregated prefill and decode above the engines and claiming large throughput multipliers on reasoning workloads.
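To make "prefix-cache-aware" concrete, here is a minimal sketch of the routing idea, not llm-d's actual scheduler. The `Replica` class, the `pick_replica` function, and the character-level prefix match are all assumptions made for illustration; real schedulers typically match on hashed token blocks and weigh more signals than load.

```python
from dataclasses import dataclass, field


@dataclass
class Replica:
    """Hypothetical view of one model replica as seen by the router."""
    name: str
    cached_prefixes: set[str] = field(default_factory=set)  # prefixes with a warm KV cache
    in_flight: int = 0  # requests currently being served


def shared_prefix_len(a: str, b: str) -> int:
    """Length of the common leading substring of a and b."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def pick_replica(prompt: str, replicas: list[Replica]) -> Replica:
    """Prefer the replica with the longest cached-prefix match; break ties by lowest load."""
    def score(r: Replica) -> tuple[int, int]:
        best = max((shared_prefix_len(prompt, p) for p in r.cached_prefixes), default=0)
        return (best, -r.in_flight)
    return max(replicas, key=score)


replicas = [
    Replica("replica-a", {"You are a helpful assistant."}, in_flight=3),
    Replica("replica-b", set(), in_flight=0),
]
warm = pick_replica("You are a helpful assistant. Summarize this report.", replicas)
cold = pick_replica("Translate this sentence into French.", replicas)
print(warm.name, cold.name)
```

The first prompt goes to the busier replica because its cache is already warm; the second shares no prefix with anything cached, so it falls through to the least-loaded replica.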

kserve/kserve (Go, ★ 5.6k): CNCF control plane for model serving on Kubernetes; the InferenceService / LLMInferenceService CRDs own lifecycle, autoscaling, and canary rollouts, with pluggable runtimes (vLLM, TorchServe, custom)
llm-d/llm-d (Shell/Python, ★ 3.6k): Kubernetes-native distributed inference scheduler; prefix-cache-aware routing via Envoy AI Gateway and the Gateway API Inference Extension. 'If KServe is the control plane, llm-d is the scheduling layer.'

Read the three of them together and one word keeps recurring: cluster-wide. The efficiency is measured across the nodes of one Kubernetes cluster. The prerequisites in every tutorial start with "a cluster with at least two nodes." The architecture assumes intra-cluster distribution. Nobody is hiding this — it's simply the design center. And it's the wrong altitude for the market you're actually buying compute in.

Your GPUs are scattered on purpose

Here is the fact that reshapes the problem. SkyPilot's GPU Compass dashboard, which shops prices across 20-plus clouds, put on-demand H100 pricing anywhere from under \$2/hr to over \$10/hr per GPU in April 2026: a spread of at least 5x for identical silicon. Spot is wilder: AWS H100:8 spot peaked at \$1.52 per GPU-hour and fell to \$0.74 two weeks later, while Nebius and RunPod held near \$1.25. When the same GPU costs 5x more depending on where you rent it, and when capacity queues mean the cheap option is often simply unavailable, no serious team single-sources. You take H100s where you can get them: a home cluster on one cloud, burst capacity on two others, spot pools underneath.
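The arithmetic behind those figures is worth running once. The sketch below uses the quoted prices as round bounds (\$2 and \$10 on-demand, \$1.52 and \$0.74 spot); the 730-hour month and the helper names are assumptions added for illustration.

```python
HOURS_PER_MONTH = 730  # assumption: average number of hours in a month


def spread(low: float, high: float) -> float:
    """Ratio between the most and least expensive price."""
    return high / low


def pct_drop(peak: float, trough: float) -> float:
    """Percentage fall from peak to trough."""
    return (peak - trough) / peak * 100


def monthly_node_cost(price_per_gpu_hr: float, gpus: int = 8,
                      hours: int = HOURS_PER_MONTH) -> float:
    """Cost of running one multi-GPU node continuously for a month."""
    return price_per_gpu_hr * gpus * hours


print(f"on-demand spread: {spread(2.00, 10.00):.1f}x")
print(f"AWS spot drop in two weeks: {pct_drop(1.52, 0.74):.0f}%")
print(f"8-GPU node per month: ${monthly_node_cost(2.00):,.0f} vs ${monthly_node_cost(10.00):,.0f}")
```

At those bounds, one always-on 8-GPU node costs \$11,680 a month at the cheap end and \$58,400 at the expensive end, and the AWS spot price roughly halved in two weeks.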

When the same GPU costs 5x more depending on where you rent it, single-sourcing your fleet isn't discipline. It's leaving money and availability on the table.

So the real deployment is n clusters across m clouds. And the single-cluster control plane, however smart, gives you nothing here except multiplication: you now operate KServe plus Knative plus Istio plus llm-d, times every cluster, and stitch the endpoints together with a load balancer you babysit yourself. The intra-cluster brilliance doesn't compose across the fleet. Each cluster is an island that has never heard of the others.

The missing layer is cross-cluster

The interesting move in 2026 is to put a control plane above the clusters. SkyPilot Endpoints is the clearest expression of it: one YAML deploys the entire serving stack (engine, autoscaler, gateway, certificates, metrics) across any number of Kubernetes clusters, presented as a single endpoint URL. When autoscaling exhausts the home cluster's GPUs, the next replicas land on the next cluster that has free GPUs. When a cluster dies, its replicas are recreated on healthy ones and the URL never changes. It runs the same kind of engines underneath, and it isn't trying to out-schedule llm-d inside a cluster. It's solving the layer those tools skip.
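The spill-over and failover behavior described here can be sketched in a few lines. This is an illustration of the idea only, not SkyPilot's implementation: the `Cluster` class, the `place_replicas` function, the cluster names, and the simple priority-order policy are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Cluster:
    """Hypothetical fleet-level view of one Kubernetes cluster."""
    name: str
    free_gpus: int
    healthy: bool = True


def place_replicas(n_replicas: int, gpus_per_replica: int,
                   clusters: list[Cluster]) -> dict[str, int]:
    """Fill clusters in priority order (home first), spilling to the next when one is full.

    Unhealthy clusters are skipped, so calling this again after a failure
    moves that cluster's replicas onto the remaining healthy ones.
    """
    placement: dict[str, int] = {}
    remaining = n_replicas
    for c in clusters:
        if remaining == 0:
            break
        if not c.healthy:
            continue
        fit = min(remaining, c.free_gpus // gpus_per_replica)
        if fit > 0:
            placement[c.name] = fit
            remaining -= fit
    if remaining:
        raise RuntimeError(f"no capacity for {remaining} replica(s)")
    return placement


fleet = [Cluster("home", 16), Cluster("burst-a", 8), Cluster("burst-b", 32)]
print(place_replicas(4, 8, fleet))  # home fills first, the rest spill over

fleet[0].healthy = False            # simulate losing the home cluster
print(place_replicas(4, 8, fleet))  # the same 4 replicas land on the survivors
```

The caller never sees which cluster a replica landed on; in a real system the single endpoint URL sits in front of whatever this placement returns.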

skypilot-org/skypilot (Python, ★ 10.2k): Run and scale AI workloads across 20+ clouds, Kubernetes, and Slurm from one YAML; SkyServe/Endpoints add cross-cluster LLM serving with autoscaling, spot recovery, and single-URL failover

The lineage matters, because cross-cluster serving on cheap capacity only works if you survive spot preemption. SkyPilot's SkyServe research (the SpotHedge policy) is the receipt: serving models like Llama-2-70B on vLLM across spot instances in multiple regions, it held failure rates and latency low while cutting cost substantially against on-demand — roughly the 3–6x that managed spot promises when preemption recovery actually works. Cross-cluster placement isn't just a cost trick; it's what makes the volatile-but-cheap market usable at all.

The decision

Keep KServe, llm-d, and Dynamo — they earn their place inside each cluster, and nothing here replaces cache-aware routing or prefill/decode disaggregation. The question to ask is one level up: who owns the fleet? If your answer is "a load balancer and a runbook," you've stopped at the cluster line right where the 2026 GPU market stops cooperating. The abstraction worth adopting this year isn't a better in-cluster scheduler. It's the one that treats a dozen clusters on a dozen clouds as a single place to serve from.