Send an LLM request to a round-robin load balancer and it will do exactly what it was built to do: pick the next pod in the ring, regardless of whether that pod is three requests deep in a queue with its KV cache thrashing while the pod beside it sits idle. For a stateless web request that's fine — any replica is as good as any other. For an inference request it's close to malpractice, because the thing you are routing to is not interchangeable. It's a GPU holding warm caches and loaded adapters, and its cost to serve your request depends entirely on what it's already doing.
Kubernetes now has a stable, standard API for the fix. The Gateway API Inference Extension (GIE), from SIG-Network, teaches the gateway to route by what the model servers are actually experiencing. Its InferencePool type is GA, served under inference.networking.k8s.io/v1. It is still installed as a CRD, like the rest of Gateway API, but it is no longer a proposal or an experiment: it carries the kind of v1 stability commitment you expect from a Service.
Why an inference request breaks a normal gateway
The premise is that LLM traffic violates the assumptions a traditional balancer is built on. Requests are long-running rather than finished in milliseconds. They're GPU-bound, so an overloaded backend doesn't just get slow, it starts evicting caches and queueing. And they're partially stateful: a server holds in-memory token caches across a session and may have specific LoRA adapters resident. A balancer that sees only HTTP paths and health checks is blind to all of it. It cannot tell a warm replica from a saturated one, and it has no concept that an interactive chat should outrank a batch job for the same model.
A round-robin balancer treats GPU pods as interchangeable. They are the least interchangeable thing in your stack.
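To make the blindness concrete, here is a toy sketch in Python. It is not GIE code: the function names and the three-pod starting state are assumptions chosen for illustration, and queue depth stands in for everything a real picker would measure.

```python
from itertools import cycle

def round_robin(queues, n_requests):
    """Assign requests in ring order, ignoring load. Returns final queue depths."""
    depths = list(queues)
    ring = cycle(range(len(depths)))
    for _ in range(n_requests):
        depths[next(ring)] += 1
    return depths

def least_queue(queues, n_requests):
    """Assign each request to the pod with the shortest queue at that moment."""
    depths = list(queues)
    for _ in range(n_requests):
        idx = min(range(len(depths)), key=depths.__getitem__)
        depths[idx] += 1
    return depths

# Hypothetical starting state: pod 0 is three requests deep, pods 1 and 2 are idle.
initial = [3, 0, 0]
rr = round_robin(initial, 3)
lq = least_queue(initial, 3)
print("round-robin:", rr, "deepest queue:", max(rr))
print("least-queue:", lq, "deepest queue:", max(lq))
```

With three new requests, round-robin pushes the already-busy pod to a depth of four, while the load-aware policy sends all three to the idle pods. A real picker weighs more than queue depth, but the failure mode is the same.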
Two roles, one picker
GIE splits the problem along the org chart. An InferencePool groups the pods that share an accelerator type, base model, and model server — the platform operator's object, the "where and how it's served." An InferenceModel (evolving toward InferenceObjective on some platforms, like GKE) is the model owner's object: what model is served, its version, and its criticality, so the system knows a chat request and a bulk-embedding job are not equals. That division is quietly the best part of the design: the two teams that always fight over a serving stack finally own separate, well-defined halves.
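One way to picture the split is as two small records with different owners. The Python sketch below is a hypothetical data model, not the CRD schema: the class names, fields, criticality levels, and the `admit` helper are all assumptions made for illustration. It shows how a criticality attached by the model owner could decide which requests survive when a pool has room for only some of them.

```python
from dataclasses import dataclass
from enum import IntEnum

class Criticality(IntEnum):
    # Illustrative levels only; the real API defines its own priority scheme.
    SHEDDABLE = 0
    STANDARD = 1
    CRITICAL = 2

@dataclass(frozen=True)
class Pool:
    """Platform operator's half: where and how the model is served."""
    name: str
    accelerator: str
    base_model: str
    model_server: str

@dataclass(frozen=True)
class Objective:
    """Model owner's half: what is served and how much it matters."""
    model_name: str
    pool: str
    criticality: Criticality

def admit(pending, capacity):
    """Keep the `capacity` most critical requests; ties keep arrival order."""
    ranked = sorted(pending, key=lambda o: -o.criticality)  # sorted() is stable
    return ranked[:capacity], ranked[capacity:]

pool = Pool("chat-pool", "example-gpu", "example-base-model", "example-server")
pending = [
    Objective("bulk-embed", pool.name, Criticality.SHEDDABLE),
    Objective("chat", pool.name, Criticality.CRITICAL),
    Objective("summarize", pool.name, Criticality.STANDARD),
]
admitted, shed = admit(pending, capacity=2)
print([o.model_name for o in admitted], [o.model_name for o in shed])
```

The operator never touches `Objective`, and the model owner never touches `Pool`; the only link between them is the pool name.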
The routing itself is delegated. When a request hits the gateway, the proxy (typically Envoy) doesn't pick a pod. It consults an Endpoint Picker (EPP) over Envoy's external-processing (ext-proc) protocol. The EPP watches live metrics from every pod in the pool, including queue length, KV-cache utilization, and which adapters are loaded, and hands back the endpoint it expects to serve the request with the least added latency, avoiding the eviction-and-requeue spiral that kills tail latency under load. The gateway then forwards directly to that pod. The scheduling algorithm is explicitly KV-cache- and request-cost-aware. Round-robin never had a chance.
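The picking step can be sketched as a scoring function over the three signals named above. This is a minimal Python illustration, not the EPP's actual algorithm: the `PodMetrics` fields, the weights, the `max_queue` normalizer, and the pod values are all assumptions.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class PodMetrics:
    name: str
    queue_len: int                       # requests waiting on this pod
    kv_cache_util: float                 # fraction of KV cache in use, 0.0 to 1.0
    loaded_adapters: frozenset = frozenset()

def score(pod, adapter=None, max_queue=8.0):
    """Higher is better. Weights are hypothetical, chosen only for illustration."""
    queue_score = 1.0 - min(pod.queue_len / max_queue, 1.0)
    cache_score = 1.0 - pod.kv_cache_util
    # A pod that already has the requested adapter resident avoids a load.
    has_adapter = adapter is None or adapter in pod.loaded_adapters
    adapter_score = 1.0 if has_adapter else 0.0
    return 0.4 * queue_score + 0.4 * cache_score + 0.2 * adapter_score

def pick_endpoint(pods, adapter=None):
    """Return the pod with the best score for this request."""
    return max(pods, key=lambda p: score(p, adapter))

pods = [
    PodMetrics("pod-a", queue_len=3, kv_cache_util=0.95,
               loaded_adapters=frozenset({"sql-lora"})),
    PodMetrics("pod-b", queue_len=0, kv_cache_util=0.10),
    PodMetrics("pod-c", queue_len=1, kv_cache_util=0.30,
               loaded_adapters=frozenset({"sql-lora"})),
]
print(pick_endpoint(pods).name)                      # base-model request
print(pick_endpoint(pods, adapter="sql-lora").name)  # adapter-specific request
```

In this sketch the idle pod wins a base-model request, but a request for the `sql-lora` adapter goes to the lightly loaded pod that already has it resident, rather than to the saturated pod that also has it or the idle pod that would have to load it.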
It's already the substrate
The reason to care now rather than later is that the ecosystem has quietly standardized on it. Istio v1.28 ships full InferencePool v1 support. NGINX Gateway Fabric becomes an inference gateway when paired with the Endpoint Picker. Agentgateway implements it as a Gateway API provider. And GKE Inference Gateway is simply the llm-d router operating in gateway mode — the Gateway calls the EPP over ext-proc, then forwards to the chosen pod. If you're running the KServe-plus-llm-d stack, you are already standing on GIE whether or not you've read the spec.
There's a governance detail worth noting, because it tells you the project understands the difference between a standard and a product. The Endpoint Picker's logic — the actual scheduling intelligence — is migrating out to the llm-d/llm-d-inference-scheduler project, while the Kubernetes repository keeps the lightweight EPP, the InferencePool API, and the conformance tests. That's the right seam: the API and its conformance suite live in a vendor-neutral home, and the competitive routing algorithms live where implementers can iterate fast. It's the same lesson the OpenTelemetry GenAI conventions are learning from the observability side — standardize the interface, let the implementations race.
The shift underneath
The headline isn't a new CRD. It's that the load balancer stopped being a pass-through. For a decade the gateway's job was to not think — spread connections, check liveness, get out of the way. Inference inverts that. The gateway now has to understand models, priorities, cache state, and failure modes, because those are the variables that decide whether a GPU-hour is spent well or wasted. GIE is Kubernetes conceding the point and giving that intelligence a stable place to live. If you're serving models on Kubernetes and still routing them like web pages, the standard has moved on without you.



