The first time you scale an LLM service past one replica, you inherit a decision you've made a hundred times without thinking: what goes in front of the fleet? Every reflex you have points the same way. Put a load balancer there. Round-robin, or least-connections if you're feeling fancy. Spread the requests evenly. It's the most settled question in web infrastructure.
For inference servers, that reflex is not merely suboptimal. It's backwards — and the reason is a piece of per-replica state that classic load balancers were designed to ignore.
The state your balancer can't see
A vLLM or SGLang replica is not a stateless worker. As it serves requests it builds a KV cache — and, on top of it, a prefix cache that remembers the tokens it has already processed. When a new request arrives whose prompt begins with something a replica has already computed — the same long system prompt, the same retrieved document, the same conversation so far — that replica can reuse the cached KV blocks and skip the prefill step for the shared portion entirely. Prefill is the compute-heavy phase that dominates time-to-first-token; skipping it is most of the latency win in modern serving.
Here's the collision. A round-robin balancer's entire job is to make sure consecutive requests land on different replicas. A prefix cache's entire value comes from making sure requests that share a prefix land on the same replica. The balancer is optimizing for exactly the property that destroys the cache.
The "fair" balancer is the expensive one: even spreading forces every replica to recompute the same prefix that one of its neighbors is already holding.
Scatter a batch of requests that all share a 2,000-token system prompt across eight replicas, and instead of prefilling that prompt once, you prefill it eight times. You are paying for the same computation up to eight times over, and every one of those recomputations is a request waiting longer for its first token.
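The arithmetic is easy to check with a toy model. The sketch below is illustrative only: it assumes a hypothetical batch of 64 requests, counts only the tokens spent prefilling the shared prefix, and ignores cache eviction and the unique tail of each prompt.

```python
from itertools import cycle

SYSTEM_PROMPT_TOKENS = 2_000   # shared prefix from the example above
NUM_REPLICAS = 8
NUM_REQUESTS = 64              # hypothetical batch size


def prefix_prefill_tokens(assignments):
    """Count tokens spent prefilling the shared prefix.

    A replica pays for the prefix the first time it sees it and
    reuses its cached blocks afterwards (eviction is ignored).
    """
    warm = set()
    total = 0
    for replica in assignments:
        if replica not in warm:
            warm.add(replica)
            total += SYSTEM_PROMPT_TOKENS
    return total


# Round-robin touches every replica; sticky routing keeps the prefix on one.
rr = cycle(range(NUM_REPLICAS))
round_robin = [next(rr) for _ in range(NUM_REQUESTS)]
sticky = [0] * NUM_REQUESTS

print(prefix_prefill_tokens(round_robin))  # 16000
print(prefix_prefill_tokens(sticky))       # 2000
```

The sticky assignment is the other extreme, not a recommendation: it minimizes prefill but sends every request to one replica.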
Everyone converged on the same fix
What makes this more than a curiosity is that the entire serving ecosystem, working independently, arrived at the same answer in the space of about a year: route on cache locality, not just load.
SGLang shipped a cache-aware load balancer in v0.4 that keeps an approximate radix tree mirroring each worker's real prefix tree, and routes each request toward the worker most likely to already hold its prefix. LMSYS reported up to 1.9x throughput and a 3.8x improvement in cache-hit rate, with the benefit growing as you add workers, the opposite of how naive balancing degrades.
The vLLM project released a purpose-built Router in Rust that uses consistent hashing on a session key to pin a conversation to the replica holding its cache; one deployment measured 3x output tokens per second and 2x lower TTFT after enabling it.
On Kubernetes, the Gateway API Inference Extension endpoint picker hashes each request's token prefix into an in-memory map of which replica last computed it, and scrapes every vLLM replica's vllm:kv_cache_usage_perc and queue depth to score candidates.
And llm-d added the sharpest fork in the design space: an approximate scorer that predicts cache state from traffic, versus a precise one that reads each replica's actual KV block state through a KV-events stream. The project reports roughly 2.3x faster workload completion and a ~95% cut in mean TTFT against round-robin on an 8×A10G fleet.
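Of these mechanisms, session pinning by consistent hashing is the simplest to picture. The following is a minimal sketch of the general technique, not the vLLM Router's actual code: the `HashRing` class, the `vnodes` parameter, and the replica names are all hypothetical.

```python
import bisect
import hashlib


def _hash(key: str) -> int:
    """Map a string to a stable 64-bit position on the ring."""
    return int.from_bytes(hashlib.sha256(key.encode()).digest()[:8], "big")


class HashRing:
    """Consistent-hash ring; each replica owns several virtual points."""

    def __init__(self, replicas, vnodes=64):
        self._ring = sorted(
            (_hash(f"{replica}#{i}"), replica)
            for replica in replicas
            for i in range(vnodes)
        )
        self._points = [point for point, _ in self._ring]

    def route(self, session_key: str) -> str:
        # Walk clockwise to the first point past the key, wrapping at the end.
        idx = bisect.bisect_right(self._points, _hash(session_key)) % len(self._ring)
        return self._ring[idx][1]


ring = HashRing(["replica-0", "replica-1", "replica-2"])
# Every turn of the same conversation lands on the same replica.
assert ring.route("session-42") == ring.route("session-42")
```

The reason to use a ring rather than `hash(key) % n` is what happens on scale-out: adding a replica reassigns only the sessions that now fall to the new replica, so most warm caches stay useful.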
The numbers vary because the workloads do. The pattern doesn't: cache-aware routing delivers gains measured in multiples rather than the single-digit percentages typical of balancer tuning, and it does so precisely on the traffic that hurts most: long shared system prompts, RAG over common documents, multi-turn chat.
The part everyone gets wrong second
If you stop at "route to the replica with the cache," you've traded one failure for another. Send every request for a popular prefix to the one replica that holds it and you build a hotspot: that replica saturates while the rest of the fleet idles. Pure cache affinity is just a different way to balance badly.
So the actual insight — the one worth carrying even if you never deploy any of these routers — is that inference load balancing is a two-objective problem, and the correct router optimizes cache-hit rate subject to a bound on load imbalance. It maximizes reuse until doing so would skew the fleet past a threshold, then it spreads.
SGLang makes this legible by exposing it as literal knobs. Its default policy is cache_aware, and it ships three parameters: a cache-threshold of 0.3 (follow the cache when the prefix match rate clears that bar), a balance-abs-threshold of 64, and a balance-rel-threshold of 1.5 (but fall back to shortest-queue routing once the busiest and least-busy replicas differ by more than 64 requests and by more than 1.5x). That single line of configuration is the thesis: be greedy about the cache, but never so greedy that you build a hotspot. NVIDIA's Dynamo KV Smart Router frames the same trade as a cost function blending prefix overlap against decode load, and, per Baseten, held an 89% prefix-cache hit rate across four replicas while running ~2x faster than round-robin.
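The decision those three thresholds encode can be sketched in a few lines. This is an illustration, not SGLang's implementation: the `Replica` type and `pick_replica` function are hypothetical, the fleet is treated as imbalanced only when both the absolute and the relative gap are exceeded (my reading of the router's documentation, so treat it as an assumption), and the fallback for a weak cache match is simplified to least-loaded.

```python
from dataclasses import dataclass


@dataclass
class Replica:
    name: str
    running: int         # in-flight requests on this replica
    match_ratio: float   # fraction of the prompt believed cached here, 0.0 to 1.0


def pick_replica(replicas, cache_threshold=0.3, abs_threshold=64, rel_threshold=1.5):
    loads = [r.running for r in replicas]
    lo, hi = min(loads), max(loads)

    # Fairness guardrail: assumed to trip only when both gaps are exceeded.
    imbalanced = (hi - lo) > abs_threshold and hi > rel_threshold * lo
    if imbalanced:
        return min(replicas, key=lambda r: r.running)

    # Otherwise be greedy about the cache if the match is strong enough.
    best = max(replicas, key=lambda r: r.match_ratio)
    if best.match_ratio > cache_threshold:
        return best

    # Weak match: nothing worth chasing, so spread (simplified fallback).
    return min(replicas, key=lambda r: r.running)


fleet = [Replica("a", running=10, match_ratio=0.9), Replica("b", running=8, match_ratio=0.0)]
print(pick_replica(fleet).name)  # a
```

With the same cache state but `a` at 100 in-flight requests, the guardrail trips and the request goes to `b` instead, paying one prefill to avoid a hotspot.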
What to actually do
If you serve a single replica, none of this applies; enjoy your prefix cache and move on. The moment you run two or more, three things follow.
First, drop the assumption that your existing ingress can do this. An L4/L7 balancer — an ALB, a plain Kubernetes Service, nginx — cannot see prompt prefixes or per-replica cache state, so it will silently do the wrong thing no matter how you tune it. You need a router built for the job: SGLang's sgl-router, the vLLM Router, or the Gateway API Inference Extension picker (which also underlies GKE's Inference Gateway).
Second, if you can, prefer a router that reads real cache state over one that guesses. llm-d's approximate-vs-precise split is the design axis that matters most as fleets grow: predicting cache contents from past traffic is cheap and often good enough, but reading actual KV block state removes the guesswork at the cost of a metrics or events pipeline.
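To make the "guessing" side concrete, here is a minimal sketch of an approximate prefix index kept inside the router. It is not llm-d's or any other project's code: `BLOCK_SIZE`, `block_hashes`, and `ApproximatePrefixIndex` are hypothetical names, and the index only knows what the router itself has routed, so it never learns about evictions.

```python
import hashlib

BLOCK_SIZE = 16  # tokens per block; hypothetical value


def block_hashes(token_ids):
    """Chain-hash full blocks so each hash identifies the whole prefix up to it."""
    hashes, prev = [], b""
    full = len(token_ids) - len(token_ids) % BLOCK_SIZE
    for start in range(0, full, BLOCK_SIZE):
        block = list(token_ids[start:start + BLOCK_SIZE])
        prev = hashlib.sha256(prev + repr(block).encode()).digest()
        hashes.append(prev)
    return hashes


class ApproximatePrefixIndex:
    def __init__(self):
        self._holders = {}  # block hash -> replicas believed to hold that block

    def record(self, token_ids, replica):
        """Call after routing a request: assume the replica now caches its prompt."""
        for h in block_hashes(token_ids):
            self._holders.setdefault(h, set()).add(replica)

    def matched_blocks(self, token_ids):
        """Return {replica: number of leading blocks believed cached there}."""
        matched, candidates = {}, None
        for depth, h in enumerate(block_hashes(token_ids), start=1):
            holders = self._holders.get(h, set())
            candidates = holders if candidates is None else candidates & holders
            if not candidates:
                break
            for replica in candidates:
                matched[replica] = depth
        return matched
```

A precise variant would keep the same lookup but update `_holders` from events the replicas emit when they store or evict blocks, instead of from the router's own routing history. That is the pipeline cost the paragraph above refers to.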
Third — and this is the mental-model correction — stop thinking of the balancer as a fairness device and start thinking of it as a cache-affinity device with a fairness guardrail. The knobs that matter are not "how evenly are requests spread" but "how aggressively do I chase the cache before I'm forced to spread." Get that inversion right and the economics of self-hosting move under you: the same GPUs serve more, faster, because you finally stopped paying to compute the same prefix a dozen times.



