The first time you put an LLM behind a Kubernetes Deployment and turn on the Horizontal Pod Autoscaler, it does nothing useful, and it does it confidently. Traffic doubles, p99 latency climbs into the seconds, and the autoscaler sits there reporting that everything is fine — because you told it to watch CPU, and the CPU is bored.
This is the core trap of serving models on Kubernetes: the default autoscaling signal is measuring the wrong machine. A pod serving a quantized Llama-3.1-70B can idle near 5% CPU utilization while its 80 GB of VRAM is saturated and a queue of requests is stacking up behind it. By the time CPU climbs high enough to trip the HPA, you are already deep into degraded latency. The autoscaler is honest; it's just answering a question nobody asked.
GPU utilization is the wrong fix
The intuitive correction is "okay, scale on GPU utilization instead." It's better than CPU, and it's still wrong — for a reason that connects directly to why LLM serving capacity is a memory problem, not a FLOPs one.
Decode is memory-bandwidth bound. The GPU spends most of a generation step shuttling weights and KV cache through memory, not saturating its compute units. So the utilization number DCGM hands you reads "busy" across a wide band of actual latency: its headline GPU-utilization field (DCGM_FI_DEV_GPU_UTIL) records only the share of time some kernel was running, not how hard the streaming multiprocessors were working. It crosses your threshold well before users feel anything and stays pinned well after, which makes it a mushy proxy for the only thing you care about: am I keeping up? And it has a blind spot it can never fix: a high utilization number cannot tell you a replica is full. Fullness is about KV-cache headroom, which the utilization dial doesn't see.
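To make "memory-bandwidth bound" concrete, here is a back-of-the-envelope roofline sketch. It assumes every weight is read once per decode step and that each parameter costs about 2 FLOPs per sequence in the batch; it ignores KV-cache traffic entirely. The hardware numbers are round, illustrative assumptions, not measurements of any particular GPU.

```python
def decode_step_times(params, bytes_per_param, batch_size, mem_bandwidth, peak_flops):
    """Return (memory_seconds, compute_seconds) for one decode step.

    Simplified roofline: weights are streamed from memory once per step,
    and each parameter costs ~2 FLOPs per sequence in the batch.
    KV-cache reads are ignored, so memory_seconds is a lower bound.
    """
    memory_seconds = params * bytes_per_param / mem_bandwidth
    compute_seconds = 2 * params * batch_size / peak_flops
    return memory_seconds, compute_seconds


def is_memory_bound(params, bytes_per_param, batch_size, mem_bandwidth, peak_flops):
    """True when moving the weights takes longer than the arithmetic."""
    mem, comp = decode_step_times(
        params, bytes_per_param, batch_size, mem_bandwidth, peak_flops
    )
    return mem > comp


# Illustrative assumptions only (hypothetical model and accelerator):
PARAMS = 70e9            # 70B parameters
BYTES_PER_PARAM = 1      # 8-bit weights
MEM_BANDWIDTH = 3.0e12   # bytes/second, assumed
PEAK_FLOPS = 1.0e15      # FLOP/second, assumed

if __name__ == "__main__":
    for batch in (1, 16, 64, 512):
        mem, comp = decode_step_times(
            PARAMS, BYTES_PER_PARAM, batch, MEM_BANDWIDTH, PEAK_FLOPS
        )
        bound = "memory" if mem > comp else "compute"
        print(f"batch={batch:4d} memory={mem * 1e3:.2f} ms "
              f"compute={comp * 1e3:.2f} ms -> {bound}-bound")
```

Under these assumptions the step stays memory-bound until the batch reaches the low hundreds, which is why a kernel can be "running" the whole time while the compute units mostly wait on memory.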
Scale on the work that's waiting
The signal that doesn't lie is the one the inference server already computes to schedule itself: the request queue. vLLM publishes it on its Prometheus endpoint as vllm:num_requests_waiting — a gauge of requests that have been admitted but not yet scheduled onto the GPU. A rising waiting-queue is the earliest truthful sign that your current replicas can't drain work as fast as it arrives.
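If you want to eyeball that gauge before wiring up an autoscaler, a minimal sketch like the one below reads it out of Prometheus text-format output. The sample payload is hand-written for illustration (the model_name label value is hypothetical), and read_gauge is a helper invented here, not part of vLLM or any Prometheus client library.

```python
def read_gauge(metrics_text, name):
    """Sum all samples of `name` in Prometheus text exposition format.

    Returns None when the metric is absent. This is a deliberately small
    parser for illustration, not a full implementation of the format.
    """
    total = 0.0
    found = False
    for raw in metrics_text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blanks and HELP/TYPE comments
        if "{" in line:
            # Labelled sample: name{label="value",...} value [timestamp]
            metric = line[: line.index("{")]
            rest = line[line.rindex("}") + 1:]
        else:
            # Unlabelled sample: name value [timestamp]
            metric, _, rest = line.partition(" ")
        if metric != name:
            continue
        fields = rest.split()
        if not fields:
            continue
        total += float(fields[0])
        found = True
    return total if found else None


# Hand-written sample, not captured from a real server.
SAMPLE = """\
# TYPE vllm:num_requests_waiting gauge
vllm:num_requests_waiting{model_name="example-model"} 7.0
# TYPE vllm:num_requests_running gauge
vllm:num_requests_running{model_name="example-model"} 32.0
"""

if __name__ == "__main__":
    print(read_gauge(SAMPLE, "vllm:num_requests_waiting"))
```

In a real deployment the text would come from the server's /metrics endpoint, and Prometheus would do the scraping for you; the point here is only that the number is already there to be read.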
Utilization tells you how busy the hardware looks. Queue depth tells you whether you're losing. Autoscale on the second one.
Scaling on queue depth is also proactive in the way the others aren't: the queue grows the instant arrival rate outpaces service rate, before latency has fully collapsed, so the new replica is spinning up while you still have a buffer. A workable target is "keep average waiting requests per replica below N," tuned to your latency budget.
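The "below N per replica" rule maps onto the arithmetic HPA applies to an average-value target: divide the fleet-wide total by the per-replica target and round up. A minimal sketch, assuming the bounds and the target of 5 are placeholders you would tune, and leaving out HPA's tolerance band and stabilization windows:

```python
import math


def desired_replicas(total_waiting, target_per_replica, min_replicas=1, max_replicas=8):
    """Replica count that keeps average waiting requests per replica at or below target.

    Mirrors the average-value calculation: ceil(total / target),
    clamped to [min_replicas, max_replicas]. The defaults are
    illustrative assumptions, not recommendations.
    """
    if target_per_replica <= 0:
        raise ValueError("target_per_replica must be positive")
    wanted = math.ceil(total_waiting / target_per_replica)
    return max(min_replicas, min(max_replicas, wanted))


if __name__ == "__main__":
    for waiting in (0, 4, 23, 100):
        print(waiting, "waiting ->", desired_replicas(waiting, target_per_replica=5))
```

A lower target buys latency headroom at the cost of more idle GPU time; a higher one does the reverse.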
The saturation metric is a different metric
Here is the part most autoscaling configs muddle: queue depth and the other number everyone quotes, vllm:gpu_cache_usage_perc (the fraction, 0–1, of KV-cache blocks in use; newer vLLM releases expose it as vllm:kv_cache_usage_perc), are answering two different questions, and you need both.
Continuous batching packs more concurrent sequences onto one GPU only while there's KV-cache headroom. As cache fill approaches 1.0, that replica starts preempting or queueing no matter what its utilization looks like. So cache fill is your per-replica saturation guard — it caps concurrency and feeds your alerts. It is not your scale-out trigger. Queue depth says add a replica to the fleet; cache fill says this one replica is out of room. Wire the first to KEDA and the second to your concurrency limits and dashboards, and don't cross the streams.
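One way to keep the two signals from bleeding into each other is to give them separate functions with separate outputs. The sketch below is a toy illustration: ReplicaSample, the 0.9 cache limit, and the waiting target of 5 are all assumptions made up for the example.

```python
from dataclasses import dataclass


@dataclass
class ReplicaSample:
    name: str
    waiting: float      # from vllm:num_requests_waiting
    cache_usage: float  # from vllm:gpu_cache_usage_perc, in the range 0..1


def saturated_replicas(samples, cache_limit=0.9):
    """Per-replica guard: which replicas are out of KV-cache room.

    Feeds alerts and concurrency limits. It never changes replica count.
    """
    return [s.name for s in samples if s.cache_usage >= cache_limit]


def fleet_needs_scale_out(samples, waiting_target=5.0):
    """Fleet-level trigger: is the average waiting queue above target.

    Looks only at queue depth and deliberately ignores cache fill.
    """
    if not samples:
        return False
    average_waiting = sum(s.waiting for s in samples) / len(samples)
    return average_waiting > waiting_target


if __name__ == "__main__":
    fleet = [
        ReplicaSample("replica-a", waiting=0, cache_usage=0.95),
        ReplicaSample("replica-b", waiting=2, cache_usage=0.40),
    ]
    print("saturated:", saturated_replicas(fleet))
    print("scale out:", fleet_needs_scale_out(fleet))
```

In the sample fleet one replica is nearly full while the queue is short: that is an alerting and load-balancing problem, not a reason to buy another GPU.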
Why KEDA, and the scale-to-zero seam
Native HPA can't, by default, do the thing that makes GPU autoscaling worth it: scale to zero. An idle H100 bills like a busy one, so the whole economic case for autoscaling rests on being able to drop to zero replicas between bursts, and HPA's floor is one unless you enable the alpha HPAScaleToZero feature gate. KEDA, a CNCF graduated project, fills that gap. Its operator owns the 0↔1 transition through activationThreshold and hands 1↔N back to HPA through threshold, so a single config gives you scale-to-zero and ordinary horizontal scaling. Point its Prometheus scaler at vllm:num_requests_waiting, set a cooldownPeriod (default 300s; it governs only the final drop to zero) so a brief lull doesn't yank the fleet to zero, and you have an autoscaler watching the right number.
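As a rough sketch of what that config could look like, the snippet below builds a KEDA ScaledObject as a plain dictionary and prints it as JSON, which kubectl accepts alongside YAML. The Deployment name, the Prometheus address, the replica bounds, and the threshold of 5 are hypothetical placeholders; the PromQL query is one reasonable choice, not the only one.

```python
import json


def build_scaled_object(deployment, prometheus_url, threshold="5",
                        activation_threshold="0", min_replicas=0,
                        max_replicas=8, cooldown_seconds=300):
    """Build a KEDA ScaledObject manifest that scales on vLLM queue depth.

    All argument values are illustrative assumptions. KEDA expects trigger
    metadata values as strings, hence the quoted thresholds.
    """
    return {
        "apiVersion": "keda.sh/v1alpha1",
        "kind": "ScaledObject",
        "metadata": {"name": f"{deployment}-queue-scaler"},
        "spec": {
            "scaleTargetRef": {"name": deployment},
            "minReplicaCount": min_replicas,     # 0 enables scale-to-zero
            "maxReplicaCount": max_replicas,
            "cooldownPeriod": cooldown_seconds,  # wait before dropping to zero
            "triggers": [
                {
                    "type": "prometheus",
                    "metadata": {
                        "serverAddress": prometheus_url,
                        # Fleet-wide total of waiting requests.
                        "query": "sum(vllm:num_requests_waiting)",
                        # 1..N scaling target, handled by HPA.
                        "threshold": threshold,
                        # 0..1 activation, handled by the KEDA operator.
                        "activationThreshold": activation_threshold,
                    },
                }
            ],
        },
    }


if __name__ == "__main__":
    manifest = build_scaled_object(
        deployment="vllm-example",                       # hypothetical name
        prometheus_url="http://prometheus.example:9090", # hypothetical address
    )
    print(json.dumps(manifest, indent=2))
```

For a latency-critical service, passing min_replicas=1 keeps a warm floor and leaves the queue to scale you above it.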
One catch closes the loop. The moment you allow zero, you re-expose the cold start — the seconds-to-minutes of loading tens of gigabytes of weights into empty VRAM — as your new tail latency. Scale-to-zero is correct for bursty, latency-tolerant traffic. For anything latency-critical, keep a warm floor of one replica and only let the queue scale you above it.
The whole discipline reduces to three sentences. Autoscale on the queue. Cap concurrency on the KV cache. Never scale on the dial that looks busy, because on a memory-bound machine, busy is not the same as keeping up.



