For a year the cheap-GPU conversation has been stuck on the wrong workload. "Spot instances are terrifying," the lore goes, "the cloud yanks them mid-run and you lose everything." That is a true story — about training. A preempted training job that wasn't checkpointing throws away hours of gradient descent, and the engineering to survive it (frequent checkpoints, resume logic, elastic schedulers) is real work.
Inference is not that workload, and the difference is the whole argument. A stateless inference replica holds nothing you can't rebuild in seconds. When the cloud reclaims it, you lose the requests currently in flight and a KV cache that was going to be evicted anyway. There is no checkpoint because there is no progress to save. So the thing that makes spot scary for training is categorically absent for serving — and almost nobody prices that in.
What you're actually buying, and what it costs you back
The discounts are real and large. AWS markets EC2 Spot at up to 90% off On-Demand; Google quotes 60–91% for Spot VMs. Those are ceilings, not the number you'll see on an H100, but even the realistic middle of that range changes the unit economics of a serving fleet.
The price of admission is the interruption contract. AWS gives you a two-minute warning, delivered through EventBridge and the instance metadata service, and is explicit that it's best-effort — occasionally the instance goes before the notice lands. GCP is blunter still: about thirty seconds of soft-off before the machine is gone. And Google's older preemptible VMs carry a hard 24-hour cap on top of that, which is why Google now steers new workloads to Spot VMs with no maximum runtime.
For training, thirty seconds is an insult. For stateless inference, thirty seconds is plenty — it's longer than most requests take to finish, which means a well-behaved replica can drain cleanly inside the notice window almost every time.
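For the AWS side of that contract, here is a minimal sketch of what "listening for the notice" can look like from inside the instance. It polls the instance metadata service's `spot/instance-action` path using an IMDSv2 session token, and treats a 404 as "no interruption scheduled". The function names, the 60-second token TTL, and the one-second timeout are illustrative assumptions, not anything a particular serving stack prescribes.

```python
import json
import urllib.error
import urllib.request
from datetime import datetime, timezone

IMDS = "http://169.254.169.254/latest"  # EC2 instance metadata service


def fetch_instance_action(timeout=1.0):
    """Return the raw instance-action JSON, or None if nothing is scheduled.

    Only works on an EC2 instance. Function name and timeout are assumptions.
    """
    token_req = urllib.request.Request(
        f"{IMDS}/api/token",
        method="PUT",
        headers={"X-aws-ec2-metadata-token-ttl-seconds": "60"},
    )
    token = urllib.request.urlopen(token_req, timeout=timeout).read().decode()
    req = urllib.request.Request(
        f"{IMDS}/meta-data/spot/instance-action",
        headers={"X-aws-ec2-metadata-token": token},
    )
    try:
        return urllib.request.urlopen(req, timeout=timeout).read().decode()
    except urllib.error.HTTPError as err:
        if err.code == 404:  # no interruption notice has been issued
            return None
        raise


def seconds_until_interruption(body, now=None):
    """Parse a notice like {"action": "terminate", "time": "...Z"} into seconds left."""
    notice = json.loads(body)
    deadline = datetime.strptime(notice["time"], "%Y-%m-%dT%H:%M:%SZ")
    deadline = deadline.replace(tzinfo=timezone.utc)
    now = now or datetime.now(timezone.utc)
    return max(0.0, (deadline - now).total_seconds())
```

Because the notice is best-effort, a poller like this is a trigger for draining, not a guarantee that the full window will be available.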
The pattern: over-provision, drain, fall back
Inference doesn't need a checkpoint. It needs a bouncer at the door and a spare in the wings.
The serving stacks that take spot seriously all implement the same three moves. SkyServe, the serving layer of the open-source SkyPilot project (~10k stars, runs across 20+ clouds and Kubernetes), is the clearest statement of it. First, over-provision: run more spot replicas than your traffic needs and spread them across failure domains — different zones, regions, even clouds — so one provider reclaiming capacity can't take your whole fleet at once. Their own example replaces two on-demand replicas with three spot ones. Second, drain on notice: the instant the interruption signal fires, stop routing new requests to that replica and let the in-flight ones finish, while a replacement is provisioned in parallel. Third, fall back to on-demand when spot capacity dries up, then re-optimize back onto spot when it returns. SkyServe reports roughly 50% cheaper serving from this, more than 3× with spot replicas.
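The drain step is the one people tend to hand-wave, so here is a minimal sketch of it. This is not SkyServe's or Karpenter's implementation; `DrainableReplica` and its method names are hypothetical. The idea is just an admission gate plus an in-flight counter: once draining starts, new requests are refused (so the load balancer retries them elsewhere) and the replica waits for the counter to reach zero or the notice window to run out.

```python
import threading


class DrainableReplica:
    """Hypothetical replica wrapper: admission gate plus in-flight counter."""

    def __init__(self):
        self._lock = threading.Lock()
        self._idle = threading.Condition(self._lock)
        self._in_flight = 0
        self._draining = False

    def try_admit(self):
        """Return True and count the request, or False if we are draining."""
        with self._lock:
            if self._draining:
                return False
            self._in_flight += 1
            return True

    def finish(self):
        """Call once for every admitted request when it completes."""
        with self._idle:
            self._in_flight -= 1
            if self._in_flight == 0:
                self._idle.notify_all()

    def drain(self, deadline_s):
        """Stop admitting, then wait for in-flight work up to deadline_s seconds.

        Returns True if the replica emptied in time, False if the deadline hit.
        """
        with self._idle:
            self._draining = True
            return self._idle.wait_for(
                lambda: self._in_flight == 0, timeout=deadline_s
            )
```

In practice `deadline_s` would be set a little below the notice window (say, 25 of GCP's 30 seconds) to leave room for shutdown, and replacement provisioning would be kicked off before `drain` is called rather than after it returns.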
None of this is SkyPilot-specific. On Kubernetes, Karpenter does the same dance: it watches the interruption queue and, on the two-minute notice, "begins draining the node while in parallel provisioning a new node." The one non-negotiable is instance-type flexibility — give the scheduler a dozen acceptable GPU shapes, not one, or it can't find replacement capacity when your preferred type is exactly the type the cloud just reclaimed. And the autoscaling signal has to be right: Ray Serve scales on in-flight request count, not CPU, because LLM serving is queue-bound, and a CPU-based autoscaler will be blind to the only metric that matters.
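To make the autoscaling point concrete, here is a simplified sketch of scaling on in-flight requests. It is not Ray Serve's actual algorithm, which adds smoothing and delays; the function and parameter names are assumptions. The replica count follows the total number of ongoing requests divided by a per-replica target, clamped to a floor and a ceiling.

```python
import math


def desired_replicas(ongoing_requests, target_per_replica,
                     min_replicas, max_replicas):
    """Replica count that keeps each replica near its in-flight target."""
    raw = math.ceil(ongoing_requests / target_per_replica)
    return max(min_replicas, min(max_replicas, raw))


# 90 requests in flight, 8 per replica is comfortable -> 12 replicas
print(desired_replicas(90, 8, min_replicas=2, max_replicas=20))
```

Notice that CPU utilization never appears. A replica whose GPU is saturated with a deep queue can show an unremarkable CPU number, which is exactly the blindness described above.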
The tax nobody mentions: cold start
Here's the part that inverts the usual instinct. The intuitive move is "use cheap interruptible compute to absorb traffic spikes" — scale to zero when quiet, scale up on demand. For LLM serving that is often the worst possible fit, and the reason is cold start.
Every fresh spot node has to load the model into GPU memory before it can serve a single token. A 70B model in FP16 is about 140GB. From network-attached storage at a few hundred MB/s, that's minutes of a GPU sitting idle but billed; even from fast local NVMe it's tens of seconds. Each preemption-and-replacement pays that toll again. So if your traffic is bursty and you scale to zero and back, you can spend more on cold-start idle time than you saved on the discount — the cheapest GPU-hour quietly becomes the most expensive token.
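The arithmetic behind those load times is short enough to write down. This is a back-of-envelope sketch that counts only raw weight transfer and ignores deserialization, CUDA initialization, and warmup. The 0.3 GB/s and 5 GB/s bandwidths are assumed stand-ins for "a few hundred MB/s" and "fast local NVMe".

```python
def load_seconds(params_billions, bytes_per_param, bandwidth_gb_per_s):
    """Seconds to move the weights, treating 1 GB as 1e9 bytes."""
    size_gb = params_billions * bytes_per_param  # 70B * 2 bytes = 140 GB
    return size_gb / bandwidth_gb_per_s


network = load_seconds(70, 2, 0.3)  # assumed network storage throughput
nvme = load_seconds(70, 2, 5)       # assumed local NVMe throughput
print(f"network: {network / 60:.1f} min, NVMe: {nvme:.0f} s")
```

Under those assumptions the network path comes out near eight minutes and the NVMe path under half a minute, which is the gap between "minutes" and "tens of seconds" above.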
Which means spot rewards the opposite of what people usually reach for it to do. It pays off on steady, high-utilization fleets where replicas stay warm for hours and a preemption is a rare reroute, not a constant reload. It punishes spiky, scale-to-zero designs. The mitigations are getting good: vLLM's sleep mode plus GPU memory snapshots cut cold-start time severalfold, and weight-streaming loaders do similar. They're worth wiring in. But they shrink the tax; they don't repeal it. (If scale-to-zero is your real constraint, that's a different problem with different tools.)
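A rough way to see where the line falls is to price the cold start into the hourly rate. The sketch below assumes every warm stretch of serving is preceded by one fully billed cold start, and it compares against the on-demand list price of a replica that is already warm. The prices (10.0 on-demand, 4.0 spot) and the eight-minute cold start are made-up illustration values, not quotes from any provider.

```python
def effective_price(price_per_hour, warm_minutes, cold_start_minutes):
    """Price per hour of actual serving when each warm stretch costs one cold start."""
    billed_minutes = warm_minutes + cold_start_minutes
    return price_per_hour * billed_minutes / warm_minutes


def break_even_warm_minutes(spot_price, on_demand_price, cold_start_minutes):
    """Warm time per cold start at which spot's effective price equals on-demand."""
    return cold_start_minutes * spot_price / (on_demand_price - spot_price)


ON_DEMAND, SPOT, COLD = 10.0, 4.0, 8.0  # hypothetical prices, minutes of cold start

print(effective_price(SPOT, warm_minutes=240, cold_start_minutes=COLD))  # steady fleet
print(effective_price(SPOT, warm_minutes=5, cold_start_minutes=COLD))    # bursty fleet
print(break_even_warm_minutes(SPOT, ON_DEMAND, COLD))
```

With these numbers a replica that stays warm for four hours barely notices the toll, while one that serves five minutes per cold start ends up above the on-demand price. Shortening the cold start moves the break-even point, but it does not remove it.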
The honest caveat: availability
The last asterisk is that the discount only matters if the capacity exists. On the hyperscalers, the scarcest GPUs — H100-class — are filled into on-demand and reserved first, so spot H100 is frequently unavailable rather than merely interruptible. The teams getting steady spot economics tend to be on neoclouds with genuine interruptible tiers, or on Kubernetes pools flexible enough to take whatever GPU is cheap right now. Spot is not a coupon you clip once; it's a posture — over-provisioned, multi-zone, drain-ready, with an on-demand floor underneath. Build it that way and interruptible inference is one of the few places in this industry where the scary-sounding option is actually the disciplined one. Pair it with the right autoscaling on Kubernetes and the question stops being "is spot safe" and becomes "why is anything steady-state running on-demand."



