You typed "MIG vs MPS vs time-slicing" because you have one expensive GPU and more than one thing that wants it. The framing implies a ladder: one of these is the grown-up answer and the other two are compromises. It isn't a ladder. They are three different answers to a single question: what happens at the moment two tenants collide on the same silicon? MIG says they never touch. MPS says they run at once and mostly behave. Time-slicing says they take turns and hope.

Pick by that collision, not by a utilization dashboard. And before you pick, check whether you should be sharing at all.

The default is the trap

In Kubernetes, the path of least resistance is time-slicing. It is three lines in a ConfigMap, it works on every NVIDIA card you own, and the device plugin will happily advertise one H100 as ten "GPUs." It looks like free capacity.

Under the hood it is round-robin context switching. The driver gives one process the GPU for a small quantum, on the order of a millisecond or two, then preempts it and hands the card to the next. There is no memory isolation: two replicas share the same HBM, and one can OOM the other into a crash loop. There is no fault isolation either.

For training experiments or notebooks, little of that matters. For inference it is close to a worst case, and the reason is specific. LLM serving has two phases with opposite characters: a compute-bound prefill over the prompt, then a long, latency-sensitive decode that emits one token at a time. Decode barely uses the GPU's compute, because it is waiting on memory bandwidth, but it needs to be resumed promptly to keep inter-token latency smooth. Serialize it against a noisy neighbor and your tokens arrive in clumps. The throughput chart looks busy; the user watches a cursor stutter. (This is also why continuous batching exists, and why it argues against sharing at all; more on that below.)
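To see the clumping in numbers, here is a toy model rather than a measurement. It assumes strict round-robin with every neighbor always using its full quantum, zero context-switch cost, and a decode step that can be preempted and resumed mid-flight. The function names and the 500 µs step / 2 ms quantum figures are illustrative assumptions, not driver values.

```python
def token_finish_times_us(step_us, quantum_us, tenants, tokens):
    """Wall-clock finish time of each decode step for one tenant.

    Toy model: the tenant owns the GPU for `quantum_us`, then waits while
    the other `tenants - 1` processes each burn a full quantum.
    All times are integer microseconds to keep the arithmetic exact.
    """
    period_us = quantum_us * tenants      # one full trip around the ring
    finish = []
    gpu_used_us = 0                       # GPU time this tenant has consumed
    for _ in range(tokens):
        gpu_used_us += step_us
        full_quanta, remainder = divmod(gpu_used_us, quantum_us)
        if remainder == 0:
            # Step ended exactly as our quantum expired.
            wall = (full_quanta - 1) * period_us + quantum_us
        else:
            # Step ended partway through our next quantum.
            wall = full_quanta * period_us + remainder
        finish.append(wall)
    return finish


def gaps_us(finish):
    """Inter-token gaps, counting the first token from time zero."""
    return [b - a for a, b in zip([0] + finish, finish)]


if __name__ == "__main__":
    for tenants in (1, 4):
        g = gaps_us(token_finish_times_us(500, 2000, tenants, 8))
        print(f"{tenants} tenant(s): gaps in us = {g}")
```

Under these assumptions a lone tenant emits a token every 500 µs. With four tenants, most gaps are still 500 µs, but every fourth gap stretches to 6,500 µs while the neighbors take their turns. Average throughput looks tolerable; the stream the user sees is a burst, a stall, a burst.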

MIG: walls you can't move

Multi-Instance GPU carves the physical die into instances, each with its own slice of compute, cache, and memory controllers. An instance is a real, hardware-isolated mini-GPU: separate memory, separate fault domain. A kernel that faults in one instance does not touch the others. This is the only one of the three that lets you back a per-tenant SLA with hardware guarantees, which is exactly what you want when the tenants are different customers.

The cost is rigidity. The profiles are a fixed menu — on an H100 you choose among slices like 1g.10gb, 2g.20gb, 3g.40gb — capped at seven instances per GPU, and you cannot resize an instance without destroying and recreating it (which evicts whatever was running). Worse for LLMs: each instance sees only its slice of HBM. A 3g.40gb instance has 40 GB, full stop. That caps both the model you can load and the batch you can build, and the rounding between profile sizes quietly strands gigabytes you paid for.
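The stranding is easy to put a number on. The sketch below picks the smallest profile that fits a given memory need and reports what is left over. The table uses nominal sizes for the three profiles named above plus 7g.80gb (the full-card profile on an 80 GB H100, added here as an assumption); usable memory inside a real instance is somewhat lower, and `smallest_fit` is a hypothetical helper, not an NVIDIA API.

```python
# Nominal memory per MIG profile, in GB. Illustrative subset only.
PROFILES_GB = {
    "1g.10gb": 10,
    "2g.20gb": 20,
    "3g.40gb": 40,
    "7g.80gb": 80,   # assumption: full-card profile on an 80 GB H100
}


def smallest_fit(need_gb):
    """Return (profile_name, stranded_gb) for the smallest profile that
    holds `need_gb`, or None if nothing on the menu is big enough."""
    fitting = [(gb, name) for name, gb in PROFILES_GB.items() if gb >= need_gb]
    if not fitting:
        return None
    gb, name = min(fitting)          # smallest memory size wins
    return name, gb - need_gb


if __name__ == "__main__":
    for need in (8, 24, 45, 90):
        print(f"need {need} GB ->", smallest_fit(need))
```

A workload that needs 24 GB lands in 3g.40gb and strands 16 GB; one that needs 45 GB has to take the whole card and strands 35 GB. There is no profile in between to ask for.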

MPS: concurrency with a shared fate

Multi-Process Service is the one people forget. Instead of taking turns, multiple processes submit kernels that execute concurrently through a single service — so two small models that each use 30% of the GPU can genuinely run at the same time and approach full utilization. On Volta and later, each client gets its own isolated address space, so a stray pointer in one model doesn't corrupt another's memory.

The catch is the failure domain. A fatal GPU fault in one client can take down the shared MPS server, and when it goes, every co-located client goes with it. One bad deploy, and the blast radius is the whole card. NVIDIA still labels device-plugin MPS support experimental. MPS is the right tool when your co-tenants are your own trusted small models (an embedder, a reranker, a guard model), not arbitrary tenants you'd hate to take down together.

The better your serving stack, the less you should partition the hardware underneath it.

The answer the question hides

Here is the part the comparison tables omit. Modern LLM serving engines are built to own the whole GPU. Continuous batching works by keeping a large, dynamically changing batch resident and packing every spare cycle of the memory-bound decode phase with other requests' work. To do that it wants all the streaming multiprocessors and, above all, all the HBM: that is where the KV cache lives, and KV cache is what actually caps your throughput.

Every partition you draw fights this. MIG hands the engine a fraction of the memory, so the KV cache shrinks and so does the batch. Time-slicing makes the engine share cycles it was counting on. The uncomfortable conclusion: if you are serving one model big enough to need the card, the correct number of tenants is one. Give it the whole GPU, turn on continuous batching, and your "utilization problem" disappears — the engine fills the card with concurrent requests instead of concurrent processes. If you need many fine-tuned variants, that is a multi-LoRA problem for the serving layer, not a hardware-partitioning problem.
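A back-of-the-envelope calculation shows how hard a memory slice squeezes the KV cache. The per-token formula (keys plus values, per layer, per KV head) is the standard one for a transformer with a plain KV cache; the model dimensions and the 14 GiB weight footprint are hypothetical stand-ins for a 7B-class fp16 model, and the sketch ignores activations, fragmentation, and runtime overhead.

```python
def kv_bytes_per_token(layers, kv_heads, head_dim, dtype_bytes=2):
    """Bytes of KV cache one token occupies: K and V for every layer."""
    return 2 * layers * kv_heads * head_dim * dtype_bytes


def max_cached_tokens(hbm_gib, weights_gib, bytes_per_token):
    """Tokens of KV cache that fit in whatever HBM the weights leave free."""
    free_bytes = (hbm_gib - weights_gib) * 1024**3
    return max(0, free_bytes // bytes_per_token)


if __name__ == "__main__":
    # Hypothetical model: 32 layers, 32 KV heads, head_dim 128, fp16.
    per_token = kv_bytes_per_token(layers=32, kv_heads=32, head_dim=128)
    weights_gib = 14  # assumed weight footprint

    whole_card = max_cached_tokens(80, weights_gib, per_token)
    mig_slice = max_cached_tokens(40, weights_gib, per_token)

    print(f"KV cache per token: {per_token / 1024:.0f} KiB")
    print(f"whole 80 GiB card : {whole_card:,} tokens")
    print(f"40 GiB slice      : {mig_slice:,} tokens")
```

Under these assumptions the whole card holds about 135,000 tokens of KV cache and the 40 GiB slice about 53,000. Halving the memory cuts the cache by roughly 60%, not 50%, because the weights cost the same either way. The batch the engine can keep resident shrinks with it.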

Sharing earns its keep in exactly three situations: a fleet of small models that each underutilize a GPU (reach for MPS), multi-tenant isolation where one customer must never starve or crash another (reach for MIG), and dev/test where nobody is paying attention to latency (time-slicing is fine, and free). Notice that none of those is "the production endpoint for my flagship model." That one doesn't want a slice. It wants the card.
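For readers who want the argument as a checklist, here is one way to encode it. This is a sketch of this article's reasoning, not an official decision procedure; the function, its parameter names, and the order of the checks are assumptions made for illustration.

```python
from enum import Enum


class Mode(Enum):
    WHOLE_GPU = "one tenant, whole GPU"
    MIG = "MIG"
    MPS = "MPS"
    TIME_SLICING = "time-slicing"


def pick_mode(needs_whole_card: bool,
              untrusted_tenants: bool,
              latency_matters: bool) -> Mode:
    """Map the three sharing situations (plus 'don't share') to a mode."""
    if needs_whole_card:
        # A model big enough to need the card should not be sharing it.
        return Mode.WHOLE_GPU
    if untrusted_tenants:
        # One customer must never starve or crash another.
        return Mode.MIG
    if latency_matters:
        # Trusted small models that each underutilize the GPU.
        return Mode.MPS
    # Dev/test, notebooks: nobody is watching latency.
    return Mode.TIME_SLICING


if __name__ == "__main__":
    print(pick_mode(True, False, True).value)    # flagship endpoint
    print(pick_mode(False, True, True).value)    # multi-customer isolation
    print(pick_mode(False, False, True).value)   # embedder + reranker fleet
    print(pick_mode(False, False, False).value)  # dev/test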