For years, the honest answer to "what do I serve my open model with?" started with Hugging Face's Text Generation Inference (TGI). That era ended quietly. TGI went into maintenance mode in December 2025 (bug fixes only, no new features), and on March 21, 2026 the repository was archived and made read-only. The README now does something unusual for a piece of infrastructure: it points you at the competition, recommending vLLM, SGLang, llama.cpp, and MLX for anything new. Hugging Face decided it was cheaper to fund the engines that won than to keep running its own.

That decision is the real headline. The self-hosted inference field didn't fragment into a dozen options — it consolidated. For general-purpose GPU serving, three engines now matter, and they are all Apache-2.0: vLLM, SGLang, and LMDeploy.

The same ceiling, reached from opposite directions

Here is the finding that should reframe how you shop. With Llama 3.1 8B on a single H100, independent benchmarks put SGLang and LMDeploy in a near dead heat at roughly 16,200 tokens per second, about 29% ahead of vLLM's roughly 12,500 (AIMultiple, Spheron).
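The gap is worth checking by hand, because the direction of the comparison changes the number. The sketch below uses only the two rounded throughput figures quoted above; the function names are illustrative, not from any benchmark tool.

```python
# Rounded figures quoted above (tokens per second, Llama 3.1 8B on one H100).
LEADER_TPS = 16_200  # SGLang and LMDeploy, approximately
VLLM_TPS = 12_500    # vLLM, approximately


def lead_over(fast: float, slow: float) -> float:
    """Fractional advantage of `fast`, measured against the slower engine."""
    return (fast - slow) / slow


def deficit_from(fast: float, slow: float) -> float:
    """Fractional shortfall of `slow`, measured against the faster engine."""
    return (fast - slow) / fast


lead = lead_over(LEADER_TPS, VLLM_TPS)
deficit = deficit_from(LEADER_TPS, VLLM_TPS)

print(f"leaders ahead of vLLM by {lead:.1%}")    # 29.6%
print(f"vLLM behind the leaders by {deficit:.1%}")  # 22.8%
```

The same 3,700 tokens-per-second difference reads as about 29% or about 23% depending on which engine is the baseline, which is one more reason to treat a single headline percentage with care.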

What makes that interesting isn't the gap. It's that the two leaders got there from architecturally opposite places. SGLang is Python plus hand-tuned native kernels, organized around RadixAttention — a prefix cache that reuses the key/value state of shared prompt prefixes across requests. LMDeploy's TurboMind is a pure-C++ engine from the InternLM team that removes the Python interpreter from the hot path entirely. One optimized the memory pattern; the other deleted the language overhead. They arrive within 0.6% of each other.
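To make the prefix-reuse idea concrete, here is a toy sketch of a token-level prefix cache. It is not SGLang's implementation: the real RadixAttention manages actual key/value tensors and eviction, while this hypothetical `ToyPrefixCache` only counts how many leading tokens of a request were already seen and could, in principle, skip recomputation. The integer token ids are made up for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    # Maps a token id to the child node reached by appending that token.
    children: dict = field(default_factory=dict)


class ToyPrefixCache:
    """Token-level trie standing in for a KV prefix cache (illustrative only)."""

    def __init__(self) -> None:
        self.root = Node()

    def match_and_insert(self, tokens: list[int]) -> int:
        """Return how many leading tokens were already cached, then cache the rest."""
        node = self.root
        reused = 0
        for tok in tokens:
            child = node.children.get(tok)
            if child is None:
                # First miss: every later node on this path is freshly created,
                # so nothing after this point can count as a hit.
                child = Node()
                node.children[tok] = child
            else:
                reused += 1
            node = child
        return reused


cache = ToyPrefixCache()
system_prompt = [1, 2, 3, 4]            # hypothetical shared prefix
request_a = system_prompt + [10, 11]
request_b = system_prompt + [20]

reused_a = cache.match_and_insert(request_a)  # cold cache, nothing to reuse
reused_b = cache.match_and_insert(request_b)  # shares the 4-token prefix
print(reused_a, reused_b)
```

The second request reuses the four shared tokens and only has to compute its one new token. Workloads with long shared system prompts or multi-turn histories are where that saving compounds.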

When two engines built on opposite principles crest at nearly the same throughput, the kernel math has been commoditized; what's left to win is orchestration.

Why the 29% is a trap

The temptation is to read "29% faster" and route everything to SGLang or LMDeploy. Don't — not on that number alone. The gap is a small-model artifact. Push to a 70B-class model and the three engines converge to within a few percent of each other. The reason is physics, not code: at 8B on an H100 you are orchestration-bound — the bottleneck is how fast the engine can schedule, batch, and shuffle tokens, so a tighter scheduler wins. At 70B you become memory-bandwidth-bound — every engine is waiting on the same HBM, and no amount of C++ buys you around the wall. The benchmark that sells the difference is measured exactly where the difference exists.
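A back-of-envelope calculation shows why the gap closes. Each decode step has to stream the model weights out of HBM at least once, so weight size divided by memory bandwidth gives a floor on step time that no engine can beat. The sketch below rests on assumptions that are not from the benchmarks: roughly 3.35 TB/s of HBM bandwidth (the published figure for the H100 SXM), 1 byte per parameter (FP8 weights), no KV-cache traffic, and a purely hypothetical 0.5 ms of per-step scheduling overhead chosen only to illustrate the trend.

```python
HBM_TB_PER_S = 3.35       # assumed H100 SXM memory bandwidth
BYTES_PER_PARAM = 1.0     # assumed FP8 weights
SCHED_OVERHEAD_MS = 0.5   # hypothetical per-step engine overhead, not measured


def decode_step_floor_ms(params_billion: float) -> float:
    """Minimum time to read all weights once from HBM, in milliseconds."""
    weight_bytes = params_billion * 1e9 * BYTES_PER_PARAM
    return weight_bytes / (HBM_TB_PER_S * 1e12) * 1e3


def overhead_share(params_billion: float) -> float:
    """Fraction of a decode step spent on scheduling under these assumptions."""
    floor_ms = decode_step_floor_ms(params_billion)
    return SCHED_OVERHEAD_MS / (SCHED_OVERHEAD_MS + floor_ms)


floor_8b = decode_step_floor_ms(8)
floor_70b = decode_step_floor_ms(70)
share_8b = overhead_share(8)
share_70b = overhead_share(70)

print(f"8B : floor {floor_8b:.2f} ms, overhead share {share_8b:.1%}")
print(f"70B: floor {floor_70b:.2f} ms, overhead share {share_70b:.1%}")
```

Under these assumptions the same fixed overhead is a large slice of an 8B step and a small slice of a 70B step. Halving it with a tighter scheduler moves the 8B number visibly and the 70B number barely at all. Real deployments batch requests and shard large models across GPUs, which changes the absolute figures but not the direction.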

So "which is fastest" is the wrong question. The right one is: which specialization survives contact with your actual workload?

Choosing by shape, not by leaderboard

The bet you're actually placing

Pick an engine in 2026 and you're not betting on speed, because the peak numbers converge exactly where your models get big enough to matter. You're betting on an optimization axis: breadth (vLLM), prefix reuse (SGLang), or quantization density (LMDeploy). All three are permissively licensed, all three ship continuous batching, paged or radix attention, and FP8/INT4 quantization, and the platform that used to sell you a fourth option is now paying two of these teams to keep going.

The field didn't crown a winner. It agreed on the shape of the problem — and split the remaining work three ways.