For years, the honest answer to "what do I serve my open model with?" started with Hugging Face's Text Generation Inference. That era ended quietly. TGI went into maintenance mode in December 2025 (bug fixes only, no new features), and on March 21, 2026 the repository was archived as read-only. The README now does something unusual for a piece of infrastructure: it points you at the competition, recommending vLLM, SGLang, llama.cpp, and MLX for anything new. Hugging Face decided it was cheaper to fund the engines that won than to keep running its own.
That decision is the real headline. The self-hosted inference field didn't fragment into a dozen options — it consolidated. For general-purpose GPU serving, three engines now matter, and they are all Apache-2.0: vLLM, SGLang, and LMDeploy.
The same ceiling, reached from opposite directions
Here is the finding that should reframe how you shop. With Llama 3.1 8B on an H100, independent benchmarks put SGLang and LMDeploy in a near dead heat at about 16,200 tokens per second, roughly 29% ahead of vLLM's 12,500 or so (AIMultiple, Spheron).
What makes that interesting isn't the gap. It's that the two leaders got there from architecturally opposite places. SGLang is Python plus hand-tuned native kernels, organized around RadixAttention — a prefix cache that reuses the key/value state of shared prompt prefixes across requests. LMDeploy's TurboMind is a pure-C++ engine from the InternLM team that removes the Python interpreter from the hot path entirely. One optimized the memory pattern; the other deleted the language overhead. They arrive within 0.6% of each other.
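To make the prefix-reuse idea concrete, here is a minimal sketch of a token-level prefix cache. It is not SGLang's implementation: the class and method names are hypothetical, each trie node merely stands in for one token's cached key/value state, and a real radix cache also compresses token runs and evicts entries under memory pressure.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """One trie node; a stand-in for the cached KV state of one token."""
    children: dict = field(default_factory=dict)


class ToyPrefixCache:
    """Hypothetical token-level trie illustrating prefix reuse across requests."""

    def __init__(self):
        self.root = Node()

    def match_and_insert(self, tokens):
        """Return how many leading tokens were already cached, then cache the rest."""
        node = self.root
        reused = 0
        for tok in tokens:
            child = node.children.get(tok)
            if child is None:
                # Cache miss: from here on, every token needs fresh computation.
                child = Node()
                node.children[tok] = child
            else:
                # Cache hit: this token's state can be reused instead of recomputed.
                reused += 1
            node = child
        return reused


cache = ToyPrefixCache()
system_prompt = ["You", "are", "a", "helpful", "assistant", "."]
first = cache.match_and_insert(system_prompt + ["Hi"])
second = cache.match_and_insert(system_prompt + ["Summarize", "this"])
print(first, second)  # prints "0 6"
```

The second request reuses all six system-prompt tokens and only computes its own two, which is the saving that grows with the length of the shared prefix.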
When two engines built on opposite principles crest at nearly the same throughput, the kernel math has been commoditized, and what's left to win is orchestration.
Why the 29% is a trap
The temptation is to read "29% faster" and route everything to SGLang or LMDeploy. Don't — not on that number alone. The gap is a small-model artifact. Push to a 70B-class model and the three engines converge to within a few percent of each other. The reason is physics, not code: at 8B on an H100 you are orchestration-bound — the bottleneck is how fast the engine can schedule, batch, and shuffle tokens, so a tighter scheduler wins. At 70B you become memory-bandwidth-bound — every engine is waiting on the same HBM, and no amount of C++ buys you around the wall. The benchmark that sells the difference is measured exactly where the difference exists.
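A back-of-the-envelope calculation shows why the gap closes. The sketch below assumes each decode step must stream every weight from HBM once, and compares that time against a fixed per-step engine overhead. All three constants are assumptions for illustration, not figures from the benchmarks above: an approximate H100 SXM bandwidth of 3,350 GB/s, FP8 weights (one byte per parameter, so a 70B model fits in 80 GB), and a hypothetical 1 ms of scheduling overhead. It ignores KV-cache reads, compute time, and batching effects.

```python
# Assumed, illustrative inputs (not taken from the cited benchmarks).
H100_HBM_GB_PER_S = 3350.0   # approximate H100 SXM spec-sheet bandwidth
BYTES_PER_PARAM = 1.0        # assumes FP8 weights
SCHEDULER_OVERHEAD_MS = 1.0  # hypothetical fixed per-step engine overhead


def weight_stream_ms(params_billion, bytes_per_param=BYTES_PER_PARAM,
                     hbm_gb_per_s=H100_HBM_GB_PER_S):
    """Time to read every weight from HBM once, in milliseconds."""
    weight_gb = params_billion * bytes_per_param  # 1e9 params * bytes = GB
    return weight_gb / hbm_gb_per_s * 1000.0


def overhead_share(params_billion, overhead_ms=SCHEDULER_OVERHEAD_MS):
    """Fraction of a decode step spent on engine overhead rather than HBM reads."""
    return overhead_ms / (overhead_ms + weight_stream_ms(params_billion))


for size in (8, 70):
    print(f"{size}B: {weight_stream_ms(size):.1f} ms streaming, "
          f"{overhead_share(size):.0%} overhead")
```

Under these assumptions, engine overhead is roughly 30% of a step at 8B and about 5% at 70B. A scheduler that halves its overhead moves the first number a lot and the second hardly at all.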
So "which is fastest" is the wrong question. The right one is: which specialization survives contact with your actual workload?
Choosing by shape, not by leaderboard
- vLLM: the lowest-regret default. From UC Berkeley's Sky Computing Lab, it supports 200+ model architectures and the widest quantization matrix in the field (FP8, INT4/INT8, GPTQ/AWQ, GGUF, NVFP4). It gets new models on day one and needs no compilation step. If you have no specific reason to optimize, this is the pick, and with Hugging Face's own README recommending it, this is where former TGI users now land by default. (If your shortlist also includes a lightweight local-first option, that's a different axis; see vLLM vs SGLang vs Ollama.)
- SGLang: for prefix-heavy traffic. Multi-turn chat, agent loops, and anything with a fat shared system prompt is where RadixAttention earns its keep, because the repeated prefix stops being recomputed on every call. It also has strong structured-output support, and it runs in production at xAI, Cursor, LinkedIn, and others, a real signal about where it holds up at scale.
- LMDeploy: for quantized serving on scarce GPUs. TurboMind is built Int4-first, with online int8/int4 KV-cache quantization. By the project's own numbers, 4-bit inference runs about 2.4x faster than FP16, with up to 1.8x higher request throughput than vLLM. When the job is "fit this large model onto one GPU I can actually rent," it's the sharpest tool on the bench.
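The "fit it onto one GPU" argument comes down to simple arithmetic on weight storage. The sketch below is a rough estimate under stated assumptions: the function names are hypothetical, the 80 GB figure is an assumed GPU memory size, the 20% headroom for KV cache and activations is a placeholder, and the small overhead of quantization scales and zero-points is ignored.

```python
def weight_gb(params_billion, bits_per_weight):
    """Approximate weight storage in GB (1e9 params * bits / 8 bits per byte)."""
    return params_billion * bits_per_weight / 8


def fits_on_gpu(params_billion, bits_per_weight, gpu_gb, reserve_fraction=0.2):
    """True if the weights fit after reserving headroom for KV cache and activations.

    reserve_fraction is a hypothetical placeholder; real headroom depends on
    context length, batch size, and the engine's memory settings.
    """
    usable_gb = gpu_gb * (1 - reserve_fraction)
    return weight_gb(params_billion, bits_per_weight) <= usable_gb


GPU_GB = 80  # assumed single-GPU memory size

for bits in (16, 8, 4):
    size = weight_gb(70, bits)
    verdict = "fits" if fits_on_gpu(70, bits, GPU_GB) else "does not fit"
    print(f"70B at {bits}-bit: {size:.0f} GB of weights, {verdict} in {GPU_GB} GB")
```

With these placeholder numbers, only the 4-bit variant of a 70B model (about 35 GB of weights) leaves room to serve from a single 80 GB card, which is the situation an Int4-first engine is designed for.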
The bet you're actually placing
Pick an engine in 2026 and you're not betting on speed — the peak numbers converge exactly where your models get big enough to matter. You're betting on an optimization axis: breadth (vLLM), prefix reuse (SGLang), or quantization density (LMDeploy). All three are permissively licensed, all three ship continuous batching and paged/radix attention and FP8/INT4, and the platform that used to sell you a fourth option is now paying two of these teams to keep going.
The field didn't crown a winner. It agreed on the shape of the problem — and split the remaining work three ways.



