Once your model is trained and quantized, one unglamorous piece of software decides whether your GPU bill makes sense: the serving engine. It takes a folder of weights and turns it into an endpoint that can hold hundreds of concurrent conversations without falling over, batching requests, managing the attention (KV) cache, and streaming tokens back. Three projects own this layer for open-weights models: vLLM, TensorRT-LLM, and TGI. They look interchangeable on a feature matrix, and increasingly they are. The interesting differences are the ones a feature matrix hides.
The thing they all do
Every modern serving engine is built around the same insight: a GPU running one request at a time is a GPU mostly sitting idle, memory-bound and waiting. The way to make it pay is continuous batching — packing many requests through the model together and adding or dropping sequences token by token instead of waiting for a whole batch to finish. The hard part of continuous batching is the KV cache: every active sequence holds a growing key-value cache, and naive allocation fragments GPU memory so badly you can batch far fewer requests than the math says you should.
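To see why that token-by-token scheduling matters, here is a toy simulation, not any engine's actual scheduler. It counts decode steps for a hypothetical workload under two policies: static batching, where a batch runs until its longest sequence finishes, and continuous batching, where a freed slot is refilled on the next step. The function names and the workload are made up for illustration, and the model ignores prefill cost and the fact that a step gets slower as the batch grows.

```python
from collections import deque


def static_batching_steps(lengths, batch_size):
    """Decode steps when each batch runs until its longest sequence finishes."""
    steps = 0
    for i in range(0, len(lengths), batch_size):
        steps += max(lengths[i:i + batch_size])
    return steps


def continuous_batching_steps(lengths, batch_size):
    """Decode steps when a finished sequence's slot is refilled immediately."""
    waiting = deque(lengths)
    running = []  # tokens still to generate for each active sequence
    steps = 0
    while waiting or running:
        # Admit queued requests into any free slots before the next step.
        while waiting and len(running) < batch_size:
            running.append(waiting.popleft())
        # One forward pass yields one token for every active sequence.
        steps += 1
        # Sequences that just produced their last token leave the batch.
        running = [r - 1 for r in running if r > 1]
    return steps


# Hypothetical workload: output lengths in tokens, all queued at time zero.
lengths = [100, 10, 10, 10, 100, 10, 10, 10]
static_steps = static_batching_steps(lengths, batch_size=4)
continuous_steps = continuous_batching_steps(lengths, batch_size=4)
print(static_steps, continuous_steps)
```

On this workload the static policy needs 200 steps and the continuous one 110, because a slot freed by a short request is reused right away instead of idling until the longest sequence in its batch finishes.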
The engine that solved this most influentially is vLLM.
vLLM's PagedAttention (SOSP 2023) treats the KV cache like virtual memory: instead of reserving one contiguous region per sequence, sized for the longest output it might produce, it stores the cache in fixed-size pages that need not be contiguous and allocates them only as the sequence grows. That nearly eliminates the fragmentation waste and lets you batch dramatically more requests on the same card. The paper reported 2–4x the throughput of the prior state of the art at equal latency. That single idea is why vLLM became the community default and why nearly every competitor adopted some version of it.
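The paging idea is easier to feel with numbers. Below is a toy allocator, a sketch of the bookkeeping only and not vLLM's implementation: it has no attention kernel, no sharing between sequences, and no preemption. `PagedKVCache`, `max_batch_contiguous`, and `max_batch_paged` are hypothetical names, and the capacity, page size, and sequence lengths are assumptions picked for illustration.

```python
def max_batch_contiguous(total_tokens, max_seq_len):
    """Sequences that fit if each one reserves max_seq_len slots up front."""
    return total_tokens // max_seq_len


class PagedKVCache:
    """Toy block table: each sequence maps to a list of physical page ids."""

    def __init__(self, total_tokens, page_size):
        self.page_size = page_size
        self.free_pages = list(range(total_tokens // page_size))
        self.block_tables = {}  # seq_id -> list of physical page ids
        self.lengths = {}       # seq_id -> tokens stored so far

    def append_token(self, seq_id):
        """Store one more token, taking a new page only when the last is full."""
        length = self.lengths.get(seq_id, 0)
        if length % self.page_size == 0:
            if not self.free_pages:
                return False  # out of pages: a real engine would queue or preempt
            self.block_tables.setdefault(seq_id, []).append(self.free_pages.pop())
        self.lengths[seq_id] = length + 1
        return True

    def free(self, seq_id):
        """Return a finished sequence's pages to the pool."""
        self.free_pages.extend(self.block_tables.pop(seq_id, []))
        self.lengths.pop(seq_id, None)


def max_batch_paged(total_tokens, page_size, actual_len):
    """Count how many sequences of actual_len tokens fit before pages run out."""
    cache = PagedKVCache(total_tokens, page_size)
    seq_id = 0
    while all(cache.append_token(seq_id) for _ in range(actual_len)):
        seq_id += 1
    return seq_id


# Assumed setup: room for 8192 cached tokens, a 2048-token limit per
# sequence, 16-token pages, and requests that really use 200 tokens each.
contiguous = max_batch_contiguous(8192, 2048)
paged = max_batch_paged(8192, 16, 200)
print(contiguous, paged)
```

With these assumed numbers, up-front reservation fits 4 sequences while the paged allocator fits 39. The only waste left is the unused tail of each sequence's last page.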
The other thing to know about vLLM is portability. It runs on NVIDIA, AMD (ROCm), Intel (XPU/Gaudi), Google TPU, and even CPU, with a plugin system for more. And it isn't standing still: the V1 engine, a ground-up rewrite of the scheduler, KV-cache manager, and API server, is now the default and claims up to 1.7x the throughput of the old core. If you want one engine that runs everywhere and is on the most active mainline, this is it.
The fastest engine, with an asterisk
TensorRT-LLM is NVIDIA's answer, and on NVIDIA silicon it typically wins the raw-throughput crown. The reason is also the catch: instead of executing the model through a general-purpose framework at runtime, its classic workflow compiles the model ahead of time into a serialized, hardware-specific TensorRT engine, fusing kernels, baking in quantization (FP8 on Hopper), and tuning for your exact GPU. NVIDIA's own benchmarks cite very high output-token rates on H100. The C++/CUDA share of the repo (well over 40% combined) tells you where that speed comes from: compiled kernels, not a Python hot path.
The asterisk has two parts. First, it is NVIDIA-only — by design, there is no portability story; choosing it is choosing the vendor. Second, that ahead-of-time engine build has historically been real operational friction: a separate step, re-run per model and per GPU, that complicates your deploy pipeline. NVIDIA knows this, which is why recent releases ship a PyTorch-native backend and a high-level LLM API that move it toward vLLM's load-and-go ergonomics. The compile-step penalty TensorRT-LLM once exclusively owned is shrinking — but the lock-in isn't.
The HF-native option that stopped racing
TGI (Text Generation Inference) is the engine that feels like home if you already live in the Hugging Face ecosystem. Its router and launcher are written in Rust for low overhead, the model code is Python, and it integrates directly with the Hub and powers HF's hosted Inference Endpoints. Its v3 release added serious long-context and prefix-caching work. It has also had a bumpy licensing history worth knowing: in 2023 HF briefly relicensed it away from Apache-2.0 to a source-available license (HFOIL) that restricted reselling it as a hosted service, then reverted to Apache-2.0 in 2024 after community pushback.
The decisive fact about TGI in 2026 is on its own repo page: it is in maintenance mode. The team accepts bug fixes, docs, and lightweight maintenance — but it is no longer the place where the feature race is being run. That doesn't make TGI a bad choice; it's stable, supported, and excellent if you're deploying through HF Inference Endpoints. It makes it the legacy-comfort choice rather than the forward bet, and that distinction should weigh heavily if you're standing up infrastructure you intend to keep for years.
How to actually choose
Stop comparing peak tokens-per-second screenshots. The benchmarks move with every release, the feature sets are converging — vLLM has prefix caching, TGI has long-context optimizations, TensorRT-LLM has a PyTorch path — and your real throughput depends on your model, quantization, and batch shape far more than on the engine's logo. The durable differences are the ones that won't change with the next point release:
- Want the portable, actively-developed default? vLLM. It runs on whatever accelerators you have now and the ones you'll buy later, and it's the mainline everyone else tracks. For most teams this is the correct first answer.
- Locked to NVIDIA and chasing every last token/sec? TensorRT-LLM. You're trading portability and a build step for the highest ceiling on the hardware you've already committed to.
- Already all-in on Hugging Face? TGI. The least-friction path inside that ecosystem — just go in knowing you're adopting a project in maintenance mode, not one sprinting on features.
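None of this means you should never benchmark. It means the only numbers worth having are your own, taken on your model, your quantization, and your traffic shape. A minimal sketch of the bookkeeping, assuming you have already recorded when each request was sent and when each output token arrived (for example from a streaming client): `RequestTrace` and `summarize` are hypothetical names, not part of any engine, and the timestamps in the example are invented.

```python
from dataclasses import dataclass
from statistics import median


@dataclass
class RequestTrace:
    sent_at: float            # seconds, when the request was sent
    token_times: list[float]  # seconds, arrival time of each output token


def summarize(traces):
    """Aggregate throughput and time-to-first-token over completed requests."""
    done = [t for t in traces if t.token_times]
    if not done:
        raise ValueError("no completed requests to summarize")
    start = min(t.sent_at for t in done)
    end = max(t.token_times[-1] for t in done)
    total_tokens = sum(len(t.token_times) for t in done)
    ttfts = [t.token_times[0] - t.sent_at for t in done]
    return {
        # Wall-clock rate across all concurrent requests, not per request.
        "output_tokens_per_s": total_tokens / (end - start),
        "median_ttft_s": median(ttfts),
        "worst_ttft_s": max(ttfts),
    }


# Invented example: two concurrent requests, four output tokens each.
traces = [
    RequestTrace(sent_at=0.0, token_times=[0.5, 1.0, 1.5, 2.0]),
    RequestTrace(sent_at=0.0, token_times=[1.0, 2.0, 3.0, 4.0]),
]
report = summarize(traces)
print(report)
```

Reporting time-to-first-token next to aggregate throughput is deliberate: an engine can raise tokens per second by batching more aggressively while making each individual user wait longer, and a single headline number hides that trade.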
The engines that started as rivals are quietly becoming commodities that do the same job well. Which means the question worth asking isn't "which is fastest this quarter" — it's "which bet on portability, vendor, and project momentum do I still want to be living with in two years." If you also need to choose among the managed APIs that wrap these engines, that's a different decision about hosted inference providers; this one is about the engine you run yourself.



