Every "best LLM serving engine" thread eventually becomes the same fight: someone posts a tokens-per-second chart, someone else posts a different one, and a third person points out the batch size was 1 in one test and 1,024 in the other. The fight never resolves because the question is malformed. Peak throughput is a property of a benchmark harness. The engine you should run is a property of your traffic — specifically, how much of the context your requests share with each other.
That traffic profile, meaning your concurrency and how much prefix your requests share, sorts the field cleanly. Once you know whether you're one person or a thousand, and whether those thousand are hitting a shared system prompt, the choice mostly makes itself.
The four engines, by what they actually optimize
Ollama (174.6k GitHub stars, MIT) is a Go binary that wraps llama.cpp and reads GGUF weights. It loads a model on demand, keeps it warm for a few minutes, and serves you. What it does not do is continuous batching — the technique that lets a server interleave many requests through the GPU at once. That omission is not a bug; it's the product. Ollama optimizes for "does it run on my laptop without a fight," and on that axis nothing beats it.
vLLM (83.4k stars, Apache-2.0) is the engine that made high-throughput open serving normal. Its two load-bearing ideas are PagedAttention — managing the KV cache in fixed-size blocks like virtual memory pages, which kills fragmentation — and continuous batching, which swaps a finished sequence out of the running batch and a new one in without waiting for the whole batch to drain. The practical effect is that one GPU serves several times the traffic of a naive PyTorch loop, across a model zoo that tracks new architectures within days of release.
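To see why swapping finished sequences out matters, here is a toy model of the scheduling idea only. It is a sketch, not vLLM's scheduler: the function names are made up for illustration, each request is reduced to a number of decode steps (assumed to be at least 1), and prefill, memory limits, and PagedAttention's block management are ignored.

```python
from collections import deque


def static_batching_steps(lengths, batch_size):
    """Decode steps when each batch must fully drain before the next starts."""
    steps = 0
    for i in range(0, len(lengths), batch_size):
        # The whole batch waits for its longest sequence.
        steps += max(lengths[i:i + batch_size])
    return steps


def continuous_batching_steps(lengths, batch_size):
    """Decode steps when a finished sequence's slot is refilled right away."""
    waiting = deque(lengths)
    running = []  # remaining decode steps for each in-flight sequence
    steps = 0
    while waiting or running:
        # Fill any free slots before the next decode step.
        while waiting and len(running) < batch_size:
            running.append(waiting.popleft())
        steps += 1
        # Every running sequence emits one token; drop the ones that finished.
        running = [r - 1 for r in running if r > 1]
    return steps


# One long request mixed in with seven short ones, two slots on the GPU.
lengths = [8, 1, 1, 1, 1, 1, 1, 1]
print(static_batching_steps(lengths, batch_size=2))      # 11
print(continuous_batching_steps(lengths, batch_size=2))  # 8
```

In the static case the short requests queue behind batches that have to drain; in the continuous case they slip through the free slot while the long request keeps decoding.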
SGLang (29.4k stars, Apache-2.0) starts from PagedAttention-style batching and adds the thing this whole piece is about: RadixAttention. Instead of throwing away the KV cache between requests, it stores it in a radix tree keyed by the token sequence, so any two requests that share a prefix compute that prefix exactly once. The project claims up to 5x faster inference from this on the workloads it's built for. The arXiv paper lays out the mechanism in full.
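The prefix-sharing idea can be sketched in a few lines. This is an illustration under loud assumptions, not SGLang's implementation: it uses a plain token-level trie rather than a compressed radix tree, stores no actual KV tensors, and never evicts anything. `PrefixCache` and `insert_and_count` are hypothetical names.

```python
class PrefixCache:
    """Toy prefix cache: a trie keyed by token IDs (no KV data, no eviction)."""

    def __init__(self):
        self.root = {}

    def insert_and_count(self, tokens):
        """Insert a request's tokens; return (cached, computed) token counts."""
        node = self.root
        cached = 0
        for tok in tokens:
            if tok in node:
                # This token extends a prefix some earlier request already paid for.
                cached += 1
            else:
                # First time this path is seen: it would need a real prefill.
                node[tok] = {}
            node = node[tok]
        return cached, len(tokens) - cached


system_prompt = [1, 2, 3, 4]  # stand-in token IDs for a shared system prompt
cache = PrefixCache()
print(cache.insert_and_count(system_prompt + [10, 11]))  # (0, 6)
print(cache.insert_and_count(system_prompt + [20]))      # (4, 1)
print(cache.insert_and_count(system_prompt + [10, 12]))  # (5, 1)
```

The second and third requests only "compute" the tokens that diverge from what is already in the tree, which is the redundancy the paragraph above describes.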
TensorRT-LLM (NVIDIA, open-source but NVIDIA-only) is the performance ceiling if you've committed to NVIDIA silicon and are willing to compile. NVIDIA's own numbers put an H100 with FP8 at over 10,000 output tokens/sec, climbing to roughly 21,000 at batch size 1,024, and roughly 4.6x the throughput of an A100. The price of that ceiling is a per-model build step and a hardware monoculture.
The axis that decides it
Forget the chart. Answer two questions.
Are your requests independent or do they share context? A RAG service prepends the same instructions, and often the same popular documents, to every query. An agent replays a long system prompt and tool spec on every step. A chat product re-sends the conversation history on every turn. All three are prefix-heavy: most of the tokens going into the model are tokens it already saw moments ago. That is exactly the redundancy RadixAttention deletes. If your prefixes are large and reused, SGLang's radix-tree cache turns repeated prefill into a tree lookup, and the gap over a block-hashing cache widens with every shared token.
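A back-of-the-envelope version of that redundancy, as a sketch: assume every request is one shared prefix plus a unique suffix, and that a perfect cache computes the prefix once. The function name and the numbers are illustrative, not measurements from any engine.

```python
def prefill_tokens(n_requests: int, shared_prefix: int, unique_suffix: int) -> tuple[int, int]:
    """Return prefill token counts (without prefix cache, with ideal prefix cache)."""
    # Without caching, every request recomputes the shared prefix.
    no_cache = n_requests * (shared_prefix + unique_suffix)
    # With an ideal cache, the prefix is computed once and only suffixes are new.
    with_cache = shared_prefix + n_requests * unique_suffix
    return no_cache, with_cache


# Hypothetical traffic: 1,000 requests, 2,000-token shared prompt, 100-token questions.
no_cache, with_cache = prefill_tokens(1000, 2000, 100)
print(no_cache, with_cache, round(no_cache / with_cache, 1))  # 2100000 102000 20.6
```

Set `shared_prefix` to 0 and the two numbers are equal, which is the no-sharing case where a prefix cache has nothing to save.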
The leaderboard measures a workload nobody runs; your bill is decided by how many tokens you compute twice.
If your requests don't share much — unique documents, one-shot classification, embeddings — then there's no shared prefix to cache, RadixAttention has little to chew on, and the decision collapses back to throughput, model coverage, and operational taste. That's vLLM's home court.
How many concurrent users are there? One is a different machine than a thousand. A single user — a developer, a desktop app, a script — gets nothing from continuous batching because there's no batch to be continuous about. The marginal request that justifies a serving engine never arrives. Ollama wins by default, and reaching for vLLM here buys you operational weight you'll never amortize.
So: which one
- One user, local, "just work": Ollama. Single binary, GGUF, automatic VRAM juggling, runs on a MacBook. The throughput it leaves on the table is throughput you weren't going to use.
- Many users, shared context (RAG, agents, multi-turn chat): SGLang. This is the case RadixAttention was designed for, and the more your traffic reuses a prefix, the more decisive it gets.
- Many users, broad or fast-moving model coverage, no compile step: vLLM. The default for a reason — widest architecture support, mature tooling, and you can swap models without rebuilding anything.
- Locked to NVIDIA, chasing the absolute throughput ceiling: TensorRT-LLM. Fastest if you accept the build step and the hardware lock-in.
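The list above can be written down as a small decision function. Treat it as a sketch of this article's rule of thumb, not a benchmark-backed policy: the function and parameter names are hypothetical, the 0.5 cutoff for "shared context" is an arbitrary assumption, and the order of the checks is a judgment call.

```python
def pick_engine(
    concurrent_users: int,
    shared_prefix_fraction: float,
    nvidia_only: bool = False,
    chase_peak_throughput: bool = False,
) -> str:
    """Map a traffic profile to an engine, following the bullets above."""
    # One user: there is no batch for a serving engine to exploit.
    if concurrent_users <= 1:
        return "Ollama"
    # Locked to NVIDIA and willing to pay the build step for the ceiling.
    if nvidia_only and chase_peak_throughput:
        return "TensorRT-LLM"
    # Assumed threshold: at least half of input tokens are a reused prefix.
    if shared_prefix_fraction >= 0.5:
        return "SGLang"
    # Many users, little sharing: default to broad model coverage.
    return "vLLM"


print(pick_engine(1, 0.9))      # Ollama
print(pick_engine(1000, 0.8))   # SGLang
print(pick_engine(1000, 0.1))   # vLLM
print(pick_engine(1000, 0.1, nvidia_only=True, chase_peak_throughput=True))  # TensorRT-LLM
```

The inputs are the two questions from earlier, concurrency and prefix sharing, plus the hardware commitment that gates the TensorRT-LLM bullet.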
One caveat, since the prefix-sharing story is easy to oversell: SGLang's edge is real on prefix-heavy traffic, but under brute high concurrency with little sharing, vLLM's batching path has held up well in head-to-head benchmark threads, partly because Python-side routing can bottleneck before the GPU does. That is the whole point. There is no universal winner because "winning" is defined by a workload, and the engines have quietly specialized into the shapes of different ones.
The honest version of the comparison isn't a ranking. It's a question handed back to you: how many people, and how much of what they send have you already seen? Answer that and the engine is the easy part.



