The inference-and-serving library, read in order — from the first fork (self-host vs API) through which engine (vLLM, SGLang, TensorRT-LLM, TGI, Ollama, MLX/llama.cpp), which accelerator (H100/H200/B200, MI300X, Groq/Cerebras), throughput and scaling (continuous batching, prefill vs decode, tensor vs pipeline parallelism), decode and attention acceleration (speculative decoding, GQA/MLA, FlashAttention/PagedAttention), the KV cache (quantization, eviction, offloading), sampling and tokenization, the gateway/router in front (LiteLLM, OpenRouter, RouteLLM), and latency and cost operations.
Is it cheaper to run an open model on your own GPUs than to call an API? The deciding number isn't the token price — it's how busy the GPU stays.
Three engines, one job: turn a model into a high-throughput endpoint. The feature gaps are closing — what's left is portability, vendor lock-in, and which project is still being built.
With TGI archived and Hugging Face pointing everyone at vLLM and SGLang, the open-source serving field narrowed to three real choices. They hit nearly the same throughput ceiling from opposite directions — so speed is not the thing you're actually picking.
One of these isn't an inference engine at all — it's a wrapper around the other two. Sorting that out is the whole decision, and it just got simpler because one contender quietly left the race.
They all wrap roughly the same inference engine, so they all run the same model at roughly the same speed. The thing that actually separates them is what shape they want to be — a daemon, a polished app, or an open one.
Ollama just ripped out llama.cpp and bolted in Apple's MLX on the Mac. The switch is a tell about where your bottleneck actually lives — and when the older engine still wins.
Buyers shop for these cards by peak FLOPS. Token generation barely uses them. The spec that actually moves inference throughput is the one most spec sheets bury — and a single NVIDIA card proves it.
The B200's headline 5-6x throughput jump is two different upgrades wearing one number — bigger HBM and FP4 compute — and which one matters depends entirely on whether your workload is memory-bound or compute-bound.
It isn't a FLOPS race. Decode is memory-bound, and the MI300X's 192 GB lets a model fit on fewer GPUs than it would need on 80 GB H100s. The catch was never the silicon; it was ROCm. Here's where that tax stands in 2026.
Three startups built custom silicon to outrun the GPU on token generation. The speed is real, the SRAM is tiny, and that tradeoff decides everything.
Static batching wastes the GPU because LLM outputs are variable-length: slots that finish early sit idle while the batch waits for its longest reply. Continuous batching schedules at every token step instead. The catch is that the same trick that wins throughput can spike latency.
A single tokens-per-second number hides two workloads pulling in opposite directions — and the whole arc of serving optimization is the field admitting they should never share a GPU.
When one model won't fit on one GPU, you have two ways to cut it up — and the right cut is a description of your interconnect, not a tuning knob you guess at.
Three ways to put a model behind an endpoint — and they increasingly run the same engine underneath, so the thing you are actually choosing is not speed.
Speculative decoding makes a single LLM response 2–6x faster without changing a token of the output. The reason it works — and why the newest method wins — is a fact about your GPU, not your model.
Every attention variant since 2019 has been one argument about the same scarce resource — the key-value cache — and the newest answer changes the terms of the deal.
Stop choosing between them. FlashAttention is the compute kernel, PagedAttention is the memory layout, FlashInfer is the engine — a modern stack runs all three at once.
You quantized the weights to 4-bit and thought memory was solved. At long context the KV cache dwarfs the weights — and it needs a different kind of quantization to shrink safely.
Three of these throw tokens away to save memory. One keeps them all and just reads less — and for a long-running agent that revisits its own past, that difference is the whole game.
Your engine computes a KV cache, uses it once, and throws it away. Offloading turns that scratchpad into a shared storage tier — and changes the question you should be asking.
Three of these knobs do the same job — truncate the unreliable tail of the next-token distribution. The differences are smaller, and more contested, than the tutorials admit. And if you build agents, you probably want almost none of it.
Three libraries everyone compares as if you get to choose. You don't — your model already chose for you. The real question is what that choice costs, and who pays it.
Every agent ends up talking to more than one model provider. The library you put in the middle decides whether that seam stays a proxy or quietly becomes your control plane.
They get filed as rivals because both promise "one API for every model." But one is a hosted marketplace you buy from, the other is infrastructure you run — and the smart move is often to use both.
Per-prompt model routing promises GPT-quality answers at a fraction of the bill. The honest 2026 answer is that it's a cost lever with a threshold, not a free lunch, and a neutral benchmark disagrees with the marketing.
The three numbers everyone quotes measure three different bottlenecks — and per-user speed and system throughput move in opposite directions, so a vendor's headline tok/s can mean whatever flatters it.
Buying a faster model is the reflex, and usually the wrong first move. An agent's wait is a chain of serial round-trips — so the latency is in the loop, not the tokens-per-second.
The cheaper-model reflex is the wrong first move. An agent's bill is dominated by the transcript it re-sends on every step — so the money is in the context, not the price card.
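To make that last claim concrete, here is a minimal sketch of how billed input tokens pile up when an agent re-sends its whole transcript on every step. The function name and the figures (a 2,000-token system prompt, 1,000 tokens appended per step, 20 steps) are illustrative assumptions, not measurements, and the sketch assumes no prompt caching or context trimming.

```python
def cumulative_input_tokens(system_tokens: int, tokens_added_per_step: int, steps: int) -> int:
    """Total input tokens billed when the full transcript is re-sent on every step.

    Hypothetical model: no prompt caching, no truncation, fixed growth per step.
    """
    total = 0
    transcript = system_tokens
    for _ in range(steps):
        total += transcript                   # the whole transcript goes in as input
        transcript += tokens_added_per_step   # tool output and model reply get appended
    return total


# Illustrative numbers only.
billed = cumulative_input_tokens(system_tokens=2000, tokens_added_per_step=1000, steps=20)
final_transcript = 2000 + 1000 * 20
print(billed, final_transcript)
```

Under these assumptions the agent is billed for 230,000 input tokens while the transcript itself only ever grows to 22,000. Billed input grows roughly with the square of the step count, which is why trimming or caching context can matter more than the per-token price.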