Topic

LLM Inference & Serving

The inference-and-serving library, read in order — from the first fork (self-host vs API) through which engine (vLLM, SGLang, TensorRT-LLM, TGI, Ollama, MLX/llama.cpp), which accelerator (H100/H200/B200, MI300X, Groq/Cerebras), throughput and scaling (continuous batching, prefill vs decode, tensor vs pipeline parallelism), decode and attention acceleration (speculative decoding, GQA/MLA, FlashAttention/PagedAttention), the KV cache (quantization, eviction, offloading), sampling and tokenization, the gateway/router in front (LiteLLM, OpenRouter, RouteLLM), and latency and cost operations.

Self-Hosting LLM Inference vs an API: The Break-Even Math

Is it cheaper to run an open model on your own GPUs than to call an API? The deciding number isn't the token price — it's how busy the GPU stays.
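
A minimal sketch of that utilization argument, assuming a single GPU billed by the hour and a fixed peak generation rate. The function names and every dollar and throughput figure here are hypothetical placeholders, not measured prices.

```python
def self_host_cost_per_million(gpu_hourly_usd, peak_tokens_per_sec, utilization):
    """USD per million generated tokens when the GPU is busy `utilization` of the time."""
    tokens_per_hour = peak_tokens_per_sec * 3600 * utilization
    return gpu_hourly_usd / tokens_per_hour * 1_000_000


def break_even_utilization(gpu_hourly_usd, peak_tokens_per_sec, api_usd_per_million):
    """Utilization at which self-hosting matches the API's per-token price."""
    full_load_cost = self_host_cost_per_million(gpu_hourly_usd, peak_tokens_per_sec, 1.0)
    return full_load_cost / api_usd_per_million


# Hypothetical inputs, chosen only to keep the arithmetic easy to follow.
GPU_HOURLY_USD = 3.00
PEAK_TOKENS_PER_SEC = 2000
API_USD_PER_MILLION = 1.00

threshold = break_even_utilization(GPU_HOURLY_USD, PEAK_TOKENS_PER_SEC, API_USD_PER_MILLION)
print(f"break-even utilization: {threshold:.1%}")
for u in (0.10, 0.50, 0.90):
    cost = self_host_cost_per_million(GPU_HOURLY_USD, PEAK_TOKENS_PER_SEC, u)
    print(f"utilization {u:.0%}: ${cost:.2f} per million tokens")
```

The hourly bill is the same whether the GPU is idle or saturated, so the per-token cost scales with one over utilization. Below the threshold the API is cheaper under these assumptions, and above it the self-hosted GPU is.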

vLLM vs TensorRT-LLM vs TGI: Choosing a Production LLM Serving Engine

Three engines, one job: turn a model into a high-throughput endpoint. The feature gaps are closing — what's left is portability, vendor lock-in, and which project is still being built.

vLLM vs SGLang vs LMDeploy: Picking a Self-Hosted Inference Engine in 2026

With TGI archived and Hugging Face pointing everyone at vLLM and SGLang, the open-source serving field narrowed to three real choices. They hit nearly the same throughput ceiling from opposite directions — so speed is not the thing you're actually picking.

NVIDIA NIM vs vLLM vs TGI: How to Self-Host LLM Inference in 2026

One of these isn't an inference engine at all — it's a wrapper around the other two. Sorting that out is the whole decision, and it just got simpler because one contender quietly left the race.

Ollama vs LM Studio vs Jan: Running LLMs Locally in 2026

They all wrap roughly the same inference engine, so they all run the same model at roughly the same speed. The thing that actually separates them is what shape they want to be — a daemon, a polished app, or an open one.

MLX vs llama.cpp: Which Engine Should Run LLMs on Apple Silicon

Ollama just ripped out llama.cpp and bolted in Apple's MLX on the Mac. The switch is a tell about where your bottleneck actually lives — and when the older engine still wins.

GPU for LLM Inference: H100 vs H200 vs A100 vs L40S

Buyers shop for these cards by peak FLOPS. Token generation barely uses them. The spec that actually moves inference throughput is the one most spec sheets bury — and a single NVIDIA card proves it.

B200 vs H200 vs H100 for LLM Inference: Pick by Memory Wall, Not Peak FLOPS

The B200's headline 5-6x throughput jump is two different upgrades wearing one number — bigger HBM and FP4 compute — and which one matters depends entirely on whether your workload is memory-bound or compute-bound.

AMD MI300X vs NVIDIA H100 for LLM Inference: The Memory Wall and the Software Tax

It isn't a FLOPS race. Decode is memory-bound, and the MI300X's 192 GB lets a model live on fewer GPUs than an 80 GB H100 can. The catch was never the silicon — it was ROCm. Here's where that tax stands in 2026.
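
The capacity point can be checked with back-of-the-envelope arithmetic. The sketch below counts only model weights; the 70-billion-parameter FP16 model and the 90% usable-memory fraction are illustrative assumptions, and a real deployment also needs room for the KV cache and activations.

```python
import math


def min_gpus_for_weights(params_billion, bytes_per_param, gpu_memory_gb, usable_fraction=0.9):
    """Smallest GPU count whose combined usable memory holds the weights alone."""
    weights_gb = params_billion * bytes_per_param  # 1e9 params * bytes, expressed in GB
    usable_gb = gpu_memory_gb * usable_fraction    # assumed headroom for runtime overhead
    return math.ceil(weights_gb / usable_gb)


# Hypothetical model: 70B parameters stored in FP16 (2 bytes each) = 140 GB of weights.
h100_count = min_gpus_for_weights(70, 2, 80)
mi300x_count = min_gpus_for_weights(70, 2, 192)
print("80 GB cards needed:", h100_count)
print("192 GB cards needed:", mi300x_count)
```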

Groq vs Cerebras vs SambaNova: The Race for Faster-Than-GPU Inference

Three startups built custom silicon to outrun the GPU on token generation. The speed is real, the SRAM is tiny, and that tradeoff decides everything.

Continuous Batching vs Static Batching: Why LLM Serving Throughput Jumps an Order of Magnitude

Static batching wastes the GPU because LLM outputs are variable-length — short replies idle while the batch waits for the longest. Continuous batching schedules at every token step instead. The catch is that the same trick that wins throughput can spike latency.
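
A toy simulation of the scheduling difference, assuming one token per running request per decode step and ignoring prefill entirely. The function names and the output lengths are made up for illustration and do not model any particular engine's scheduler.

```python
def static_batching_steps(output_lengths, batch_size):
    """Each batch runs until its longest request finishes."""
    steps = 0
    for i in range(0, len(output_lengths), batch_size):
        steps += max(output_lengths[i:i + batch_size])
    return steps


def continuous_batching_steps(output_lengths, batch_size):
    """A finished request frees its slot, which is refilled before the next token step."""
    queue = list(output_lengths)
    running = []  # tokens still to generate for each in-flight request
    steps = 0
    while queue or running:
        # Admit waiting requests into any free slots.
        while queue and len(running) < batch_size:
            running.append(queue.pop(0))
        # One decode step yields one token per running request; drop finished ones.
        running = [r - 1 for r in running if r - 1 > 0]
        steps += 1
    return steps


# Hypothetical output lengths in tokens: a few long replies among short ones.
lengths = [5, 100, 8, 12, 90, 6, 10, 7]
slots = 4
static_steps = static_batching_steps(lengths, slots)
continuous_steps = continuous_batching_steps(lengths, slots)
for name, steps in (("static", static_steps), ("continuous", continuous_steps)):
    busy = sum(lengths) / (steps * slots)
    print(f"{name}: {steps} decode steps, slot utilization {busy:.0%}")
```

The gain here comes only from refilling idle slots. The sketch leaves out the latency cost the blurb mentions, since a real scheduler also has to fit prefill work for newly admitted requests between decode steps.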

Why LLM Inference Has Two Speeds: Continuous Batching and Prefill/Decode Disaggregation

A single tokens-per-second number hides two workloads pulling in opposite directions — and the whole arc of serving optimization is the field admitting they should never share a GPU.

Tensor Parallelism vs Pipeline Parallelism: How to Split an LLM Across GPUs

When one model won't fit on one GPU, you have two ways to cut it up — and the right cut is a description of your interconnect, not a tuning knob you guess at.

BentoML vs Ray Serve vs KServe: Choosing a Model-Serving Framework

Three ways to put a model behind an endpoint — and they increasingly run the same engine underneath, so the thing you are actually choosing is not speed.

Speculative Decoding, Explained: Why EAGLE Beats Medusa for Faster LLM Inference

Speculative decoding makes a single LLM response 2–6x faster without changing a token of the output. The reason it works — and why the newest method wins — is a fact about your GPU, not your model.

MHA vs MQA vs GQA vs MLA: How Attention Stopped Eating Your KV Cache

Every attention variant since 2019 has been one argument about the same scarce resource — the key-value cache — and the newest answer changes the terms of the deal.

FlashAttention vs PagedAttention vs FlashInfer: Three Different Problems, One Word

Stop choosing between them. FlashAttention is the compute kernel, PagedAttention is the memory layout, FlashInfer is the engine — a modern stack runs all three at once.

KV Cache Quantization: The Memory That Actually Caps Your LLM Throughput

You quantized the weights to 4-bit and thought memory was solved. At long context the KV cache dwarfs the weights — and it needs a different kind of quantization to shrink safely.
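
The size comparison follows from the standard KV cache formula: two tensors (keys and values) per layer, per KV head, per token. The model shape below is a hypothetical 8-billion-parameter configuration with grouped-query attention, chosen for round numbers rather than taken from any specific model card.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len, batch_size, bytes_per_value):
    """Bytes needed to cache keys and values for every token of every sequence."""
    # Factor 2: one key tensor and one value tensor per layer.
    return 2 * num_layers * num_kv_heads * head_dim * seq_len * batch_size * bytes_per_value


GIB = 1024 ** 3

# Hypothetical shape: 32 layers, 8 KV heads, head dimension 128.
LAYERS, KV_HEADS, HEAD_DIM = 32, 8, 128
weights_4bit = 8e9 * 0.5  # 8B parameters at 4 bits (0.5 bytes) each

per_token = kv_cache_bytes(LAYERS, KV_HEADS, HEAD_DIM, 1, 1, 2)
print(f"FP16 KV cache per token: {per_token / 1024:.0f} KiB")
print(f"4-bit weights: {weights_4bit / GIB:.1f} GiB")
for seq_len in (4096, 32768, 131072):
    fp16 = kv_cache_bytes(LAYERS, KV_HEADS, HEAD_DIM, seq_len, 1, 2)
    int8 = kv_cache_bytes(LAYERS, KV_HEADS, HEAD_DIM, seq_len, 1, 1)
    print(f"{seq_len:>6} tokens: FP16 {fp16 / GIB:.1f} GiB, 8-bit {int8 / GIB:.1f} GiB")
```

The cache grows linearly with both context length and batch size, while the weights stay fixed, which is why the crossover arrives at long context or high concurrency.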

KV Cache Eviction: StreamingLLM vs H2O vs SnapKV vs Quest

Three of these throw tokens away to save memory. One keeps them all and just reads less — and for a long-running agent that revisits its own past, that difference is the whole game.

KV Cache Offloading: LMCache vs Mooncake vs NVIDIA Dynamo

Your engine computes a KV cache, uses it once, and throws it away. Offloading turns that scratchpad into a shared storage tier — and changes the question you should be asking.

Temperature vs Top-p vs Top-k: How LLM Sampling Actually Works

Three of these knobs do the same job — truncate the unreliable tail of the next-token distribution. The differences are smaller, and more contested, than the tutorials admit. And if you build agents, you probably want almost none of it.
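
For reference, a plain-Python sketch of the three knobs as they are usually defined: temperature rescales the logits before the softmax, while top-k and top-p zero out the low-probability tail and renormalize. The helper names and the example logits are hypothetical, and production engines apply the same ideas to tensors rather than lists.

```python
import math
import random


def softmax(logits, temperature=1.0):
    """Convert logits to probabilities; lower temperature sharpens the distribution."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def _keep_and_renormalize(probs, keep):
    kept = [p if i in keep else 0.0 for i, p in enumerate(probs)]
    total = sum(kept)
    return [p / total for p in kept]


def top_k_filter(probs, k):
    """Keep the k most likely tokens."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    return _keep_and_renormalize(probs, set(order[:k]))


def top_p_filter(probs, p):
    """Keep the smallest set of most likely tokens whose cumulative mass reaches p."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    keep, mass = set(), 0.0
    for i in order:
        keep.add(i)
        mass += probs[i]
        if mass >= p:
            break
    return _keep_and_renormalize(probs, keep)


def sample(probs, rng):
    """Draw one token index according to probs."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


logits = [2.0, 1.0, 0.5, -1.0, -3.0]  # hypothetical next-token logits
probs = softmax(logits, temperature=1.0)
rng = random.Random(0)
print("top-k=2 sample:", sample(top_k_filter(probs, 2), rng))
print("top-p=0.9 sample:", sample(top_p_filter(probs, 0.9), rng))
```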

tiktoken vs SentencePiece vs Hugging Face Tokenizers

Three libraries everyone compares as if you get to choose. You don't — your model already chose for you. The real question is what that choice costs, and who pays it.

LiteLLM vs Portkey vs TensorZero: Choosing an LLM Gateway in 2026

Every agent ends up talking to more than one model provider. The library you put in the middle decides whether that seam stays a proxy or quietly becomes your control plane.

OpenRouter vs LiteLLM: Which LLM Gateway for Your AI Agent Stack?

They get filed as rivals because both promise "one API for every model." But one is a hosted marketplace you buy from, the other is infrastructure you run — and the smart move is often to use both.

RouteLLM vs NotDiamond vs Martian: Do LLM Model Routers Actually Cut Costs?

Per-prompt model routing promises GPT-quality answers at a fraction of the bill. The honest 2026 answer is that routing is a cost lever that pays off only past a threshold, not a free win, and a neutral benchmark disagrees with the marketing.

LLM Inference Latency: TTFT vs TPOT vs Throughput, and Why 'Tokens Per Second' Is Two Numbers

The three numbers everyone quotes measure three different bottlenecks — and per-user speed and system throughput move in opposite directions, so a vendor's headline tok/s can mean whatever flatters it.
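
A small sketch of how the three numbers fall out of token arrival timestamps. The timings are invented (eight identical concurrent requests, times in milliseconds), and the function names are illustrative rather than any benchmark tool's API.

```python
def request_metrics(start_ms, token_times_ms):
    """Return (TTFT, TPOT) in milliseconds for one request."""
    ttft = token_times_ms[0] - start_ms
    n = len(token_times_ms)
    # TPOT averages the gaps after the first token, so prefill time is excluded.
    tpot = (token_times_ms[-1] - token_times_ms[0]) / (n - 1) if n > 1 else 0.0
    return ttft, tpot


def system_throughput(requests):
    """Output tokens per second across all requests, over the wall-clock span."""
    start = min(s for s, _ in requests)
    end = max(times[-1] for _, times in requests)
    total_tokens = sum(len(times) for _, times in requests)
    return total_tokens * 1000 / (end - start)


# Hypothetical trace: each request starts at 0 ms and receives 5 tokens.
one_request = (0, [200, 250, 300, 350, 400])
requests = [one_request] * 8  # 8 users served concurrently

ttft, tpot = request_metrics(*one_request)
per_user_speed = 1000 / tpot
aggregate = system_throughput(requests)
print(f"TTFT {ttft} ms, TPOT {tpot} ms")
print(f"per-user decode speed: {per_user_speed:.0f} tok/s")
print(f"system throughput: {aggregate:.0f} tok/s")
```

Both of the last two figures are legitimately "tokens per second" for the same trace, which is why a headline number needs to say which one it is and at what concurrency.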

How to Reduce AI Agent Latency

Buying a faster model is the reflex, and usually the wrong first move. An agent's wait is a chain of serial round-trips — so the latency is in the loop, not the tokens-per-second.

How to Reduce AI Agent Token Costs

The cheaper-model reflex is the wrong first move. An agent's bill is dominated by the transcript it re-sends on every step — so the money is in the context, not the price card.
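
The re-sent transcript is easy to quantify. This sketch assumes every step sends the full accumulated context as input and appends a fixed number of new tokens. It ignores output-token pricing and any prompt caching discount, and all token counts are hypothetical.

```python
def agent_input_tokens(system_tokens, tokens_added_per_step, num_steps):
    """Total input tokens billed when each step re-sends the whole transcript."""
    total = 0
    context = system_tokens
    for _ in range(num_steps):
        total += context                  # the full transcript goes in as input
        context += tokens_added_per_step  # tool call and result are appended
    return total


# Hypothetical agent: 2,000-token system prompt, 1,500 new tokens per step.
SYSTEM_TOKENS = 2000
ADDED_PER_STEP = 1500

for steps in (10, 20, 40):
    billed = agent_input_tokens(SYSTEM_TOKENS, ADDED_PER_STEP, steps)
    fresh = SYSTEM_TOKENS + ADDED_PER_STEP * steps
    print(f"{steps} steps: {billed:,} input tokens billed, {fresh:,} distinct tokens written")
```

Billed input grows roughly with the square of the step count while the distinct content grows linearly, so trimming or summarizing the context cuts more of the bill than a lower per-token price does under these assumptions.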