Someone tells you their LLM endpoint does "3,000 tokens per second." That number is not wrong, exactly. It's just uninterpretable — like being told a car "does 60" without being told whether that's top speed, zero-to-sixty, or fuel economy. A throughput figure quoted without the load that produced it and the prompts that fed it describes nothing you can act on, and certainly nothing you can compare.
Benchmarking LLM serving correctly is not hard, but it requires giving up the thing everyone wants — a single score — for the thing that's actually true: a curve.
The four numbers, and the one identity that ties them together
Start with what to measure. There are four metrics, and conflating them is the first mistake:
- TTFT (time to first token) — how long until the first token appears. Set by prefill; scales with input length. This is responsiveness.
- TPOT / ITL (time per output token, inter-token latency) — the steady-state gap between tokens during decode. This is how smooth the stream feels.
- End-to-end latency — the whole request.
- Throughput — output tokens/sec and requests/sec, summed across all concurrent requests.
The identity that connects them is the one people forget: end-to-end latency ≈ TTFT + TPOT × output length. Which means a latency number with no output length attached is meaningless, and two runs with different output lengths are not comparable. Pin the lengths, report them, or your benchmark measures nothing repeatable. (TTFT and TPOT also trade off against each other: a scheduler that prioritizes prefill for new requests improves TTFT but stalls the decodes already in flight, and one that protects decodes makes new requests wait. Optimizing one usually taxes the other.)
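To make the identity concrete, here is a minimal sketch of the arithmetic. The function name and the sample numbers are illustrative assumptions, not measurements, and the estimate treats TPOT as constant and ignores queueing time.

```python
def estimate_e2e_latency_s(ttft_s: float, tpot_s: float, output_tokens: int) -> float:
    """Approximate end-to-end latency as TTFT + TPOT * output length."""
    return ttft_s + tpot_s * output_tokens


# Hypothetical server: 200 ms TTFT, 20 ms per output token.
short = estimate_e2e_latency_s(0.2, 0.02, 100)
long = estimate_e2e_latency_s(0.2, 0.02, 1000)
print(f"{short:.1f} s vs {long:.1f} s")  # prints "2.2 s vs 20.2 s"
```

Same server, same TTFT, same TPOT, and the end-to-end figure differs by roughly 9x purely because of output length.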
Load is half the measurement
The second mistake is treating "the workload" as if it's just the model. It isn't — it's the model plus how hard you push it. You have to specify the offered load, and there are two honest ways to do it:
- Closed-loop holds a fixed number of concurrent requests in flight. Good for modeling a known concurrency on a batched server.
- Open-loop fires requests on a schedule — a target QPS, usually Poisson-distributed arrivals. This is what surfaces head-of-line blocking and saturation, because if your arrival rate exceeds what the server can clear, the queue grows without bound and tail latency explodes. That explosion is the point; it's what a closed-loop test politely hides.
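The queue growth under open-loop load can be seen in a toy simulation. The sketch below assumes a single FIFO server with a fixed service time and no batching, which is far simpler than a real LLM server; the function names and the 10 requests/s capacity are made up for illustration.

```python
import random


def poisson_arrival_times(rate_qps: float, n: int, seed: int = 0) -> list:
    """Open-loop schedule: exponential gaps give Poisson arrivals at rate_qps."""
    rng = random.Random(seed)
    t = 0.0
    times = []
    for _ in range(n):
        t += rng.expovariate(rate_qps)
        times.append(t)
    return times


def queue_waits(arrivals: list, service_s: float) -> list:
    """Time each request waits before a single FIFO server starts it."""
    waits = []
    free_at = 0.0  # when the server finishes the request it is working on
    for arrival in arrivals:
        start = max(arrival, free_at)
        waits.append(start - arrival)
        free_at = start + service_s
    return waits


# Toy server that clears exactly 10 requests/s (0.1 s per request).
for qps in (5, 9, 12):
    waits = queue_waits(poisson_arrival_times(qps, 5000), service_s=0.1)
    mean_wait = sum(waits) / len(waits)
    print(f"offered {qps:>2} QPS: mean wait {mean_wait:.2f} s, last wait {waits[-1]:.2f} s")
```

Below capacity the waits stay small; at 12 QPS against a 10 QPS server, every additional request waits longer than the one before it. A closed-loop driver would never show this, because it only issues a new request when an old one completes.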
Whichever you choose, report percentiles, not averages. The mean is set by the bulk of fast requests and barely registers the slow tail; your users live in the p99. A system with a lovely average and a hideous p99 is a system that feels broken to one request in a hundred, which, at scale, is constant.
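A small worked example shows how far the mean and the tail can diverge. The latencies below are invented for illustration, and the nearest-rank percentile is one of several common definitions; benchmarking tools may interpolate differently.

```python
import math
import statistics


def percentile(samples: list, pct: float) -> float:
    """Nearest-rank percentile: smallest sample with at least pct% of samples at or below it."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


# Hypothetical latencies in seconds: 980 fast requests and 20 slow ones.
latencies = [0.2] * 980 + [5.0] * 20

print(f"mean {statistics.mean(latencies):.3f} s")
print(f"p50  {percentile(latencies, 50):.1f} s")
print(f"p99  {percentile(latencies, 99):.1f} s")
```

The mean comes out near 0.3 s and the median at 0.2 s, while p99 is 5 s: the average looks healthy even though 2% of requests take 25 times longer than the rest.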
The deliverable of a serving benchmark is not a number. It's a latency-versus-throughput curve, and the one point on it worth quoting is goodput.
The number that's actually worth quoting: goodput
Here's the non-obvious part. If you sweep the offered load and plot throughput against latency, you don't get a flat line — you get a curve with a knee. Below the knee, adding load adds throughput at roughly constant latency. At the knee, throughput stops rising and latency starts climbing steeply: you're saturated. Past it, you're just building a queue.
The right capacity number is read off that curve, not from its peak. Fix your SLO (say, p99 TTFT under 200 ms and p99 TPOT under 50 ms), sweep the load, and find the maximum request rate at which p99 still meets the SLO. That's goodput, the metric the DistServe paper put at the center of LLM serving evaluation precisely because a system can post enormous raw throughput while violating its latency targets for most requests. Raw throughput counts the slow completions; goodput only counts the ones that were actually good. Quote goodput and the SLO together, and you've said something true. Quote peak tokens/sec alone and you've said something that sounds good.
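Reading goodput off a sweep is a small computation once the percentiles are in hand. The sketch below uses the definition given above (the highest swept rate whose p99 values meet the SLO); the `SweepPoint` type, the `goodput` helper, and the sweep data are all hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class SweepPoint:
    request_rate: float  # offered load, requests/s
    p99_ttft_s: float
    p99_tpot_s: float


def goodput(points: list, ttft_slo_s: float, tpot_slo_s: float) -> Optional[float]:
    """Highest swept rate that meets both SLOs, stopping at the first violation."""
    best = None
    for p in sorted(points, key=lambda p: p.request_rate):
        if p.p99_ttft_s > ttft_slo_s or p.p99_tpot_s > tpot_slo_s:
            break  # past the knee; higher rates are not trusted even if one passes
        best = p.request_rate
    return best


# Made-up sweep results, for illustration only.
sweep = [
    SweepPoint(2, 0.08, 0.020),
    SweepPoint(4, 0.09, 0.024),
    SweepPoint(8, 0.15, 0.035),
    SweepPoint(12, 0.45, 0.060),
    SweepPoint(16, 2.10, 0.140),
]
print(goodput(sweep, ttft_slo_s=0.200, tpot_slo_s=0.050))  # prints 8
```

The result is only as fine-grained as the sweep: here the true limit lies somewhere between 8 and 12 requests/s, so a second, denser sweep over that interval would tighten it.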
The prompt shape changes everything
The same hardware, same model, same software can look like two different products depending on what you feed it. A prefill-heavy workload — long input, short output, the shape of RAG and summarization — is compute-bound and batches beautifully. A decode-heavy workload — short input, long output, the shape of chat and code generation — is memory-bandwidth-bound and behaves completely differently. The two can differ by an order of magnitude in tokens/sec on identical hardware.
So benchmark with the length distribution you'll actually serve. Use a real trace or a dataset like ShareGPT for realistic variety, or fixed synthetic lengths when you need reproducibility — but never quote a result without stating the input/output length mix, because it's doing as much work in the number as the GPU is.
The tools, and the traps they don't save you from
You don't have to build this. vLLM ships vllm bench serve (formerly benchmark_serving.py), which drives request-rate or concurrency loads against most backends and reports the full TTFT/TPOT/ITL breakdown. GuideLLM (now under the vLLM project) specializes in exactly the sweep described above — it ramps load progressively to find your safe operating range against an SLO. Ray's llmperf load-tests hosted APIs and deliberately counts tokens with a single fixed tokenizer so cross-provider numbers stay honest. NVIDIA's GenAI-Perf is everywhere in older guides but is now deprecated in favor of AIPerf. For standardized cross-system comparison there's MLPerf Inference and SemiAnalysis's InferenceMAX, which re-runs nightly so its numbers track software improvements instead of going stale.
The tools measure faithfully; they won't stop you from measuring the wrong thing. Watch for the classics. Benchmarking at concurrency 1 gives you the latency floor and zero information about capacity — you never reach the knee. Comparing tokens/sec across different tokenizers is comparing different units. And the quiet one: prefix caching. If every request in your dataset shares a system prompt, the server caches that prefix and skips most of prefill — your TTFT and throughput now describe a cache hit, not your model under real, diverse traffic. Control it explicitly, and know whether the number you're about to publish is your system or your cache flattering you.
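One way to control for prefix caching on the client side is to make every prompt start differently, so no two requests share a cacheable prefix. The sketch below is an assumption about how you might do that, not a feature of any particular tool; the `defeat_prefix_cache` helper and the tag format are hypothetical.

```python
import uuid


def defeat_prefix_cache(prompts: list) -> list:
    """Prepend a unique tag so no two requests share a leading prefix."""
    return [f"[run-id {uuid.uuid4().hex}] {p}" for p in prompts]


# Every prompt in this toy dataset starts with the same system prompt.
shared_system_prompt = "You are a helpful assistant.\n\n"
prompts = [shared_system_prompt + q for q in ("What is TTFT?", "What is TPOT?")]

unique_prompts = defeat_prefix_cache(prompts)
for p in unique_prompts:
    print(p[:40])
```

The tag adds a few input tokens, so report it alongside the length mix. If production traffic really does share a system prompt, run both variants and publish the cached and uncached numbers separately rather than picking one.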
Pick the SLO first. Sweep the load. Read off goodput. Everything else is a number without a question.



