Three numbers get quoted whenever anyone benchmarks an LLM server: time to first token, time per output token, and tokens per second. They are routinely treated as if they were the same axis — faster is faster — and a vendor's slide will lead with whichever one looks best. They are not the same axis. They measure three different bottlenecks, and two of them move in opposite directions, which means a single headline "tokens per second" can be technically true and still tell you nothing.

Three numbers, three bottlenecks

Generation happens in two phases, and the metrics map onto them.

Time to first token (TTFT) is the wait before any output appears. It is governed by prefill — the model ingesting your entire prompt in one pass to build the KV cache. As NVIDIA's benchmarking guide notes, it folds in queueing and network time, but the dominant term is prompt processing, which is compute-bound and scales with how long your prompt is. Stuff more context in, and TTFT climbs.

Time per output token (TPOT), also called inter-token latency, is the average gap between consecutive generated tokens after the first. This is the decode phase, and Databricks puts the number in human terms: "a TPOT of 100 milliseconds per token would be 10 tokens per second, or ~450 words per minute." Decode is memory-bandwidth-bound: every single token requires reading the full set of model weights out of GPU memory. The two compose into the only latency a user actually feels:

Latency = TTFT + TPOT × (tokens generated). A long prompt taxes the first term; a long answer is ruled by the second. They have different cures because they have different bottlenecks.
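To make the formula concrete, here is a minimal Python sketch. The function names and the sample request (0.5 s to first token, a 200-token answer) are illustrative assumptions; only the 100 ms TPOT comes from the Databricks quote above.

```python
def end_to_end_latency(ttft_s: float, tpot_s: float, output_tokens: int) -> float:
    """Latency = TTFT + TPOT * (tokens generated), with all times in seconds."""
    return ttft_s + tpot_s * output_tokens


def tokens_per_second(tpot_s: float) -> float:
    """Per-user streaming speed implied by a given TPOT."""
    return 1.0 / tpot_s


# Hypothetical request: 0.5 s TTFT, 100 ms per output token, 200 tokens generated.
latency = end_to_end_latency(ttft_s=0.5, tpot_s=0.100, output_tokens=200)
speed = tokens_per_second(0.100)
print(f"{latency:.1f} s total at {speed:.0f} tokens/s")
```

With these made-up inputs the decode term (20 s) dwarfs the prefill term (0.5 s), which is the "long answer" case; swap in a very long prompt and a short answer and the balance flips.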

That compute-versus-memory split is the whole reason the numbers behave differently. Prefill is a big matrix multiply with high arithmetic intensity, so it saturates the GPU's compute. Decode reads gigabytes of weights to emit one token, so it is gated by bandwidth, not FLOPs — which is why, as Databricks states outright, "memory bandwidth... is a better predictor of speed of token generation than peak compute." The per-user ceiling is almost arithmetic: tokens per second at batch one is roughly memory bandwidth divided by model size in bytes. Databricks' own worked example — a 7B model in FP16 with a 14ms TPOT is moving its 14GB of weights in 14ms, or 1 TB/s — is the cleanest way to see it. FLOPs are not in that equation.
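The back-of-envelope arithmetic can be written out as a short Python sketch. The function names are hypothetical, and the model is deliberately crude: it assumes batch size one, counts only the weight reads, and ignores KV-cache traffic and any overlap with compute.

```python
def decode_ceiling_tokens_per_s(bandwidth_bytes_per_s: float,
                                n_params: float,
                                bytes_per_param: float) -> float:
    """Rough batch-1 ceiling: memory bandwidth divided by model size in bytes."""
    model_bytes = n_params * bytes_per_param
    return bandwidth_bytes_per_s / model_bytes


def implied_bandwidth_bytes_per_s(n_params: float,
                                  bytes_per_param: float,
                                  tpot_s: float) -> float:
    """Bandwidth needed to read every weight once per generated token."""
    return n_params * bytes_per_param / tpot_s


# The Databricks example: 7B parameters, FP16 (2 bytes each), 14 ms TPOT.
bandwidth = implied_bandwidth_bytes_per_s(7e9, 2, 0.014)
print(f"implied bandwidth: {bandwidth / 1e12:.2f} TB/s")

# Run the other way: what does that bandwidth allow for the same model?
ceiling = decode_ceiling_tokens_per_s(bandwidth, 7e9, 2)
print(f"batch-1 ceiling: {ceiling:.1f} tokens/s")
```

Nothing in either function takes a FLOPs figure as input, which is the point of the paragraph above.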

The number that lies: tokens per second

Now the trap. "Throughput" is total tokens per second across every request the server handles at once — a system-level number. Per-user output speed is what one conversation streams at. These are different metrics, and the link between them runs backwards.

Raise the batch size — pack more concurrent requests onto the GPU — and aggregate throughput soars, because that expensive weight-load now amortizes across the whole batch. But each request is sharing the hardware, so its TPOT gets worse. Anyscale measured the exact shape: going from batch 1 to batch 64 on an A100 lifted throughput up to 14x while raising latency about 4x. Per-user speed and system throughput are two ends of a seesaw, and batch size is the hand on it.

Which is why "300 tokens per second" is a number you cannot use until you know which one it is. Artificial Analysis is explicit that as concurrency rises, total system throughput goes up while per-user speed comes down — so a server advertising 488 aggregate tokens per second across 64 users is delivering about 7.6 to each of them, a crawl. A vendor optimizing for a throughput headline and a vendor optimizing for a snappy chat are tuning the same box in opposite directions, and both can print "tokens per second" on the slide. Always ask: per user, or total system? If it's aggregate, divide by the concurrency before you believe anything.
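The division is trivial, but it is worth writing down as a habit. A minimal sketch, assuming the aggregate figure is shared evenly across concurrent requests (real servers will not split it perfectly evenly); the function name is hypothetical:

```python
def per_user_tokens_per_s(aggregate_tokens_per_s: float, concurrency: int) -> float:
    """Average per-request output speed implied by an aggregate throughput figure."""
    if concurrency < 1:
        raise ValueError("concurrency must be at least 1")
    return aggregate_tokens_per_s / concurrency


# The example above: 488 aggregate tokens per second across 64 concurrent users.
print(per_user_tokens_per_s(488, 64))  # 7.625
```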

What to optimize, and the metric that reconciles them

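Because the metrics conflict, "make it fast" is underspecified until you name the workload. An interactive chat cares about TTFT and per-user TPOT; an offline batch job cares about aggregate throughput.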

The field's answer to the conflict is to stop pretending one number suffices. DistServe named the right target: goodput — the request rate that still satisfies both your TTFT and TPOT SLOs, rather than raw throughput that hits its number by blowing the latency budget. The same paper shows why prefill and decode end up disaggregated onto separate hardware: one is compute-bound, the other memory-bound, and forcing both onto one box means continuous batching has to referee a collision between them. That collision is the real subject of every serving-engine bake-off; what vLLM, TensorRT-LLM, and TGI are competing on is how gracefully they sit on the latency-throughput frontier, not a single tokens-per-second crown.
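One way to read goodput off a load test is sketched below in Python. This is a simplification, not DistServe's procedure: the `LoadPoint` type, the sweep data, and the SLO values are all hypothetical, and it assumes each measurement is already a tail statistic (say p90) taken at a fixed offered request rate.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class LoadPoint:
    request_rate: float  # offered load, requests per second
    ttft_s: float        # measured TTFT at that load (e.g. p90), seconds
    tpot_s: float        # measured TPOT at that load (e.g. p90), seconds


def goodput(points: Sequence[LoadPoint],
            ttft_slo_s: float,
            tpot_slo_s: float) -> Optional[float]:
    """Highest measured request rate that meets BOTH SLOs, or None if none does."""
    passing = [p.request_rate for p in points
               if p.ttft_s <= ttft_slo_s and p.tpot_s <= tpot_slo_s]
    return max(passing) if passing else None


# Hypothetical sweep: latency degrades as offered load rises.
sweep = [
    LoadPoint(request_rate=1, ttft_s=0.2, tpot_s=0.03),
    LoadPoint(request_rate=2, ttft_s=0.3, tpot_s=0.04),
    LoadPoint(request_rate=4, ttft_s=0.6, tpot_s=0.06),
    LoadPoint(request_rate=8, ttft_s=1.5, tpot_s=0.12),
]

print(goodput(sweep, ttft_slo_s=1.0, tpot_slo_s=0.05))
```

In this made-up sweep the server still accepts 8 requests per second, but only the runs at 1 and 2 requests per second meet both SLOs, so the number to quote is 2, not 8.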

So when the next benchmark crosses your desk, don't read the big number. Ask which phase it stresses, whether it's per-user or aggregate, and at what concurrency it was measured. The honest version of "how fast is it" is never one number — it's a curve, and the only point on that curve that matters is the one that meets your SLO at the load you actually run.