Time to first token: 3.07s

Latency per call: 4.94s

Agent task, end-to-end: 39.5s

Time spent waiting (TTFT): 62%

3.07s to first token, 4.94s per call. This 8-step agent task runs ~39.5s end-to-end — 62% of it spent waiting for first tokens, not generating.

How the estimate works

A request's wall-clock time has two regimes. Time to first token (TTFT) is fixed overhead (queueing, scheduling, the network round-trip) plus the time to prefill the prompt: the model reads every input token before it can emit one. Then generation streams the answer one token at a time at the model's decode speed. So a single reply takes roughly overhead + prompt_tokens/prefill_rate + output_tokens/decode_rate, and for a long chat answer the decode term dominates, which is why "tokens per second" is the headline everyone quotes.
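The single-reply formula can be sketched in a few lines of Python. This is an illustration of the arithmetic, not the calculator's actual code. The function name, parameter names, and every number below are assumptions chosen to give round results.

```python
def single_call_latency(overhead_s, prompt_tokens, output_tokens,
                        prefill_rate, decode_rate):
    """Return (ttft_s, total_s) for one request.

    Rates are in tokens per second. All names here are hypothetical.
    """
    # TTFT: fixed overhead plus the time to read (prefill) the whole prompt.
    ttft_s = overhead_s + prompt_tokens / prefill_rate
    # Generation: output tokens streamed one at a time at decode speed.
    decode_s = output_tokens / decode_rate
    return ttft_s, ttft_s + decode_s


# Illustrative chat-style request (made-up numbers, not measurements):
# 0.5 s overhead, 2,000-token prompt, 500-token answer,
# 4,000 tok/s prefill, 50 tok/s decode.
chat_ttft, chat_total = single_call_latency(0.5, 2000, 500, 4000, 50)
print(f"TTFT {chat_ttft:.2f}s, total {chat_total:.2f}s")
```

With these assumed inputs the TTFT is 1.0 s and the whole reply takes 11.0 s, so about 90% of the time is decode. That is the long-answer case where tokens per second is the number that matters.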

Agents break that intuition. An agent serializes many short calls: each tool-use step re-reads a growing context (a long prefill) and emits a tiny action — a function call, a few words of plan (a short decode). It pays the TTFT tax once per turn while barely touching the decode regime, so end-to-end the task is dominated by time-to-first-token, not raw throughput. Push the step count up with a short output and watch the "time spent waiting" figure climb past half. The practical consequence: a high-tokens/sec model can feel sluggish in a loop, and a model with a snappier TTFT can win a multi-step task despite a lower throughput headline. The fixes that matter are the prefill-side ones — prompt caching to skip re-reading the unchanged context, and fewer, fatter turns.
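To make the agent case concrete, here is a minimal Python sketch of a loop where each step re-reads a growing context and emits a short action. It is a simplified model, not the calculator's implementation. The linear context growth, the flat `cached_fraction` knob standing in for prompt caching, and all the numbers are assumptions for illustration.

```python
def agent_task_latency(steps, overhead_s, base_prompt_tokens,
                       growth_per_step, output_tokens,
                       prefill_rate, decode_rate, cached_fraction=0.0):
    """Return (total_s, waiting_share) for a serialized agent task.

    Assumes the context grows linearly by growth_per_step tokens per step
    and that cached_fraction of every prompt skips prefill entirely.
    Both are simplifying assumptions; all names are hypothetical.
    """
    waiting_s = 0.0     # time spent before the first token of each call
    generating_s = 0.0  # time spent decoding output tokens
    for step in range(steps):
        prompt_tokens = base_prompt_tokens + step * growth_per_step
        uncached_tokens = prompt_tokens * (1.0 - cached_fraction)
        waiting_s += overhead_s + uncached_tokens / prefill_rate
        generating_s += output_tokens / decode_rate
    total_s = waiting_s + generating_s
    return total_s, waiting_s / total_s


# Illustrative 8-step task (made-up numbers): 4,000-token starting context
# growing by 1,000 tokens per step, 50-token actions,
# 4,000 tok/s prefill, 50 tok/s decode, 0.5 s overhead per call.
no_cache_total, no_cache_share = agent_task_latency(
    8, 0.5, 4000, 1000, 50, 4000, 50)
cached_total, cached_share = agent_task_latency(
    8, 0.5, 4000, 1000, 50, 4000, 50, cached_fraction=0.5)
print(f"no cache: {no_cache_total:.1f}s, {no_cache_share:.0%} waiting")
print(f"50% cached: {cached_total:.1f}s, {cached_share:.0%} waiting")
```

Under these assumptions the uncached task takes 27.0 s with about 70% of it spent waiting for first tokens. Caching half of each prompt cuts the total to 19.5 s without touching decode speed at all, which is why the prefill-side fixes are the ones that move an agent's end-to-end time.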

The model × hardware speeds here are typical order-of-magnitude defaults, not a benchmark — every field is editable, so drop in your own measured TTFT and tokens/sec. The concepts are unpacked in TTFT vs. TPOT: the two numbers that define LLM latency and prefill vs. decode; the agent-side playbook is in how to reduce an AI agent's latency. Sizing the hardware or the bill instead? See the VRAM calculator and the cost calculator.
