You ship an agent. It works. Then someone uses it and the first thing they say is that it's slow — eight, fifteen, thirty seconds of a spinner before anything useful happens — and your first instinct, everyone's first instinct, is to reach for a faster model or a faster inference provider. Hold that thought, because for an agent it's usually the wrong lever, and understanding why tells you which levers are the right ones.

The wait has a shape, and the shape is a chain

A chatbot has a simple latency: one prompt in, one answer streaming out. You wait for the first token, then you read as fast as it generates. Two numbers describe it — time to first token (TTFT) and the inter-token latency that follows. NVIDIA's own breakdown is the useful one: TTFT is request queuing plus prefill plus network, and "the longer the prompt, the larger the TTFT," because prefill cost scales with how many input tokens the model has to read before it can speak.

An agent does not work that way. An agent loops. It calls the model, gets back a tool call, runs the tool, appends the result (which is also why it gets expensive), and calls the model again — and it cannot start call two until call one has fully returned, because call two's input is call one's output. The calls are serial. So an agent's latency is not one TTFT; it's TTFT-plus-generation summed down a chain of N sequential model calls. The spinner your user is watching is that whole chain.

An agent's latency isn't tokens per second. It's the number of times it has to stop and ask the model — in a row.

And here's the part that makes the naive fix backfire: every link in that chain re-sends the transcript so far, because the model is stateless and the only way it "remembers" step three is that you paste steps one and two back in. That re-sent transcript is input. Input is prefill. Prefill is TTFT. So a growing context slows the agent twice — more prefill per call, charged on every call in the chain. A faster tokens-per-second rate doesn't touch any of that. It speeds the generation slice of each call while leaving the round-trips and the prefill exactly where they were.

Fix the chain before you fix the tokens

The leverage, in order:

Make fewer round-trips. This is the critical path, so removing a link beats speeding one up. The two moves: run independent tool calls in parallel instead of one per round-trip — OpenAI ships parallel_tool_calls on by default precisely for this, and four independent 300ms calls done concurrently finish in ~300ms instead of 1.2s. And collapse plan-then-act sequences where the model already has what it needs to act, instead of making it narrate a plan in one call and execute it in the next.

Skip the prefill on what repeats. Your system prompt, tool definitions, and prior turns are identical across calls. Prompt and prefix caching reuse the already-prefilled prefix instead of re-reading it — which is a latency lever, not just the cost lever it's usually sold as. Anthropic clocks a 100,000-token cached prompt answering in 2.4 seconds versus 11.5 uncached; OpenAI applies caching automatically once a prompt crosses ~1,024 tokens. Pair it with the obvious companion: trim the context. Evict dead tool output before it gets re-sent, so the prefill you can't cache stays small.

Then, and only then, speed the tokens. Once the chain is short and the prefill is lean, output speed is the remaining slice. This is where the fast silicon earns its place — Cerebras and Groq serve 70B-class models north of 2,000 tokens per second, and speculative decoding like EAGLE-3 reports up to 6.5x generation speedup with no change to the output distribution. Route the easy steps to a smaller, faster model while reserving the flagship for the calls that actually reason; RouteLLM shows you can hold ~95% of the strong model's quality while calling it on only ~14% of queries.

And for the last step the human actually reads, stream it. Streaming doesn't change total time, but it converts a long wait into a short one plus reading — the user experiences TTFT as the wait, not the final token. Stream the user-facing turn; don't bother streaming the internal tool-deciding turns, because nothing is reading them.

The one mental model

Latency optimization for agents is critical-path optimization, and the critical path is the serial chain of model calls. Every other technique is downstream of that one fact. Count your round-trips first. Most agents that feel slow are not running slow models — they're running too many calls, each prefilling a transcript that got fat, in a straight line. Shorten the line, shrink what each link carries, and then go shopping for faster tokens.