A voice agent lives or dies on one perception: did it answer like a person would, or did it pause like a machine thinking? That perception is almost entirely a latency story, and it is the part most TTS comparisons measure wrong. They rank models on mean opinion score — how good a sample sounds in isolation — when the metric that governs a real conversation is time-to-first-audio (TTFA): how long until the first chunk of speech reaches the listener's ear, measured end to end, in production. Quality you can hear once. Latency you feel on every turn.
The metric the headlines quote, and the one that bites
Read the vendor pages and you'll collect a set of impressive small numbers. Cartesia's Sonic advertises a model latency around 90ms. ElevenLabs's Flash v2.5 targets roughly 75ms of inference time. These are real, and they are also not what your users experience, because they measure only the time the model spends generating. They exclude the network round trip to the API, any request queuing, and the delivery of that first audio chunk back to you.
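To make the distinction concrete, here is a minimal sketch of measuring TTFA from the client side, where the clock starts before the request goes out and stops when the first non-empty audio chunk arrives. The helper names (`time_to_first_audio`, `p50_ttfa_ms`) and the `fake_tts_stream` stand-in are hypothetical and are not any vendor's API.

```python
import statistics
import time
from typing import Callable, Iterable, Iterator


def time_to_first_audio(open_stream: Callable[[], Iterable[bytes]]) -> float:
    """Seconds from issuing the request to receiving the first non-empty chunk."""
    start = time.perf_counter()        # clock starts before the request is sent
    for chunk in open_stream():        # covers network, queuing, and inference
        if chunk:                      # skip empty frames; only real audio counts
            return time.perf_counter() - start
    raise RuntimeError("stream ended without any audio")


def p50_ttfa_ms(open_stream: Callable[[], Iterable[bytes]], runs: int = 20) -> float:
    """Median TTFA in milliseconds over several sequential requests."""
    samples = [time_to_first_audio(open_stream) * 1000 for _ in range(runs)]
    return statistics.median(samples)


def fake_tts_stream() -> Iterator[bytes]:
    """Hypothetical stand-in for a vendor's streaming call (assumption, not a real API)."""
    time.sleep(0.02)                   # pretend round trip + queue + inference
    yield b"\x00" * 320                # first audio chunk
    yield b"\x00" * 320                # later chunks do not affect TTFA


if __name__ == "__main__":
    print(f"P50 TTFA: {p50_ttfa_ms(fake_tts_stream, runs=5):.1f} ms")
```

To measure a real provider, you would pass a zero-argument callable that opens that provider's streaming request and yields audio bytes, and run it from the region where your agent actually runs.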
When independent benchmarks measure the thing that actually matters (production TTFA at the median), the numbers move by multiples. Reported P50 TTFA lands near 188ms for Cartesia and in the 264–288ms range for ElevenLabs's fast models. On the figures quoted here the ordering even flips: ElevenLabs advertises the smaller inference number, yet Cartesia posts the lower production TTFA. The absolute gap between "model latency" and "what the user hears" is the single most important fact in this whole category, and it's the one nobody puts on a landing page.
The latency you can buy is the model's. The latency you can't is the network's — and it's the bigger half.
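The arithmetic behind that claim can be checked with the figures quoted above. This is a rough sketch, not a benchmark: it assumes the vendor "model latency" and the independent P50 TTFA are comparable enough to subtract, and it pairs Flash v2.5's 75ms with the 264–288ms range reported for ElevenLabs's fast models, which is an assumption of this example.

```python
# Figures quoted in this article, in milliseconds.
# Pairing each vendor number with a benchmark number is an assumption:
# the two were not measured under the same conditions.
QUOTED = {
    "Cartesia Sonic": {"model_ms": 90, "p50_ttfa_ms": [188]},
    "ElevenLabs Flash v2.5": {"model_ms": 75, "p50_ttfa_ms": [264, 288]},
}


def overhead(model_ms: float, ttfa_ms: float) -> tuple[float, float]:
    """Return (non-model milliseconds, non-model share of TTFA)."""
    extra = ttfa_ms - model_ms
    return extra, extra / ttfa_ms


if __name__ == "__main__":
    for name, figures in QUOTED.items():
        for ttfa in figures["p50_ttfa_ms"]:
            extra, share = overhead(figures["model_ms"], ttfa)
            print(f"{name}: {extra:.0f} ms of {ttfa} ms is not the model ({share:.0%})")
```

On these numbers, roughly 52 percent of Cartesia's median TTFA and 72 to 74 percent of ElevenLabs's is spent outside the model.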
Why Cartesia is fast: it's the architecture
Cartesia's speed isn't a tuning trick; it's a bet on a different model family. The company was founded by the creators of state-space models (Albert Gu, Karan Goel, Chris Ré and colleagues, the people behind S4 and Mamba), and Sonic is an SSM, not a transformer. The relevant property: an SSM carries a fixed-size state, so it processes each new step in roughly constant time and scales linearly with sequence length, whereas the per-step cost of a transformer's attention grows with the context it has already produced. Audio is long, streaming, and latency-sensitive: exactly the signal SSMs were designed to love. That architectural fit is why Cartesia can credibly chase the lowest model latency in the field rather than merely optimizing an inference stack at the margins.
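The per-step difference is easy to see in a toy. The sketch below is not Sonic's architecture, nor S4 or Mamba as published; it is a bare diagonal linear recurrence set against a single-head attention step over a growing cache, written in plain Python so the work per step is visible. All function names are hypothetical.

```python
import math


def ssm_step(state, x, a, b):
    """One step of a diagonal linear recurrence: h_t = a * h_{t-1} + b * x_t.

    Work depends only on the state size, not on how many steps came before.
    """
    return [ai * hi + bi * x for ai, hi, bi in zip(a, state, b)]


def attention_step(query, keys, values):
    """One decoding step of single-head softmax attention over cached keys/values.

    Work grows with len(keys), i.e. with everything generated so far.
    """
    scores = [sum(q * k for q, k in zip(query, key)) for key in keys]
    top = max(scores)                                  # subtract max for stability
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * v[d] for w, v in zip(weights, values)) / total for d in range(dim)]


def ssm_multiplies(state_size: int, t: int) -> int:
    """Multiplications in ssm_step at step t: two per state element, independent of t."""
    return 2 * state_size


def attention_multiplies(dim: int, t: int) -> int:
    """Multiplications in attention_step with t cached entries: scores plus weighted sum."""
    return 2 * dim * t


if __name__ == "__main__":
    for t in (100, 1_000, 10_000):
        print(t, ssm_multiplies(16, t), attention_multiplies(64, t))
```

The recurrence costs the same at step 10,000 as at step 100, while the attention step costs a hundred times more. Production transformers blunt this with caching and optimized kernels, so treat the counts as an illustration of the scaling, not a latency prediction.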
Why ElevenLabs still wins the ear
If Cartesia owns the clock, ElevenLabs owns the timbre. Its strength is fidelity and voice cloning — the naturalness and character that make a synthetic voice stop sounding synthetic. Its model lineup is explicitly a latency-quality dial: Flash trades some richness for speed, Turbo sits in the middle, and the multilingual flagship spends latency to buy the most lifelike output. For an agent where the brand is the voice — a premium concierge, a character, a narrator — the few hundred milliseconds may be worth paying. The tradeoff is the product decision, and ElevenLabs is honest that it's a dial, not a free lunch.
Why a tiny open model is the sleeper pick
Here's the non-obvious move. If the network round trip is the dominant, un-optimizable chunk of cloud TTFA, then the highest-leverage thing you can do is delete the network hop — by running the model next to your agent. That's newly practical because of Kokoro-82M: an 82-million-parameter, Apache-2.0 model built on a StyleTTS2 architecture with an ISTFTNet vocoder, about 300MB on disk, that generates speech faster than real time on a plain CPU and topped the TTS Arena leaderboard despite its size.
Co-locate Kokoro with your agent (same box, same data center) and you trade the variable, vendor-controlled latency floor of a cloud API for a fixed one you own and can profile. You also escape per-character pricing at volume and keep audio on your own infrastructure, which matters for regulated or on-prem deployments. The cost is real: you give up the top-tier naturalness and the large, polished voice libraries of the frontier APIs, and you take on the ops of serving a model. (This is the same architecture-vs-managed tension that runs through the rest of the voice stack; see how it plays out for transcription and for the orchestration layer that ties STT, the LLM, and TTS together.) It's no coincidence that small open TTS models, Kokoro among them, are quietly powering a lot of self-hosted narration where latency and cost matter more than a celebrity voice.
The decision, made plainly
- Cartesia Sonic when raw streaming latency is the product and you want the lowest cloud TTFA available, architecture-backed.
- ElevenLabs when the voice itself is the experience and you'll spend a few hundred milliseconds to get fidelity and cloning.
- Kokoro (self-hosted) when you want to own your latency floor, run on-prem or at the edge, or cut cost at volume — and can accept good-not-frontier audio.
The trap is benchmarking on the wrong number. Don't choose the voice that sounds best in a quiet demo; choose the one whose first millisecond arrives soonest in the place your users actually are. Price the round trip, not the rendering, and the architecture chooses itself.



