If you are building a voice agent and you choose your speech-to-text by reading the accuracy leaderboard, you will pick the wrong model. Not because the leaderboard lies — it's one of the few neutral references in this space — but because it measures the thing that matters least for a live conversation, and is silent on the two things that matter most.
The HuggingFace Open ASR Leaderboard ranks transcription models by word error rate on pre-recorded English audio, processed in batch. By that measure the winners are open weights: NVIDIA's Parakeet TDT 0.6B lands around 6.05% WER, OpenAI's Whisper large-v3 around 7.44%. Clean, comparable, and almost beside the point. A voice agent is not transcribing a file. It's holding a conversation in real time, and what makes that conversation feel human is latency and turn-taking — neither of which the leaderboard measures at all.
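For reference, word error rate is the word-level edit distance between a reference transcript and the model's output, divided by the number of reference words. A minimal sketch, assuming plain whitespace tokenization and no text normalization (the leaderboard applies its own normalization, which this does not attempt to reproduce):

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / reference word count."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between ref[:i-1] and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # deletion
                curr[j - 1] + 1,     # insertion
                prev[j - 1] + cost,  # substitution or match
            )
        prev = curr
    return prev[-1] / len(ref)
```

Notice what the function takes as input: two finished strings. Nothing in it knows when a word arrived or whether the speaker was done.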
Why the leaderboard winners lose the conversation
Whisper is the most-installed speech model on earth and a poor direct fit for a live agent, for reasons that have nothing to do with accuracy. It processes a sliding 30-second window with no native streaming, so it isn't built to emit words as you speak. It has no concept of end-of-turn — it transcribes, it doesn't take turns. And it hallucinates on silence: fed non-speech, it will confidently invent phrases like "Thank you for watching," a failure the "Careless Whisper" study found in roughly 1% of transcripts. For batch captioning, none of this matters. For a bot that has to know when you stopped talking, all of it does.
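One common mitigation for the silence problem is to gate audio before it reaches the model, so non-speech never gets transcribed. The sketch below assumes 16-bit little-endian mono PCM frames and uses a crude energy threshold; the function names and the threshold value are hypothetical, and a real deployment would more likely use a trained voice-activity detector.

```python
import math
import struct


def frame_rms(frame: bytes) -> float:
    """Root-mean-square amplitude of a 16-bit little-endian mono PCM frame."""
    count = len(frame) // 2
    if count == 0:
        return 0.0
    samples = struct.unpack(f"<{count}h", frame[: count * 2])
    return math.sqrt(sum(s * s for s in samples) / count)


def drop_silence(frames, threshold=500.0):
    """Yield only frames whose energy is at or above the assumed threshold."""
    for frame in frames:
        if frame_rms(frame) >= threshold:
            yield frame
```

This keeps phantom text out of the transcript, but it still does not tell the agent when a turn has ended.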
The open ecosystem papers over the speed problem (faster-whisper runs inference up to 4x faster, WhisperX adds word timestamps and diarization) but not the shape problem. Neither adds native streaming or endpointing. And the open model that genuinely beats Whisper now isn't Whisper at all: NVIDIA's Parakeet transcribes at better accuracy and more than an order of magnitude faster (an RTFx in the thousands versus Whisper's ~146), under a license that permits commercial use. In 2026, self-hosted voice stacks quietly default to Parakeet. But it, too, ships a chunked streaming pipeline with no built-in turn detection. You still have to build the part that makes the agent feel alive.
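To make the RTFx gap concrete: RTFx is the inverse real-time factor, seconds of audio processed per second of compute. A quick back-of-envelope sketch, where 146 comes from the figure above and 3000 is an illustrative stand-in for "in the thousands", not a measured value:

```python
def processing_seconds(audio_seconds: float, rtfx: float) -> float:
    """Compute time needed to transcribe audio at a given inverse real-time factor."""
    if rtfx <= 0:
        raise ValueError("rtfx must be positive")
    return audio_seconds / rtfx


one_hour = 3600.0
whisper_time = processing_seconds(one_hour, 146.0)    # ~146 RTFx, from the text
parakeet_time = processing_seconds(one_hour, 3000.0)  # assumed illustrative value
speedup = whisper_time / parakeet_time
```

Under those assumptions an hour of audio takes about 25 seconds at Whisper's rate and about 1.2 seconds at the assumed Parakeet rate, roughly a 20x gap. It is a batch throughput number, though, and says nothing about how quickly a single live utterance comes back.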
What the hosted leaders actually sell now
Here is the shift worth noticing. Deepgram and AssemblyAI used to publish dueling WER benchmarks against each other. In 2026 they mostly stopped, because they figured out that accuracy is table stakes and turn-taking is the product.
The hard part of voice was never hearing the words. It's knowing when the human is done saying them.
Deepgram's answer is Flux, which it calls conversational speech recognition. Instead of bolting a separate voice-activity detector and endpointing heuristic onto a transcription model, Flux folds end-of-turn detection into the STT model itself, emitting explicit StartOfTurn and EndOfTurn events with a median end-of-turn detection latency of roughly 300ms or less. Deepgram's own numbers (vendor-sourced, so discount accordingly) claim this shaves 200–600ms off agent response latency versus a stitched STT+VAD pipeline and cuts false interruptions by around 30%, at Nova-3-level accuracy (~6.84% streaming WER). The accuracy is the floor. The turn events are the pitch.
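Here is roughly what consuming turn events looks like on the agent side. This is a sketch under stated assumptions: only the StartOfTurn and EndOfTurn names come from the description above, while the dict shape, the "transcript" field, and the TurnTracker class are hypothetical and not Deepgram's actual message format.

```python
from dataclasses import dataclass, field


@dataclass
class TurnTracker:
    """Tracks who holds the floor based on turn events from the STT layer."""

    agent_speaking: bool = False
    actions: list = field(default_factory=list)

    def handle(self, event: dict) -> None:
        kind = event.get("type")
        if kind == "StartOfTurn":
            # User began talking: stop agent audio if it is playing (barge-in).
            if self.agent_speaking:
                self.agent_speaking = False
                self.actions.append("stop_playback")
        elif kind == "EndOfTurn":
            # User finished: hand the transcript to the LLM and start replying.
            self.actions.append(("respond", event.get("transcript", "")))
            self.agent_speaking = True
```

The point of the design is what is missing: no silence timers and no separate VAD thread. The agent loop reacts to two events and nothing else.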
AssemblyAI's Universal-Streaming makes a different, equally architectural bet: immutable transcripts. Most streaming STT emits volatile "interim" results that get silently rewritten a beat later as more audio arrives — which means an agent's LLM can't safely start reasoning on early words, because they might change. Universal-Streaming never rewrites an emitted word. Partials land in roughly 300ms and stay put, so your agent can begin planning its response on the first half of a sentence without risk. Pair that with intelligent endpointing that fuses acoustic, semantic, and silence cues, and at around $0.15/hour it's the cheapest serious option here. The immutability is the quiet masterstroke — it removes a whole class of race condition from your agent loop.
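To see why immutability simplifies the agent loop, consider a consumer that forwards words to a planner as they arrive. The sketch below is hypothetical (the function names and the idea of partials arriving as plain strings are assumptions, not AssemblyAI's API). It relies on each partial being an extension of the previous one, and fails loudly if an earlier word is rewritten.

```python
def new_words(previous: list, partial: list) -> list:
    """Return words appended since the last partial, assuming immutability.

    Raises if an already-emitted word changed, which is exactly what a
    mutable "interim" stream is allowed to do.
    """
    if partial[: len(previous)] != previous:
        raise ValueError("earlier words were rewritten; stream is not immutable")
    return partial[len(previous):]


def feed_planner(partials):
    """Forward each word to the planner exactly once, as soon as it arrives."""
    seen = []
    forwarded = []
    for text in partials:
        words = text.split()
        forwarded.extend(new_words(seen, words))
        seen = words
    return forwarded
```

With a mutable stream, the ValueError branch is the normal case, and the only safe strategy is to wait for the final transcript before handing anything to the LLM.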
The actual decision tree
Stop comparing word error rates and ask where the agent runs and who owns the turn.
- You want a great-feeling live agent, fast, hosted. Deepgram Flux or AssemblyAI Universal-Streaming. You're buying turn detection and immutable low-latency partials — the things that decide whether the bot talks over the user — not a tenth of a percent of WER. Choose Flux if you want the turn events handed to you; choose AssemblyAI if immutable partials and price-per-hour matter more.
- You need private, cheap, self-hosted, and you can build the rest. Parakeet, not Whisper. It wins on both accuracy and speed and its license permits commercial use; budget the engineering to add your own VAD and endpointing, because that's the part it doesn't give you, and the part that matters.
- You're transcribing recordings, not holding conversations. This is where raw Whisper (via faster-whisper) still earns its 100k stars. Just keep it away from live turns, and watch it for phantom text on silence.
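For the self-hosted path, the simplest endpointing you would have to build yourself is a silence timeout layered on per-frame VAD decisions. A minimal sketch, assuming 20ms frames and a 600ms timeout (both values, and the class name, are hypothetical starting points rather than recommendations):

```python
from typing import Optional


class SilenceEndpointer:
    """Declares end-of-turn after a run of silence that follows speech."""

    def __init__(self, frame_ms: int = 20, silence_ms: int = 600):
        self.frames_needed = max(1, silence_ms // frame_ms)
        self.in_turn = False
        self.silent_frames = 0

    def update(self, is_speech: bool) -> Optional[str]:
        """Feed one VAD decision; returns "start", "end", or None."""
        if is_speech:
            self.silent_frames = 0  # any speech resets the silence run
            if not self.in_turn:
                self.in_turn = True
                return "start"
            return None
        if self.in_turn:
            self.silent_frames += 1
            if self.silent_frames >= self.frames_needed:
                self.in_turn = False
                self.silent_frames = 0
                return "end"
        return None
```

The timeout is the whole tradeoff. Set it short and the agent cuts off anyone who pauses mid-thought; set it long and every reply starts with dead air. That tension is what endpointing on semantic cues, rather than silence alone, is meant to relieve.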
The throughline is that voice-agent STT stopped being a transcription problem and became a conversation problem. The leaderboard still ranks the transcribers. The agents that feel human are running on whoever solved turn-taking — which is exactly the seam your orchestration framework is trying to manage, and exactly the reason to pick the STT layer for how it ends a turn, not how it spells a word.



