Open the architecture diagram for almost any new voice agent and you will find, somewhere between the microphone and the model, a box labeled speaker diarization. It is there because diarization sounds like exactly what a voice agent needs: a system that listens to a conversation and decides who said what. Drop it in the live pipeline, the reasoning goes, and the agent will always know whether it's hearing the user or its own echo.
It's the wrong box, in the wrong place, solving a problem the agent doesn't have.
Diarization is, precisely, the task of partitioning audio into segments and clustering those segments by voice embedding into unknown speakers (Speaker A, Speaker B), with no prior knowledge of who is in the room. AssemblyAI's 2026 roundup is a good map of the field. The key word is "unknown." The entire difficulty is that you don't know how many people there are or which voice is which, so you cluster embeddings and hope the geometry separates them.
Now count the speakers in a one-on-one voice agent. There are two. One of them is the agent — and you did not record the agent's voice off a microphone and try to recognize it. You synthesized it. You know, to the sample, exactly which audio is the bot, because your own text-to-speech produced it.
In a 1:1 agent, one of the two voices is labeled for free. There is nothing left to cluster.
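A minimal sketch of what "labeled for free" can look like, assuming the agent logs the intervals during which its own TTS audio is playing, and that an upstream VAD (behind echo cancellation) reports whether each mic frame contains speech. The names `Interval` and `label_frame` are hypothetical and not taken from any framework.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class Interval:
    """A span of agent TTS playback, in milliseconds (end is exclusive)."""
    start_ms: int
    end_ms: int


def label_frame(
    frame_start_ms: int,
    frame_ms: int,
    tts_intervals: Sequence[Interval],
    mic_has_speech: bool,
) -> str:
    """Label one mic frame with no embeddings and no clustering.

    Assumption: mic_has_speech comes from a VAD running on
    echo-cancelled audio, so the agent's own voice does not trigger it.
    """
    frame_end_ms = frame_start_ms + frame_ms
    agent_playing = any(
        iv.start_ms < frame_end_ms and frame_start_ms < iv.end_ms
        for iv in tts_intervals
    )
    if agent_playing and mic_has_speech:
        return "overlap"  # likely barge-in: the user is talking over the agent
    if agent_playing:
        return "agent"
    if mic_has_speech:
        return "user"
    return "silence"
```

The only inputs are a playback log and a speech/no-speech flag. Speaker identity falls out of bookkeeping the agent already does.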
The job you actually have is turn-taking
Strip diarization out and ask what the live loop genuinely needs to know, and the answer isn't who is talking. It's when the human is done talking, so the agent can take its turn without stepping on the end of a sentence. That is turn detection — voice-activity detection plus an end-of-utterance model — and it is a different task with a different shape, one this publication has pulled apart before. VAD asks "is there speech right now"; turn detection asks "is the person finished." Neither needs to know the speaker's identity, because the identity was never in question.
This is why the production voice frameworks don't put a diarizer in the hot path. LiveKit's voice pipeline and Pipecat ship turn detectors — smart end-of-turn models, predictive silence — not speaker-clustering models. They are answering "is it my turn," over and over, in milliseconds.
The tell is in the latency numbers. When a streaming "diarization" pipeline feels sluggish, the lag is almost never the speaker clustering. As AssemblyAI puts it plainly in their streaming diarization guide, the dominant latency factor isn't the diarization at all — it's having to wait for someone to stop talking before a segment can be finalized. The slow part is turn detection wearing a diarization label.
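To make the shape of the turn-taking loop concrete, here is a deliberately simple sketch. It is not how LiveKit or Pipecat implement turn detection: it swaps the learned end-of-utterance model for a fixed silence timeout, and it assumes some upstream VAD already supplies one speech/no-speech decision per frame. The class name `Endpointer` and all thresholds are illustrative.

```python
class Endpointer:
    """Toy end-of-turn detector: VAD decisions in, "turn ended" out.

    Assumption: a fixed silence timeout stands in for the learned
    end-of-utterance model that production turn detectors use.
    """

    def __init__(self, frame_ms: int = 20, silence_ms: int = 600,
                 min_speech_ms: int = 100) -> None:
        self.frame_ms = frame_ms
        self.silence_ms = silence_ms
        self.min_speech_ms = min_speech_ms
        self._speech_run = 0   # ms of speech seen in the current candidate turn
        self._silence_run = 0  # ms of trailing silence inside an open turn
        self._in_turn = False

    def push(self, is_speech: bool) -> bool:
        """Feed one VAD decision; return True on the frame that ends a turn."""
        if is_speech:
            self._speech_run += self.frame_ms
            self._silence_run = 0
            if self._speech_run >= self.min_speech_ms:
                self._in_turn = True  # enough speech to count as a real turn
            return False

        if not self._in_turn:
            self._speech_run = 0  # discard blips too short to open a turn
            return False

        self._silence_run += self.frame_ms
        if self._silence_run >= self.silence_ms:
            # The user has been quiet long enough: hand the turn to the agent.
            self._speech_run = 0
            self._silence_run = 0
            self._in_turn = False
            return True
        return False
```

Note that `silence_ms` is a hard floor on response latency in this sketch: the agent cannot start speaking until that much silence has elapsed. That is the waiting cost described above, and speaker identity appears nowhere in the loop.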
Where diarization actually earns its place
None of this means diarization is useless to a voice product. It means it belongs in two specific regimes, and the live one-on-one loop is neither.
Three or more humans on one stream. A meeting notetaker, a conference bridge, a sales-call assistant joining a room of people. Now you genuinely have unknown speakers, and the free-label trick evaporates. This is real-time, multi-party, and hard: a label assigned while someone is still mid-word can't be taken back, and accuracy slides as speaker count climbs, because each person contributes less audio to build a profile from. The 2026 answers are online diarizers. NVIDIA's Streaming Sortformer collapses the old segment-embed-cluster pipeline into one end-to-end transformer that emits frame-level labels and tracks up to four participants as they enter the stream, at the cost of requiring an NVIDIA GPU. diart takes the open-source route, wrapping pyannote's segmentation and embedding models in an incremental clustering loop over a rolling buffer refreshed roughly every 500 ms.
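For intuition about why online labels are hard to revise, here is a toy incremental clustering loop. It is not diart's or Sortformer's actual algorithm. It assumes some embedding model has already turned each audio chunk into a vector, and the class name `OnlineSpeakerTracker`, the similarity threshold, and the speaker cap are all illustrative choices.

```python
import math
from typing import List, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class OnlineSpeakerTracker:
    """Assign each incoming embedding to a running speaker centroid."""

    def __init__(self, threshold: float = 0.7, max_speakers: int = 4) -> None:
        self.threshold = threshold
        self.max_speakers = max_speakers
        self.centroids: List[List[float]] = []
        self.counts: List[int] = []

    def assign(self, emb: Sequence[float]) -> int:
        """Return a speaker index for this chunk; the label is final."""
        best, best_sim = None, -2.0
        for i, centroid in enumerate(self.centroids):
            sim = cosine(emb, centroid)
            if sim > best_sim:
                best, best_sim = i, sim

        at_capacity = len(self.centroids) >= self.max_speakers
        if best is not None and (best_sim >= self.threshold or at_capacity):
            # Fold the new embedding into the matched centroid (running mean).
            n = self.counts[best]
            self.centroids[best] = [
                (c * n + e) / (n + 1) for c, e in zip(self.centroids[best], emb)
            ]
            self.counts[best] = n + 1
            return best

        # No centroid is close enough and there is room: open a new speaker.
        self.centroids.append(list(emb))
        self.counts.append(1)
        return len(self.centroids) - 1
```

Each call commits to a label using only the audio seen so far. A speaker with few chunks has a noisy centroid, which is one way to see why accuracy drops as more people share the same stream.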
After the call, in batch. For analytics — transcript attribution, talk-time ratios, "what did the customer object to" — you have the whole recording and all the time in the world, and here diarization is unambiguously the right tool. pyannote.audio 3.1+ is the open-source bar, reporting on the order of 18.8% diarization error rate on AMI and 21.7% on DIHARD III, with powerset segmentation to handle overlapping speech. Run it offline, against the full audio, where a label can be revised once more context arrives.
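Once a batch diarizer has produced labeled segments, the analytics are mostly arithmetic. As a small sketch, assume the output has been flattened into `(speaker, start_s, end_s)` tuples. That tuple layout and the function name `talk_time_ratios` are hypothetical, and a real diarizer's output would need adapting to it.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple

Segment = Tuple[str, float, float]  # (speaker label, start seconds, end seconds)


def talk_time_ratios(segments: Iterable[Segment]) -> Dict[str, float]:
    """Return each speaker's share of total attributed speech time.

    Overlapping speech counts toward every speaker involved, so the
    shares are fractions of summed speech time, not of wall-clock time.
    """
    totals: Dict[str, float] = defaultdict(float)
    for speaker, start_s, end_s in segments:
        totals[speaker] += max(0.0, end_s - start_s)  # ignore inverted spans

    total = sum(totals.values())
    if total == 0:
        return {}
    return {speaker: t / total for speaker, t in totals.items()}
```

Because this runs after the call, the segments can come from a diarizer that saw the entire recording, so any labels it revised along the way are already settled by the time the ratios are computed.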
How to actually choose
If you're building a one-on-one agent, the honest answer is you don't choose a diarizer at all — you choose a turn detector, and the speech-to-text decision is its own axis. The diarization question only becomes real the moment a third human can be on the line, or the moment the call ends and you want to study it.
When it is real: reach for pyannote.audio for batch analytics and full self-hosting; NeMo Streaming Sortformer for live multi-party when you already run on NVIDIA hardware and want an end-to-end model rather than a tuned pipeline; Deepgram's or AssemblyAI's built-in diarization when you want multi-party labels to ride along inside a hosted streaming transcript and you'd rather not operate any of it. A useful middle idea, if you're building your own, is Turn-to-Diarize: let speaker-turn detection drive the diarizer, so the two tasks reinforce each other instead of competing for the same millisecond.
The general rule survives the specifics. Diarization is a clustering tool for unknown speakers. A voice agent's headline case has no unknown speakers and no clustering to do — it has a turn to take. Spend the latency budget there, and evaluate the agent on whether it waits for you, not on whether it can name a voice it already wrote.



