Pick apart a voice agent that feels subtly hostile to talk to, and the fault is rarely the voice. The text-to-speech is fine. The model is fine. What's wrong is the timing: you pause to find a word and it pounces, finishing your sentence with an answer to a question you hadn't finished asking. The agent didn't decide you were done. It decided you had gone quiet, and it treated those as the same thing.
They are not the same thing, and the gap between them is the whole problem.
Two different questions
Voice Activity Detection answers one question: is there speech right now? Models like Silero VAD (MIT) and WebRTC's classic VAD slice the incoming audio into frames and label each one speech or silence, fast and cheap enough to run on a CPU at the edge. That's genuinely useful: it's the gate that tells the rest of the pipeline when to start listening and when to stop. But a VAD has no idea what you said. It cannot tell a finished sentence from a held breath, because all it reports is speech or no speech, and the silence that follows each looks identical.
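To make the frame-labelling idea concrete, here is a toy sketch. It is a plain energy threshold, not Silero VAD or WebRTC VAD (both are far more robust), and the function name, frame size, and threshold are assumptions chosen for illustration.

```python
import math

def label_frames(samples, frame_size=160, threshold=0.02):
    """Label each frame True (speech) or False (silence) by RMS energy.

    A toy stand-in for a real VAD: frame_size and threshold are
    illustrative assumptions, not values from any library.
    """
    labels = []
    for start in range(0, len(samples) - frame_size + 1, frame_size):
        frame = samples[start:start + frame_size]
        rms = math.sqrt(sum(x * x for x in frame) / frame_size)
        labels.append(rms >= threshold)
    return labels

# Two frames of a loud 440 Hz tone (16 kHz sampling), then two frames of near-silence.
tone = [0.5 * math.sin(2 * math.pi * 440 * n / 16000) for n in range(320)]
quiet = [0.001] * 320
print(label_frames(tone + quiet))  # [True, True, False, False]
```

Note what the output contains: one boolean per frame and nothing else. Whatever the speaker meant is already gone by this point.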
Turn detection answers the question that actually matters: are they done? And here's the trap the entire first generation of voice agents fell into — they answered it with a stopwatch. VAD reports silence; a timer counts off, say, 700 milliseconds; the agent takes the floor. The design is seductive because it's trivial to build and it mostly works on short, clipped utterances. It falls apart on the way humans actually talk.
"I'd like to book a flight to—" is obviously unfinished, even after a full second of silence. A clock cannot hear that. A model that reads the words can.
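A minimal sketch of the stopwatch design shows the failure directly. The function name, the 20 ms frame length, and the timings in the example are assumptions for illustration; the 700 ms timeout is the figure used above.

```python
def timer_endpoint(frames, frame_ms=20, timeout_ms=700):
    """Return the time (ms) at which a silence timer takes the floor, or None.

    frames: per-frame VAD labels (True = speech). The timer fires once
    speech has been heard and silence has then lasted timeout_ms.
    """
    heard_speech = False
    silence_ms = 0
    for i, is_speech in enumerate(frames):
        if is_speech:
            heard_speech = True
            silence_ms = 0
        elif heard_speech:
            silence_ms += frame_ms
            if silence_ms >= timeout_ms:
                return (i + 1) * frame_ms
    return None

# "I'd like to book a flight to" (1.5 s), a 1 s pause to think, then the city (0.5 s).
frames = [True] * 75 + [False] * 50 + [True] * 25
print(timer_endpoint(frames))  # 2200
```

The timer fires at 2200 ms, 300 ms before the speaker resumes at 2500 ms. Nothing in the input could have told it otherwise, since the frames carry no words.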
Tune the timeout and you only move the pain around. Make it short and the agent is snappy and rude, jumping on every pause. Make it long and it stops interrupting but now there's a dead beat of latency after everything you say, which reads as slow and dim. There is no silence threshold that is both polite and responsive, because the signal you need — is this thought complete — isn't in the silence at all.
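The trade-off can be put in numbers with a small sketch. The pause lengths below are invented for illustration, not measured data, and the function name is hypothetical.

```python
def evaluate_timeout(timeout_ms, mid_turn_pauses_ms):
    """Score one silence timeout against a list of mid-turn pause lengths.

    Returns (interruptions, added_latency_ms): how many unfinished turns
    the timer would cut into, and the delay it adds after a real turn end.
    """
    interruptions = sum(1 for p in mid_turn_pauses_ms if p >= timeout_ms)
    added_latency_ms = timeout_ms  # the timer always waits its full length
    return interruptions, added_latency_ms

# Hypothetical pause lengths (ms) inside unfinished turns.
pauses = [250, 400, 650, 900, 1200, 1800]
for timeout in (300, 700, 1500, 2000):
    cut, delay = evaluate_timeout(timeout, pauses)
    print(f"timeout={timeout}ms interruptions={cut} added_latency={delay}ms")
```

On this made-up data the interruption count only reaches zero at a 2000 ms timeout, which means two full seconds of dead air after every finished sentence.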
The fix is to read the sentence, not the clock
What the ecosystem converged on in 2025–26 is a dedicated end-of-utterance model: a small classifier that estimates the probability you're finished from the transcript so far, the prosody of your voice, or both — and then sets the response timing from that probability. Trail off with a rising "uhm…" and the score stays low and the agent waits. Land a complete sentence with falling intonation and it scores high and the agent answers immediately. Same conversation, but now latency is low and the agent stops cutting you off, because the two goals were never actually in tension — the silence timer just made them look that way.
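One way to picture "sets the response timing from that probability" is a simple mapping from score to wait time. This is a sketch under stated assumptions: the linear interpolation, the 100 ms and 2000 ms bounds, and the function name are all illustrative, and the score itself would come from whichever end-of-utterance model is in use.

```python
def wait_before_reply_ms(p_complete, min_wait_ms=100, max_wait_ms=2000):
    """Map an end-of-utterance probability to how long to hold the floor open.

    Linear interpolation: p=1.0 gives min_wait_ms, p=0.0 gives max_wait_ms.
    The mapping and both bounds are illustrative assumptions.
    """
    if not 0.0 <= p_complete <= 1.0:
        raise ValueError("p_complete must be in [0, 1]")
    return round(max_wait_ms - p_complete * (max_wait_ms - min_wait_ms))

print(wait_before_reply_ms(1.0))  # 100: complete sentence, answer almost at once
print(wait_before_reply_ms(0.5))  # 1050: unsure, give it a beat
print(wait_before_reply_ms(0.0))  # 2000: trailing "uhm", keep waiting
```

The silence timer is the degenerate case of this function: one that returns the same wait no matter what was said.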
The approaches differ mainly in where they read the signal. Pipecat's Smart Turn (BSD-2, open weights, community-trained) reads the audio — a native-waveform model that judges intonation and pace, outputs a single "complete" probability, and is small enough to run on a CPU. LiveKit's open turn detector fuses an LLM backbone that reads the transcript with audio encoders that hear the delivery, across 14 languages. OpenAI's Realtime API exposes the idea as a semantic_vad mode with an eagerness dial — low, medium, high — that caps how long it will wait before responding. Deepgram's Flux takes the most integrated route, baking end-of-turn prediction directly into the speech-to-text model so there's no separate component to bolt on. Different architectures, one shared admission: silence was never the right feature.
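As an example of how small the configuration surface can be, the sketch below builds a session.update event that selects semantic_vad. The helper name is hypothetical, and the field nesting follows one published shape of the Realtime API, so treat it as an assumption and check it against the API version you target. The "auto" value is accepted alongside the three levels named above.

```python
import json

VALID_EAGERNESS = {"low", "medium", "high", "auto"}

def semantic_vad_update(eagerness="low"):
    """Build a session.update event selecting semantic_vad turn detection.

    The nesting under "session" is an assumption based on one published
    shape of the Realtime API; verify it for the version in use.
    """
    if eagerness not in VALID_EAGERNESS:
        raise ValueError(f"unknown eagerness: {eagerness!r}")
    return {
        "type": "session.update",
        "session": {
            "turn_detection": {"type": "semantic_vad", "eagerness": eagerness}
        },
    }

# The event would be sent as JSON over the session's connection.
print(json.dumps(semantic_vad_update("low"), indent=2))
```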
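If you're choosing where this lives in your stack, it tracks the larger speech-to-speech versus cascaded split. With a speech-to-speech API like OpenAI's Realtime, turn detection is a setting on the service itself. In a cascaded pipeline you pick the component: an STT-integrated model like Flux puts it in the transcription stage, while a standalone turn detector drops into a framework like LiveKit or Pipecat regardless of which transcription model you feed it.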
The mirror-image bug: barge-in
End-of-turn is only half the conversation. The other half is what happens when you interrupt the agent. The naive version is symmetrical to the naive timeout: any speech the microphone hears while the agent is talking stops the agent. And it's wrong for the symmetrical reason. Human listeners constantly emit backchannels — "uh-huh," "right," "yeah" — that mean keep going, not stop. An agent that flinches silent every time you murmur agreement is as broken as one that interrupts you; it just fails politely instead of rudely.
So the current frontier in interruption handling is the same move applied in reverse: a model that separates a genuine barge-in (new content, you want the floor) from a backchannel (an acknowledgement, keep talking) — and, when it guesses wrong and pauses for what turns out to be an "mm-hm," resumes the turn where it left off. It's the same lesson the silence timer taught, stated once more: in conversation, the presence of sound tells you almost nothing. The meaning of the sound tells you everything, and that's the part you have to model.
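The control flow, pause first and then either yield or resume, can be sketched as follows. The word list is a toy heuristic standing in for a real classifier, and every name here (BACKCHANNELS, classify_overlap, AgentTurn) is hypothetical.

```python
BACKCHANNELS = {"uh-huh", "uh huh", "mm-hm", "mhm", "right", "yeah", "ok", "okay", "sure"}

def classify_overlap(transcript):
    """Return 'backchannel' or 'barge_in' for speech heard while the agent talks.

    Toy heuristic: a short acknowledgement from a fixed list is a backchannel;
    anything else is treated as the user wanting the floor.
    """
    text = transcript.lower().strip(" .,!?")
    return "backchannel" if text in BACKCHANNELS else "barge_in"

class AgentTurn:
    """Tracks how much of the agent's reply has been spoken so it can resume."""

    def __init__(self, words):
        self.words = words
        self.position = 0
        self.paused = False

    def speak(self, n_words):
        """Speak the next n_words and advance the resume point."""
        spoken = self.words[self.position:self.position + n_words]
        self.position += len(spoken)
        return spoken

    def on_user_speech(self, transcript):
        """Pause on any overlapping speech, then resume if it was a backchannel."""
        self.paused = True
        kind = classify_overlap(transcript)
        if kind == "backchannel":
            self.paused = False  # false alarm: continue from self.position
        return kind

turn = AgentTurn("your flight to lisbon leaves at nine tomorrow".split())
print(turn.speak(4))                              # ['your', 'flight', 'to', 'lisbon']
print(turn.on_user_speech("Uh-huh."))             # backchannel
print(turn.speak(4))                              # ['leaves', 'at', 'nine', 'tomorrow']
print(turn.on_user_speech("Wait, make it ten."))  # barge_in
```

A fixed word list will misjudge plenty of real speech (a "yeah" that opens a new request, for instance), which is why the classifier slot calls for a trained model.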



