Almost every agent framework worth its stars is built on an assumption so deep nobody states it: there is a turn. The user produces input, the loop runs, the agent produces output, and then — crucially — everyone waits politely for the next turn. Tracing, evals, tool-calling, the whole apparatus this column keeps recommending, all of it lives inside that discrete request-response rhythm borrowed from the web.
Voice deletes the turn. When you build an agent that you can interrupt mid-sentence, that you can talk over, and that has to decide in real time whether your "mm-hm" was a request to stop or just a listener nodding along, you are no longer in request-response. You are in a duplex stream where the model must listen and speak at the same time, and the politeness assumption is the first thing to break. The repos that handle this are not text agents with a microphone bolted on. They are a different species, and worth knowing even if you never ship a phone line, because they show what the loop looks like once you remove the one assumption everyone else relies on.
The orchestrators: managing a pipeline that can't pause
The dominant pattern is still a pipeline — speech-to-text, then the LLM, then text-to-speech — but strung together so it streams, with a latency budget instead of a turn.
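To make "latency budget" concrete, here is a minimal sketch of the bookkeeping. Every stage name and millisecond figure is a placeholder invented for illustration, not a measurement of any real system. Because the stages stream, what counts is each stage's time to first output, not its total runtime.

```python
# Illustrative only: stage names and figures are placeholders, not benchmarks.
BUDGET_MS = 800  # hypothetical voice-to-voice target

stage_ms = {
    "endpointing": 200,      # deciding the user has stopped talking
    "stt_final": 100,        # finalizing the transcript
    "llm_first_token": 250,  # time to the first generated token
    "tts_first_audio": 150,  # time to the first synthesized audio
    "network": 60,           # transport round trip
}

def remaining_budget(stages: dict[str, int], budget_ms: int) -> int:
    """Milliseconds left after every stage; negative means over budget."""
    return budget_ms - sum(stages.values())

def slowest_stage(stages: dict[str, int]) -> str:
    """Name of the stage that eats the largest share of the budget."""
    return max(stages, key=stages.get)

print(remaining_budget(stage_ms, BUDGET_MS), slowest_stage(stage_ms))
```

With these made-up numbers there are 40 ms to spare, and a single slow stage is enough to blow the budget. That is why each stage gets its own line in the ledger.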
Pipecat is where most people start, because it makes the pipeline legible: each stage is a frame processor, audio flows through as a stream of frames, and interruption is a first-class event rather than an afterthought. It reached v1.0 in April. The thing it teaches you, just by its shape, is that the hard part of voice is not the models — it is the plumbing between them, and specifically what happens to the half-spoken sentence when the user cuts in.
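The half-spoken-sentence problem can be sketched in a few lines. This is plain Python with hypothetical names, not Pipecat's API: a toy output stage that queues synthesized chunks and, on interruption, drops whatever has not been played and reports what the user actually heard.

```python
from collections import deque
from dataclasses import dataclass, field

@dataclass
class Speaker:
    """Hypothetical output stage holding synthesized chunks awaiting playback."""
    pending: deque = field(default_factory=deque)
    spoken: list = field(default_factory=list)

    def enqueue(self, chunks):
        """Queue chunks of the reply for playback."""
        self.pending.extend(chunks)

    def play_one(self):
        """Play the next chunk, if any, and record it as actually heard."""
        if self.pending:
            self.spoken.append(self.pending.popleft())

    def interrupt(self):
        """User cut in: drop unplayed audio and return (heard text, dropped chunks)."""
        dropped = list(self.pending)
        self.pending.clear()
        return " ".join(self.spoken), dropped

speaker = Speaker()
speaker.enqueue(["Your flight", "leaves at nine", "from gate twelve", "on Tuesday"])
speaker.play_one()
speaker.play_one()
heard, dropped = speaker.interrupt()  # the user talks over the third chunk
```

The design point is the return value. If the conversation history stores the full generated reply rather than `heard`, the model will believe it said things the user never heard.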
LiveKit Agents comes at it from the network layer, which is the right instinct. It is built on WebRTC — the same transport your video calls use — because once you are chasing sub-second voice-to-voice latency, transport is the product. Its recent work is telling: adaptive interruption handling, dynamic endpointing (deciding when you've actually stopped talking), and preemptive generation that starts composing a reply before you've finished. Those are not features you need in a text agent. They only exist because the turn is gone.
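As a rough illustration of what "dynamic endpointing" means, and emphatically not LiveKit's implementation, the sketch below declares a turn over after a run of silent frames, and shortens the required silence when the partial transcript already looks complete. The frame size, both thresholds, and the punctuation heuristic are all assumptions made up for the example.

```python
FRAME_MS = 20  # assumed duration of one audio frame

def endpoint_frame(voiced, silence_ms_needed):
    """Return the index of the frame where the turn is declared over,
    or None if the speaker never pauses long enough."""
    needed_frames = silence_ms_needed // FRAME_MS
    silent_run = 0
    heard_speech = False
    for i, is_voiced in enumerate(voiced):
        if is_voiced:
            heard_speech = True
            silent_run = 0  # any speech resets the silence counter
        elif heard_speech:
            silent_run += 1
            if silent_run >= needed_frames:
                return i
    return None

def silence_needed(partial_transcript):
    """Hypothetical heuristic: a transcript that looks finished needs less silence."""
    looks_done = partial_transcript.rstrip().endswith((".", "?", "!"))
    return 300 if looks_done else 700

# Ten voiced frames followed by forty silent ones (flags from an assumed VAD).
voiced = [True] * 10 + [False] * 40
early = endpoint_frame(voiced, silence_needed("What time is it?"))
late = endpoint_frame(voiced, silence_needed("I was thinking that"))
```

Here the finished-sounding question ends the turn at frame 24, while the trailing clause holds out until frame 44. That 400 ms gap is the gamble: commit early and risk talking over the user, or wait and sound slow.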
Vocode is the unbundled option — fewer opinions, more knobs — for teams who want to own the latency budget stage by stage rather than accept a framework's defaults.
A text agent waits for you to finish. A voice agent has to gamble on when you're finished — and start talking before it knows.
The TTS floor
The pipeline is only as fast as its slowest stage, and for a long time that was speech synthesis. Open weights changed the math.
Kokoro is an 82-million-parameter TTS model — tiny — that sounds far better than its size has any right to, under a permissive license you can run anywhere. (Disclosure: it is the voice you hear narrating pieces on this site.) Its existence is why the synthesis stage stopped being the bottleneck and the bottleneck moved back to coordination — which is the whole point.
The end-to-end heresy
Here is the part that makes the rest of this list provisional. The entire STT→LLM→TTS pipeline exists to bridge a model that thinks in text to a world that speaks in sound. What if the model just spoke?
Moshi is the argument that the pipeline is a workaround. Instead of three models in a row, it models the conversation as two simultaneous audio streams — yours and its — and generates speech directly, predicting text alongside only to sharpen the audio. There is no turn to detect because both parties are always producing; full-duplex is the native mode, not a feature layered on top. Latency drops to roughly 200ms not because the plumbing got faster but because the plumbing is gone.
That is the real tension in this corner of the stack. The orchestrators are getting very good at hiding the seams of a pipeline that, if the end-to-end models keep improving, may not need to exist. Both bets are worth understanding, because they disagree about something fundamental: whether voice is a transport problem you engineer around the model, or a modeling problem that dissolves the transport. Either way, the assumption that an agent takes turns — the one every other framework is built on — was never a law. It was just the only thing text could do.



