Two years ago, building a voice agent meant stitching three services together: a transcriber, a language model, and a text-to-speech engine. That cascaded pipeline was where latency and lost emotion went to live. In 2026 the two hyperscaler answers — OpenAI's Realtime API and Google's Gemini Live API — both collapse the stack into one model that hears audio and speaks audio, keeping tone instead of flattening it through a transcript.
So the choice between them is no longer "whose transcription is better." Both are genuinely speech-to-speech. The choice is an operations decision wearing a model-selection costume, and the spec sheet hides where the real money and the real failure modes are.
The price tag is a trap
Start with what every comparison leads on. OpenAI's generally available gpt-realtime bills $32 per million audio input tokens and $64 per million audio output tokens. Gemini 2.5's native-audio Live model bills $3 and $12. Read those two lines and you book Gemini and move on. A gap of roughly tenfold on input and fivefold on output is not a rounding error.
Except a voice session is not a single request. It is a long, stateful conversation, and the two providers meter that conversation differently. The Gemini Live API bills you for every token in the session context window, on every turn. Because the model retains the conversation as raw audio tokens to preserve acoustic nuance, the accumulated audio is re-charged each time the agent responds. Cost therefore does not scale linearly with call length: a twenty-minute exchange costs more than ten times what a two-minute exchange does, because the back half of a long call pays for the front half again and again.
The cheaper per-token number is attached to the more expensive billing model. That is the whole game.
OpenAI's flat per-token meter, by contrast, charges each chunk of audio once. Its own cost math is mundane: user audio is one token per 100ms, assistant audio one token per 50ms. Predictable, if pricier per unit. Which backend is actually cheaper depends entirely on your median call length — and you cannot know that from the headline rate. Convert both to cost-per-minute at your call length before you trust the sticker.
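To make that conversion concrete, here is a minimal Python sketch of the two metering models as this article describes them. It is a back-of-envelope model, not either provider's billing logic. The OpenAI token rates come from the figures above; the Gemini rate of 25 audio tokens per second is an assumption to verify against current documentation, and the idea that every turn has the same length is a simplification. All function and class names are illustrative.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rates:
    usd_per_m_input: float       # USD per million audio input tokens
    usd_per_m_output: float      # USD per million audio output tokens
    input_tokens_per_sec: float
    output_tokens_per_sec: float


# From the text: 1 token per 100 ms of user audio, 1 per 50 ms of assistant audio.
OPENAI = Rates(32.0, 64.0, 10.0, 20.0)
# ASSUMPTION: 25 audio tokens per second in both directions; check before relying on it.
GEMINI = Rates(3.0, 12.0, 25.0, 25.0)


def flat_cost(r: Rates, turns: int, user_sec: float, agent_sec: float) -> float:
    """Each second of audio is billed exactly once."""
    in_tokens = turns * user_sec * r.input_tokens_per_sec
    out_tokens = turns * agent_sec * r.output_tokens_per_sec
    return (in_tokens * r.usd_per_m_input + out_tokens * r.usd_per_m_output) / 1e6


def cumulative_cost(r: Rates, turns: int, user_sec: float, agent_sec: float) -> float:
    """The whole accumulated context is billed as input on every turn."""
    context = 0.0      # tokens currently held in the session context
    billed_in = 0.0
    billed_out = 0.0
    for _ in range(turns):
        context += user_sec * r.input_tokens_per_sec
        billed_in += context                 # re-charge everything said so far
        out = agent_sec * r.output_tokens_per_sec
        billed_out += out
        context += out                       # the reply joins the context too
    return (billed_in * r.usd_per_m_input + billed_out * r.usd_per_m_output) / 1e6


def per_minute(total_usd: float, turns: int, user_sec: float, agent_sec: float) -> float:
    """Normalize a call's total cost to USD per minute of conversation."""
    return total_usd / (turns * (user_sec + agent_sec) / 60.0)
```

Under these assumptions, doubling the number of turns doubles the flat cost but more than doubles the cumulative one. Plug in your own measured turn lengths and turn counts, then compare the per-minute figures at your median call length rather than at the sticker rate.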
The durable difference is transport
Prices move; price cuts arrive every few months. The architectural fork is stickier, and it is about how bytes reach the model.
OpenAI ships three transports: WebRTC for browsers and mobile, WebSocket for server-to-server, and — the one that matters for production phone agents — native SIP. You can point a phone number, a PBX, or a desk phone directly at sip.api.openai.com and have a voice agent answering calls with no media-server glue in between.
Gemini Live is WebSocket-first. There is no first-party WebRTC and no first-party SIP. For a browser client that wants jitter buffering and NAT traversal, or a phone number that speaks PSTN, you bring your own gateway — Twilio, LiveKit, or Pipecat. That is not a dealbreaker; those tools are excellent and many teams run them anyway. But it is real integration work that the per-token price never mentions, and it changes who is on call when the audio drops.
Who owns the reconnection
The session model compounds the transport gap. OpenAI caps a Realtime session at 60 minutes: a hard ceiling, but a simple one. Gemini's WebSocket connection resets roughly every 10 minutes, and a native-audio session runs about 15 minutes before you must turn on context-window compression to continue.
The fix Google ships is session resumption: cache the resumption token, reconnect transparently, keep the context. It works. But notice what just happened — the "unlimited session" you were promised is unlimited only if you build the reconnection plumbing yourself. The failure mode is different from OpenAI's, too: OpenAI's session ends and you start a new one; Gemini's connection dies mid-call and the user hears nothing unless your resumption code fires correctly.
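The shape of that plumbing can be sketched without committing to any SDK. The Python below is a transport-agnostic illustration, not Google's client code: `connect` is a hypothetical function you would write around your own WebSocket client, and the event dictionaries with `resumption_handle` and `audio` keys are invented for the example. The point is only the control flow: remember the latest handle, and pass it back in when the connection drops.

```python
import asyncio
from typing import AsyncIterator, Callable, Optional

Event = dict  # hypothetical event shape: {"resumption_handle": str} or {"audio": bytes}


async def run_with_resumption(
    connect: Callable[[Optional[str]], AsyncIterator[Event]],
    on_audio: Callable[[bytes], None],
    max_reconnects: int = 5,
    backoff_sec: float = 0.5,
) -> None:
    """Consume a live session, reconnecting with the last handle on failure."""
    handle: Optional[str] = None   # None means "start a fresh session"
    failures = 0                   # consecutive failed connections
    while True:
        try:
            async for event in connect(handle):
                failures = 0       # any received event counts as a healthy link
                if "resumption_handle" in event:
                    handle = event["resumption_handle"]
                elif "audio" in event:
                    on_audio(event["audio"])
            return                 # the stream ended cleanly
        except ConnectionError:
            failures += 1
            if failures > max_reconnects:
                raise              # give up and let the caller end the call
            await asyncio.sleep(backoff_sec * failures)
```

In a real deployment the silence between the drop and the reconnect is what the caller hears, so the backoff and the retry limit are product decisions as much as engineering ones.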
How to actually choose
The honest decision rule is short. If you are building a phone agent — inbound support, outbound calling, anything touching the PSTN — OpenAI's native SIP and flat billing remove two layers of work, and you pay for that convenience. If you are building a browser or app experience, already run LiveKit or Pipecat, and your calls are short, Gemini's per-token price and its affective native audio are hard to beat — provided you've measured the per-turn re-billing on a realistic call.
What you should not do is pick on the number in the pricing table. In voice, the cost that sinks a budget and the bug that wakes you at 3am both live one layer below the sticker — in who stores the audio, who re-bills it, and who owns the socket when it drops.



