OpenAI Unveils Three Real‑Time Voice Models, Expanding AI to Live Conversation, Translation and Streaming Transcription

OpenAI rolled out three new audio models on Tuesday, giving developers a toolbox that moves voice AI from scripted replies to fluid, real‑time interaction. The headline model, GPT‑Realtime‑2, brings the reasoning power of GPT‑5 to live spoken dialogue. It can juggle multiple tools in a single request, narrate its actions and maintain coherence over longer exchanges thanks to a 128K‑token context window. Developers can also dial the model’s reasoning effort up or down, matching compute to the complexity of the user’s query.

Equally striking is GPT‑Realtime‑Translate, which OpenAI touts as the closest approximation yet to Star Trek's Universal Translator. The model supports live speech translation from more than 70 source languages into 13 target languages. In demo footage, a new participant speaking a different language joined an ongoing conversation, and the system instantly rendered both speakers into English without missing a beat.

The third offering, GPT‑Realtime‑Whisper, tackles a long‑standing limitation of speech‑to‑text services: latency. Unlike batch transcription models that wait for a speaker to pause, Whisper streams text as the words are spoken. The capability is ideal for live captions, meeting notes and any workflow where waiting for a full transcript would be a bottleneck.
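The latency difference is easiest to see side by side. The sketch below is purely illustrative (the function names and string-based "audio chunks" are stand-ins, not OpenAI's API): a batch transcriber returns nothing until all audio has arrived, while a streaming transcriber emits a growing transcript as each chunk lands.

```python
from typing import Iterator, List

def batch_transcribe(chunks: List[str]) -> str:
    # Batch: the caller sees nothing until every chunk has arrived.
    return " ".join(chunks)

def stream_transcribe(chunks: List[str]) -> Iterator[str]:
    # Streaming: yield an updated partial transcript after each chunk,
    # so captions can render while the speaker is still talking.
    partial: List[str] = []
    for chunk in chunks:
        partial.append(chunk)
        yield " ".join(partial)

audio = ["thanks", "everyone", "for", "joining"]
for caption in stream_transcribe(audio):
    print(caption)  # each line is a fresh, longer caption
```

For live captions, it is the intermediate yields that matter: a viewer sees "thanks everyone" long before the sentence ends, rather than waiting for the full transcript.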

OpenAI has opened the models to developers today, and several companies are already testing them. Real‑estate platform Zillow is prototyping a voice assistant that can search listings and schedule tours with a single spoken command. Travel aggregator Priceline is experimenting with voice‑driven flight and hotel management, including cancellations and rebookings. Video‑hosting service Vimeo plans to embed Whisper for real‑time captioning of live streams.

Pricing varies by model. Whisper costs $0.017 per minute of audio, Translate is $0.034 per minute, and GPT‑Realtime‑2 is billed at $32 for each million audio input tokens. The tiered structure reflects the differing compute demands of transcription, translation and full‑scale reasoning.
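To put the per-minute rates in perspective, here is a back-of-the-envelope estimate using the figures quoted above (the helper function is illustrative, not part of any SDK; GPT‑Realtime‑2 is omitted because its token-based billing cannot be converted to minutes without knowing tokens per minute of audio):

```python
# Per-minute rates quoted in the article, in USD.
WHISPER_PER_MIN = 0.017
TRANSLATE_PER_MIN = 0.034

def audio_cost(minutes: float, rate: float) -> float:
    """Estimated cost of processing a live audio feed of the given length."""
    return round(minutes * rate, 2)

# Captioning a one-hour live stream with Whisper:
print(audio_cost(60, WHISPER_PER_MIN))    # 60 * 0.017 = 1.02
# Translating that same hour in real time:
print(audio_cost(60, TRANSLATE_PER_MIN))  # 60 * 0.034 = 2.04
```

At roughly a dollar per captioned hour, the economics favor always-on use cases like the live-stream captioning Vimeo is planning.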

Industry observers see the launch as a watershed moment for voice‑first applications. By combining deep reasoning, multilingual translation and instant transcription, OpenAI gives developers the building blocks to create assistants that can book appointments, troubleshoot problems and facilitate cross‑language collaboration—all without the user having to type a single word.

Source: Digital Trends
