OpenAI unveils new tool to simplify AI voice assistant development

OpenAI has unveiled the Realtime API, a tool designed to help developers build low-latency, voice-interactive applications. It is now available in public beta.
According to OpenAI, this new API allows for natural speech-to-speech conversations, integrating multiple processes like speech recognition and text-to-speech into a single step. This is expected to simplify the development of applications that involve real-time voice interactions.
Previously, developers building AI voice assistants had to chain several steps together. First, they converted audio to text with a speech recognition tool, then fed that text into an AI model to generate a response, and finally converted the response back into speech. This approach could introduce delays and made conversations feel less natural.
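The older three-stage approach can be sketched roughly as follows. All three stage functions are hypothetical stand-ins (none of them is a real OpenAI call), and the latency figures are made-up numbers chosen only to show how delays add up when each stage must finish before the next one starts.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass
class StageResult:
    output: str
    latency_ms: int  # illustrative number, not a measurement


def transcribe(audio: bytes) -> StageResult:
    """Hypothetical speech-recognition stage: audio in, text out."""
    return StageResult(output="what is the weather", latency_ms=300)


def generate_reply(text: str) -> StageResult:
    """Hypothetical language-model stage: text in, text out."""
    return StageResult(output=f"You asked: {text}", latency_ms=700)


def synthesize(text: str) -> StageResult:
    """Hypothetical text-to-speech stage: text in, placeholder audio out."""
    return StageResult(output=f"<audio:{text}>", latency_ms=400)


def cascaded_turn(audio: bytes) -> Tuple[str, int]:
    """Run the three stages in sequence and sum their latencies."""
    heard = transcribe(audio)
    reply = generate_reply(heard.output)
    spoken = synthesize(reply.output)
    total_ms = heard.latency_ms + reply.latency_ms + spoken.latency_ms
    return spoken.output, total_ms
```

In this sketch only plain text passes from one stage to the next, so anything carried by the speaker's voice, such as tone or emphasis, is dropped at the first step.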
The Realtime API simplifies this process by combining everything into one step. Developers can now handle both the input (what the user says) and the output (how the app responds) in a single API call. According to OpenAI's official blog, this means conversations flow more smoothly, with less waiting time, and responses sound more human-like because emotion and tone are preserved.
Under the hood, the Realtime API connects apps to OpenAI’s GPT-4o model. The API uses a WebSocket connection, which lets the app and the AI exchange messages in real time. OpenAI says this new system is faster and more fluid than previous methods, which could sometimes feel robotic or delayed.
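To illustrate the message-passing style that a persistent connection allows, here is a toy simulation that uses in-memory queues in place of a real WebSocket. Nothing in it talks to OpenAI: the event names (`audio_chunk`, `audio_reply`, `done`) and the `fake_server` function are invented for illustration and are not the Realtime API's actual message format.

```python
import asyncio
import json


async def fake_server(inbox: asyncio.Queue, outbox: asyncio.Queue) -> None:
    """Hypothetical stand-in for the remote end of the connection."""
    while True:
        event = json.loads(await inbox.get())
        if event["type"] == "done":
            await outbox.put(json.dumps({"type": "done"}))
            return
        # Reply to each chunk as it arrives instead of waiting for
        # the whole utterance.
        reply = {"type": "audio_reply", "for_chunk": event["seq"]}
        await outbox.put(json.dumps(reply))


async def client(chunks: list) -> list:
    """Send chunks over the open channel and collect the replies."""
    to_server: asyncio.Queue = asyncio.Queue()
    from_server: asyncio.Queue = asyncio.Queue()
    server_task = asyncio.create_task(fake_server(to_server, from_server))

    for seq, chunk in enumerate(chunks):
        message = {"type": "audio_chunk", "seq": seq, "data": chunk}
        await to_server.put(json.dumps(message))
        # Yield control so the server can react mid-stream.
        await asyncio.sleep(0)
    await to_server.put(json.dumps({"type": "done"}))

    replies = []
    while True:
        event = json.loads(await from_server.get())
        if event["type"] == "done":
            break
        replies.append(event["for_chunk"])
    await server_task
    return replies


replies = asyncio.run(client(["aa", "bb", "cc"]))
```

The point of the sketch is that both sides keep one channel open and trade small JSON events, rather than making a separate request for each step. A real client would replace the queues with an actual WebSocket library and follow the event schema in OpenAI's documentation.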
OpenAI has already been testing the Realtime API with select partners. For instance, Speak, a language-learning app, uses it to power role-playing conversations where users can practice speaking in a foreign language. Another app, Healthify, uses the API to let users have natural conversations with Ria, an AI coach who helps with nutrition and fitness advice.
The Realtime API is available now in public beta for all paying developers. Usage is measured in tokens, with pricing that differs for text and audio. For audio, input costs $0.06 per minute and output costs $0.24 per minute.
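As a rough worked example of the per-minute audio rates quoted above, the following sketch estimates the cost of a session. The function name and the sample durations are assumptions made for illustration. Because the article notes that the API works in tokens, the result should be read as an approximation rather than an exact bill.

```python
from decimal import Decimal

# Per-minute audio prices as quoted in the article (USD).
AUDIO_INPUT_PER_MIN = Decimal("0.06")
AUDIO_OUTPUT_PER_MIN = Decimal("0.24")


def estimate_audio_cost(input_minutes: Decimal, output_minutes: Decimal) -> Decimal:
    """Estimate the audio cost of a session in USD, rounded to cents."""
    total = (input_minutes * AUDIO_INPUT_PER_MIN
             + output_minutes * AUDIO_OUTPUT_PER_MIN)
    return total.quantize(Decimal("0.01"))


# Example: 10 minutes of user speech and 5 minutes of AI speech.
cost = estimate_audio_cost(Decimal("10"), Decimal("5"))
print(cost)  # 1.80
```

At these rates, a minute of generated speech costs four times as much as a minute of user speech, so an assistant that talks at length will dominate the bill.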
In addition to the Realtime API, OpenAI will soon release audio capabilities in its Chat Completions API, which will allow developers to input and output audio or text, though at slower speeds than real-time conversations.
