OpenAI Steering Tts

Steering Tts

Export

Run Notebooks

idle

Contents

No cells yet

Add cells to see them here

Steering Text-to-Speech for more dynamic audio generation

Our traditional TTS APIs don't have the ability to steer the voice of the generated audio. For example, if you wanted to convert a paragraph of text to audio, you would not be able to give any specific instructions on audio generation.

With audio chat completions, you can give specific instructions before generating the audio. This allows you to tell the API to speak at different speeds, tones, and accents. With appropriate instructions, these voices can be more dynamic, natural, and context-appropriate.

Traditional TTS

Traditional TTS can specify voices, but not the tone, accent, or any other contextual audio parameters.

[2]

Chat Completions TTS

With chat completions, you can give specific instructions before generating the audio. In the following example, we generate a British accent in a learning setting for children. This is particularly useful for educational applications where the voice of the assistant is important for the learning experience.

[4]

Chat Completions Multilingual TTS

We can also generate audio in different language accents. In the following example, we generate audio in a specific Spanish Uruguayan accent.

[12]

Había una vez un leoncito llamado Leo que se despertó con el aroma de panqueques y huevos revueltos. Su pancita gruñía de emoción mientras corría hacia la cocina. ¡Mamá León había preparado un festín de desayuno! Leo devoró sus panqueques, sorbió su jugo de naranja y mordisqueó algunas bayas jugosas.

Conclusion

The ability to steer the voice of the generated audio opens up a lot of possibilities for richer audio experiences. There are many use cases such as:

Enhanced Expressiveness: Steerable TTS allows adjustments in tone, pitch, speed, and emotion, enabling the voice to convey different moods (e.g., excitement, calmness, urgency).
Language learning and education: Steerable TTS can mimic accents, inflections, and pronunciation, which is beneficial for language learners and educational applications where accurate intonation and emphasis are critical.
Contextual Voice: Steerable TTS adapts the voice to fit the content’s context, such as formal tones for professional documents or friendly, conversational styles for social interactions. This helps create more natural conversations in virtual assistants and chatbots.