TTS extension for multiai using OpenAI, Google GenAI and Azure Speech
multiai-tts is an extension library for multiai that provides Text-to-Speech (TTS) capabilities using OpenAI, Google GenAI, and Azure Speech.
| Provider | Strengths | Docs |
|---|---|---|
| OpenAI | Simple API | Models · Voices · API |
| Google GenAI | Emotion tags, multi-speaker | Models · Voices · API |
| Azure Speech | SSML, extensive voice selection | Voices · API |
Install multiai and run
- ai --list -o | grep tts to display OpenAI’s TTS models
- ai --list -g | grep tts to display Google GenAI’s TTS models
API key configuration
This library relies on the configuration provided by multiai. You must set up your API keys (OpenAI API Key, Google API Key, Azure TTS Key and Region) using multiai’s configuration files or environment variables before using this library.
For details on how to configure API keys, please refer to the multiai documentation.
System requirements
ffmpeg must be installed if you want to save audio in formats other than WAV (e.g., MP3).
pip install multiai-tts
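Since formats other than WAV depend on ffmpeg, it can help to check for it before choosing an output file name. The following is a minimal sketch using only the standard library; `pick_output_path` and its `ffmpeg_available` parameter are hypothetical helpers for illustration and are not part of multiai-tts.

```python
import shutil
from pathlib import Path


def pick_output_path(path, ffmpeg_available=None):
    """Return `path` if it can be written, otherwise a .wav fallback.

    Hypothetical helper: falls back to WAV when ffmpeg is not on PATH.
    """
    if ffmpeg_available is None:
        # shutil.which returns None when the executable is not found
        ffmpeg_available = shutil.which("ffmpeg") is not None
    target = Path(path)
    if target.suffix.lower() == ".wav" or ffmpeg_available:
        return str(target)
    return str(target.with_suffix(".wav"))


print(pick_output_path("output_google.mp3"))
```

The returned name could then be passed as the file argument of save_tts().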
import sys
import multiai_tts
client = multiai_tts.Prompt()
client.set_tts_model('google', 'gemini-3.1-tts-flash-preview')
client.tts_voice_google = 'charon'
# Speak directly
client.speak("Please speak the following. Hello, this is a test from Google model.")
if client.error:
    print(client.error_message)
    sys.exit(1)
# Save to file
client.save_tts("Please speak the following. Saving this audio to mp3.", "output_google.mp3")
if client.error:
    print(client.error_message)
    sys.exit(1)
import sys
import multiai_tts
client = multiai_tts.Prompt()
client.set_tts_model('openai', 'gpt-4o-mini-tts')
client.tts_voice_openai = 'marin'
# Speak directly
client.speak("Hello, this is a test from OpenAI model.")
if client.error:
    print(client.error_message)
    sys.exit(1)
# Save to file
client.save_tts("Saving this audio to mp3.", "output_openai.mp3")
if client.error:
    print(client.error_message)
    sys.exit(1)
import sys
import multiai_tts
client = multiai_tts.Prompt()
client.set_tts_provider('azure')
client.tts_voice_azure = 'en-US-JennyNeural'
# Speak directly
client.speak("Hello, this is a test from Azure TTS.")
if client.error:
    print(client.error_message)
    sys.exit(1)
# Save to file
client.save_tts("Saving this audio to mp3.", "output_azure.mp3")
if client.error:
    print(client.error_message)
    sys.exit(1)
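The three examples above repeat the same error check after every call. One way to factor it out is a small helper like the sketch below. `exit_on_error` is a hypothetical name, not a multiai-tts function; it relies only on the error and error_message attributes used in the examples, and a stand-in object is used here so the snippet runs without API keys.

```python
import sys
from types import SimpleNamespace


def exit_on_error(client, exit_code=1):
    """Print client.error_message to stderr and exit if client.error is set."""
    if client.error:
        print(client.error_message, file=sys.stderr)
        sys.exit(exit_code)


# Stand-in for a multiai_tts.Prompt() instance (assumption: only the two
# attributes shown in the examples above are needed).
fake_client = SimpleNamespace(error=False, error_message="")
exit_on_error(fake_client)  # no error is set, so execution continues
print("no error reported")
```

With a real client, the call would go right after speak() or save_tts().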
- Use set_tts_model(provider, model) to select both provider and model.
- For Azure, set_tts_provider('azure') is sufficient; the model parameter is not used.
- Prompt.get_wav() fetches the raw audio data in memory. Playback is separate from retrieval.
- After calling speak() or save_tts(), always check client.error and client.error_message.
- ffmpeg is used for converting to other formats.

Both speak() and save_tts() accept an optional prompt argument: a style
instruction (voice, tone, speed, emotion, …) that is separate from the spoken
text. The prompt is not read aloud and is not subject to chunk splitting.
client.speak(
"Hello, this is a test.",
prompt="Speak cheerfully and a little slowly.",
)
The prompt is prepended to the text before synthesis, using the same rule for every provider. Whether a style prompt helps, and how to phrase it, is up to you.
When the text is chunked (see below), the prompt is re-applied to every
chunk so the style stays consistent across the whole audio. Because the
prompt is kept separate from the body, chunk_size is measured against the
spoken text length only; the prompt length never counts toward it. Leaving
prompt empty (the default) reproduces the original behavior exactly.
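The per-chunk behavior described above can be pictured with a short sketch. This is an illustration only: `build_requests` is a hypothetical function, and joining prompt and text with a newline is an assumption, since the exact rule the library uses is not stated here.

```python
def build_requests(chunks, prompt=""):
    """Prepend the style prompt to each chunk.

    Assumption: prompt and chunk are joined with a newline; the real
    joining rule in multiai-tts may differ.
    """
    if not prompt:
        # Empty prompt: the chunks are sent unchanged
        return list(chunks)
    return [f"{prompt}\n{chunk}" for chunk in chunks]


chunks = ["First sentence.", "Second sentence."]
requests = build_requests(chunks, prompt="Speak cheerfully.")
for request in requests:
    print(repr(request))

# Chunk lengths are measured on the spoken text, before the prompt is added
print(max(len(chunk) for chunk in chunks))
```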
When the text is long, whether because it exceeds a provider’s request length
limit or because quality degrades with longer input (as is the case with some
Gemini models), speak() and save_tts() can automatically split the text into
chunks, synthesize each chunk, and join the resulting audio.
# Split into chunks of at most ~1000 characters and join the audio
client.save_tts(long_text, "output.mp3", chunk_size=1000)
if client.error:
    print(client.error_message)
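To make the "join the resulting audio" step concrete, here is a minimal sketch that concatenates several in-memory WAV files with the standard wave module. It is not the library's implementation. `make_silence` and `join_wavs` are hypothetical helpers, the silent clips stand in for synthesized chunks, and the sketch assumes every chunk shares the same channel count, sample width, and sample rate.

```python
import io
import wave


def make_silence(frames, rate=24000):
    """Build a mono 16-bit WAV of silence, standing in for one chunk."""
    buffer = io.BytesIO()
    with wave.open(buffer, "wb") as out:
        out.setnchannels(1)
        out.setsampwidth(2)
        out.setframerate(rate)
        out.writeframes(b"\x00\x00" * frames)
    return buffer.getvalue()


def join_wavs(wav_blobs):
    """Concatenate WAV byte strings that share the same audio parameters."""
    if not wav_blobs:
        raise ValueError("no audio to join")
    params = None
    frames = []
    for blob in wav_blobs:
        with wave.open(io.BytesIO(blob), "rb") as src:
            current = src.getparams()[:3]  # channels, sample width, frame rate
            if params is None:
                params = current
            elif current != params:
                raise ValueError("chunks have different audio formats")
            frames.append(src.readframes(src.getnframes()))
    buffer = io.BytesIO()
    with wave.open(buffer, "wb") as out:
        out.setnchannels(params[0])
        out.setsampwidth(params[1])
        out.setframerate(params[2])
        out.writeframes(b"".join(frames))
    return buffer.getvalue()


joined = join_wavs([make_silence(100), make_silence(250)])
with wave.open(io.BytesIO(joined), "rb") as result:
    total_frames = result.getnframes()
print(total_frames)
```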
| Parameter | Type | Default | Description |
|---|---|---|---|
| prompt | str | "" | Style instruction applied to every chunk (see Style prompts). Not part of the spoken text and not subject to splitting. |
| chunk_size | int or None | None | Maximum characters per chunk, measured against the spoken text only. None disables splitting (original behavior). |
| split_chars | str | "。..!!??\n" | Candidate split characters. The split point is just after the rightmost candidate found within chunk_size. |
| chunk_overflow | str | "extend" | Behavior when no candidate is found within chunk_size: "extend" reads on until the next candidate (or end of text); "error" sets client.error and stops. |
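The splitting rule in the table can be sketched in plain Python as follows. This is an approximation written from the parameter descriptions, not the library's code. `split_text_sketch` is a hypothetical name, its default split_chars is a simplified set, and it raises ValueError where the library is described as setting client.error.

```python
def split_text_sketch(text, chunk_size=None, split_chars="。.!?\n",
                      chunk_overflow="extend"):
    """Split text just after the rightmost split character within chunk_size."""
    if chunk_size is None:
        return [text] if text else []  # splitting disabled
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    chunks = []
    start = 0
    while start < len(text):
        if len(text) - start <= chunk_size:
            chunks.append(text[start:])  # the remainder fits in one chunk
            break
        window_end = start + chunk_size
        # Rightmost candidate inside the current window, or -1 if none
        cut = max((text.rfind(ch, start, window_end) for ch in split_chars),
                  default=-1)
        if cut == -1:
            if chunk_overflow == "error":
                raise ValueError("no split character within chunk_size")
            # "extend": read on to the next candidate, or to the end of text
            later = [i for i in (text.find(ch, window_end) for ch in split_chars)
                     if i != -1]
            cut = min(later) if later else len(text) - 1
        chunks.append(text[start:cut + 1])
        start = cut + 1
    return chunks


print(split_text_sketch("One. Two. Three.", chunk_size=10))
print(split_text_sketch("abcdefghij. klm", chunk_size=5))
```

In this sketch whitespace after a split character stays at the start of the next chunk, so joining the chunks reproduces the input exactly.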
split_text() is also exposed directly if you only need the chunk boundaries:
chunks = client.split_text(long_text, chunk_size=1000)
Caveats
- With chunk_overflow="extend", the actual chunk size may significantly exceed chunk_size.
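Because "extend" can produce chunks longer than chunk_size, it may be worth inspecting the chunk boundaries before synthesis. The sketch below assumes the chunks are available as a list of strings (for example from split_text(), whose return type is assumed here); `oversized_chunks` is a hypothetical helper.

```python
def oversized_chunks(chunks, chunk_size):
    """Return (index, length) pairs for chunks longer than chunk_size."""
    return [(index, len(chunk))
            for index, chunk in enumerate(chunks)
            if len(chunk) > chunk_size]


# Example input: the second chunk has no early split point
chunks = ["Short one.", "x" * 45 + "."]
report = oversized_chunks(chunks, chunk_size=20)
print(report)
```

A non-empty report suggests either raising chunk_size or adding more split characters for that text.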