multiai-tts

multiai-tts is an extension library for multiai that provides Text-to-Speech (TTS) capabilities using OpenAI, Google GenAI, Azure Speech, and VOICEVOX.

Supported AI providers
Prerequisites
Installation
Usage
Style prompts
Long text (automatic chunking)

Supported AI providers

Provider	Strengths	Docs
OpenAI	Simple API	Models · Voices · API
Google GenAI	Emotion tags, multi-speaker	Models · Voices · API
Azure Speech	SSML, extensive voice selection	Voices · API
VOICEVOX	Local engine, no API key, Japanese voices	Voices · API

To see which TTS models are available, install multiai and run one of the following:

ai --list -o | grep tts to display OpenAI’s TTS models
ai --list -g | grep tts to display Google GenAI’s TTS models

Prerequisites

API key configuration

This library relies on the configuration provided by multiai. You must set up your API keys (OpenAI API Key, Google API Key, Azure TTS Key and Region) using multiai’s configuration files or environment variables before using this library.

For details on how to configure API keys, please refer to the multiai documentation.

VOICEVOX does not require an API key. Instead, the VOICEVOX engine must be running locally (default http://127.0.0.1:50021).

System requirements

ffmpeg must be installed if you want to save audio in formats other than WAV (e.g., MP3).
The VOICEVOX engine must be running to use the VOICEVOX provider.
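
Both requirements can be checked before any synthesis call. The following is a minimal sketch using only the Python standard library; the helper names `ffmpeg_available` and `engine_reachable` are hypothetical and not part of multiai-tts. The second helper only tests that something accepts TCP connections on the given port, not that it is actually a VOICEVOX engine.

```python
import shutil
import socket


def ffmpeg_available() -> bool:
    """Return True if an ffmpeg executable is found on PATH."""
    return shutil.which("ffmpeg") is not None


def engine_reachable(host: str = "127.0.0.1", port: int = 50021,
                     timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to host:port can be opened.

    The defaults match the VOICEVOX address mentioned above.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Connection refused, timed out, or host unreachable.
        return False
```

For example, a script could print a warning when `ffmpeg_available()` is false and fall back to a `.wav` output file name.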

Installation

pip install multiai-tts

Usage

Google GenAI example

import sys
import multiai_tts

client = multiai_tts.Prompt()
client.set_tts_model('google', 'gemini-3.1-tts-flash-preview')
client.tts_voice_google = 'charon'

# Speak directly
client.speak("Please speak the following. Hello, this is a test from Google model.")
if client.error:
    print(client.error_message)
    sys.exit(1)

# Save to file
client.save_tts("Please speak the following. Saving this audio to mp3.", "output_google.mp3")
if client.error:
    print(client.error_message)
    sys.exit(1)

OpenAI example

import sys
import multiai_tts

client = multiai_tts.Prompt()
client.set_tts_model('openai', 'gpt-4o-mini-tts')
client.tts_voice_openai = 'marin'

# Speak directly
client.speak("Hello, this is a test from OpenAI model.")
if client.error:
    print(client.error_message)
    sys.exit(1)

# Save to file
client.save_tts("Saving this audio to mp3.", "output_openai.mp3")
if client.error:
    print(client.error_message)
    sys.exit(1)

Azure TTS example

import sys
import multiai_tts

client = multiai_tts.Prompt()
client.set_tts_provider('azure')
client.tts_voice_azure = 'en-US-JennyNeural'

# Speak directly
client.speak("Hello, this is a test from Azure TTS.")
if client.error:
    print(client.error_message)
    sys.exit(1)

# Save to file
client.save_tts("Saving this audio to mp3.", "output_azure.mp3")
if client.error:
    print(client.error_message)
    sys.exit(1)

VOICEVOX example

VOICEVOX runs as a local engine (default http://127.0.0.1:50021) and must already be running. No API key is required. The speaker is selected by its integer style ID via tts_voice_voicevox.

import sys
import multiai_tts

client = multiai_tts.Prompt()
client.set_tts_provider('voicevox')
client.tts_voice_voicevox = 3          # speaker style ID
# client.tts_voicevox_url = "http://127.0.0.1:50021"  # default; override if needed

# Speak directly
client.speak("こんにちは。VOICEVOX のテストです。")  # "Hello. This is a VOICEVOX test."
if client.error:
    print(client.error_message)
    sys.exit(1)

# Save to file
client.save_tts("音声をファイルに保存します。", "output_voicevox.mp3")  # "Saving the audio to a file."
if client.error:
    print(client.error_message)
    sys.exit(1)
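
To find a style ID to assign to tts_voice_voicevox, one option is to ask the running engine for its speaker list. The sketch below is not part of multiai-tts. It assumes the engine exposes a `/speakers` endpoint returning a JSON list in which each speaker has a `name` and a list of `styles` with `name` and `id` fields; verify this against the VOICEVOX engine documentation for your version. The function names are hypothetical.

```python
import json
import urllib.request


def flatten_styles(speakers):
    """Turn a speaker list into sorted (style_id, speaker_name, style_name) rows.

    Assumes each speaker is a dict with "name" and "styles", and each style
    is a dict with "name" and "id" (an assumption about the engine's JSON).
    """
    rows = []
    for speaker in speakers:
        for style in speaker.get("styles", []):
            rows.append((style["id"], speaker["name"], style["name"]))
    return sorted(rows)


def fetch_speakers(base_url="http://127.0.0.1:50021", timeout=5.0):
    """Fetch the speaker list from a running engine (assumed /speakers endpoint)."""
    with urllib.request.urlopen(f"{base_url}/speakers", timeout=timeout) as resp:
        return json.load(resp)


# Usage with a running engine:
#   for style_id, speaker, style in flatten_styles(fetch_speakers()):
#       print(style_id, speaker, style)
```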

Notes

For OpenAI and Google TTS, use set_tts_model(provider, model) to select both provider and model.
For Azure, set_tts_provider('azure') is sufficient; the model parameter is not used.
For VOICEVOX, set_tts_provider('voicevox') is sufficient; no API key or model is used. The engine must already be running — if it is not reachable, client.error is set with a message asking whether the engine is running.
The Google example prefixes the text with “Please speak the following.”; the OpenAI and Azure examples do not. Whether this phrase is needed depends on the model you use.
Prompt.get_wav() fetches the raw audio data in memory without playing it; playback is a separate step.
Error handling: After speak() or save_tts(), always check client.error and client.error_message.
WAV is the default output format; ffmpeg is used to convert to other formats.
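
The repeated error check in the examples above can be folded into a small helper. This is a sketch, not something multiai-tts provides: `TTSError` and `raise_on_error` are hypothetical names, and the helper relies only on the documented `client.error` and `client.error_message` attributes.

```python
class TTSError(RuntimeError):
    """Raised when a client reports an error after a TTS call."""


def raise_on_error(client) -> None:
    """Raise TTSError if client.error is truthy, using client.error_message."""
    if getattr(client, "error", False):
        message = getattr(client, "error_message", "") or "unknown TTS error"
        raise TTSError(message)


# Usage with a multiai_tts.Prompt() instance named client:
#   client.speak("Hello, this is a test.")
#   raise_on_error(client)
```

Raising an exception instead of calling sys.exit(1) lets the caller decide whether to retry, switch providers, or abort.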

Style prompts

Both speak() and save_tts() accept an optional prompt argument: a style instruction (voice, tone, speed, emotion, …) that is separate from the spoken text. The prompt is not read aloud and is not subject to chunk splitting.

client.speak(
    "Hello, this is a test.",
    prompt="Speak cheerfully and a little slowly.",
)

Before synthesis, the prompt is prepended to the text, using the same rule for every provider. Whether a style prompt helps, and how to phrase it, is up to you.

When the text is chunked (see below), the prompt is re-applied to every chunk so the style stays consistent across the whole audio. Because the prompt is kept separate from the body, chunk_size is measured against the spoken text length only — the prompt length never eats into it. Leaving prompt empty (the default) reproduces the original behavior exactly.
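
The relationship between prompt and chunks can be pictured with a few lines of Python. This is an illustration of the behavior described above, not the library's code: the function name `apply_prompt` is hypothetical, and the newline separator between prompt and chunk is an assumption, since the exact joining rule is not documented here.

```python
def apply_prompt(prompt: str, chunks: list[str], sep: str = "\n") -> list[str]:
    """Return one request string per chunk, with the prompt prepended to each.

    An empty prompt leaves the chunks unchanged. The separator is an
    assumption made for this sketch.
    """
    if not prompt:
        return list(chunks)
    return [f"{prompt}{sep}{chunk}" for chunk in chunks]
```

Because the chunks are cut before the prompt is attached, the prompt length does not change where the text is split.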

Long text (automatic chunking)

When the text is long — whether it exceeds a provider’s request length limit or degrades in quality with longer input (as is the case with some Gemini models) — speak() and save_tts() can automatically split the text into chunks, synthesize each chunk, and join the resulting audio.

# Split into chunks of at most ~1000 characters and join the audio
client.save_tts(long_text, "output.mp3", chunk_size=1000)
if client.error:
    print(client.error_message)

Parameter	Type	Default	Description
`prompt`	`str`	`""`	Style instruction applied to every chunk (see Style prompts). Not part of the spoken text and not subject to splitting.
`chunk_size`	`int` or `None`	`None`	Maximum characters per chunk, measured against the spoken text only. `None` disables splitting (original behavior).
`split_chars`	`str`	`"。．.!！?？\n"`	Candidate split characters. The split point is just after the rightmost candidate found within `chunk_size`.
`chunk_overflow`	`str`	`"extend"`	Behavior when no candidate is found within `chunk_size`: `"extend"` reads on until the next candidate (or end of text); `"error"` sets `client.error` and stops.

split_text() is also exposed directly if you only need the chunk boundaries:

chunks = client.split_text(long_text, chunk_size=1000)
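
The splitting rule in the table can be sketched in plain Python as follows. This is an illustration written from the parameter descriptions, not the library's implementation: `split_text_sketch` is a hypothetical name, it raises `ValueError` where the library sets `client.error`, and it keeps whitespace exactly as found (whether the library trims chunks is not stated here).

```python
DEFAULT_SPLIT_CHARS = "。．.!！?？\n"


def split_text_sketch(text, chunk_size=None, split_chars=DEFAULT_SPLIT_CHARS,
                      chunk_overflow="extend"):
    """Split text just after the rightmost split character within chunk_size."""
    if chunk_size is None:
        return [text]  # splitting disabled
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    chunks = []
    pos = 0
    while pos < len(text):
        if len(text) - pos <= chunk_size:
            chunks.append(text[pos:])  # the remainder fits in one chunk
            break
        window = text[pos:pos + chunk_size]
        # Rightmost candidate inside the window, or -1 if there is none.
        cut = max((window.rfind(ch) for ch in split_chars), default=-1)
        if cut >= 0:
            end = pos + cut + 1
        elif chunk_overflow == "extend":
            # Read on until the next candidate, or to the end of the text.
            found = [i for i in (text.find(ch, pos + chunk_size)
                                 for ch in split_chars) if i >= 0]
            end = min(found) + 1 if found else len(text)
        else:
            raise ValueError("no split character found within chunk_size")
        chunks.append(text[pos:end])
        pos = end
    return chunks
```

With `chunk_size=15`, the text `"Hello world. How are you? Fine."` is cut after the period and after the question mark, and joining the chunks gives back the original text.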

Caveats

At chunk boundaries, pitch, tempo, and trailing reverberation may shift slightly. This is inherent to the TTS APIs, because each chunk is synthesized independently. No silence is inserted between chunks.
Some providers (e.g. Gemini 3.1 Flash TTS) are known to degrade in quality on long inputs even within API limits; chunking can mitigate this.
With chunk_overflow="extend", the actual chunk size may significantly exceed chunk_size.
Per-provider API character limits are not managed by this library — you are responsible for choosing an appropriate chunk_size.
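
Since chunk_overflow="extend" can produce chunks longer than chunk_size, it may be worth checking the chunk lengths yourself before synthesis. The helper below is a hypothetical sketch, not part of multiai-tts. The `limit` value is whatever character limit you have looked up for your provider; no particular number is implied here.

```python
def oversized_chunks(chunks, limit):
    """Return (index, length) for every chunk longer than limit characters."""
    return [(i, len(chunk)) for i, chunk in enumerate(chunks) if len(chunk) > limit]


# Usage with the split_text() call shown earlier:
#   chunks = client.split_text(long_text, chunk_size=1000)
#   too_long = oversized_chunks(chunks, limit=my_provider_limit)
#   if too_long:
#       print("Chunks exceeding the limit:", too_long)
```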