Batch TTS is dramatically simpler than streaming — one HTTP call returns one audio file. Use it for any pre-rendered content: IVR prompts, voicemail greetings, podcast generation, video voiceover, accessibility playback, audio notifications.
Batch TTS via REST (OpenAI example)
Python: synchronous batch TTS
from openai import OpenAI

client = OpenAI(api_key="YOUR_API_KEY")

response = client.audio.speech.create(
    model="tts-1-hd",
    voice="nova",
    input="Welcome to Team-Connect. Please hold while I find an agent for you.",
    response_format="mp3",
    speed=1.0,
)

with open("greeting.mp3", "wb") as f:
    f.write(response.content)
That's it. The full audio file is in greeting.mp3 after the call returns. Generation typically takes 1-3 seconds for short utterances. For longer text (over a few hundred words), use the SDK's streaming response mode, which writes the audio to disk as it arrives rather than holding the whole file in memory:
Python: streaming batch TTS for long text
# Assumes `client` from the previous example and `long_text` holding the script to render.
with client.audio.speech.with_streaming_response.create(
    model="tts-1-hd",
    voice="nova",
    input=long_text,
    response_format="mp3",
) as response:
    response.stream_to_file("output.mp3")
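One practical limit applies to long text: the OpenAI speech endpoint caps `input` at 4096 characters per request, so a very long script has to be split before rendering. The sketch below shows one possible way to do that, breaking at sentence boundaries. The helper name `split_for_tts` and the 4000-character default are illustrative assumptions, not part of the SDK, and the sentence detection is deliberately naive.

```python
import re


def split_for_tts(text: str, max_chars: int = 4000) -> list[str]:
    """Split text into chunks of at most max_chars, preferring sentence boundaries."""
    # Naive sentence split: break after ., ! or ? when followed by whitespace.
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks: list[str] = []
    current = ""
    for sentence in sentences:
        # Hard-wrap any single sentence that is itself over the limit.
        while len(sentence) > max_chars:
            if current:
                chunks.append(current)
                current = ""
            chunks.append(sentence[:max_chars])
            sentence = sentence[max_chars:]
        candidate = f"{current} {sentence}".strip()
        if len(candidate) <= max_chars:
            current = candidate  # sentence still fits in the current chunk
        else:
            chunks.append(current)  # current chunk is full; start a new one
            current = sentence
    if current:
        chunks.append(current)
    return chunks
```

Each chunk would then be rendered with its own API call and the resulting audio files joined in order.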
Cache pre-rendered audio aggressively
If you have static prompts that play frequently — greetings, hold messages, error confirmations — render them once with batch TTS, store the audio file, and serve it from disk or CDN forever. Each cached prompt saves a TTS API call per use, which adds up to real money at scale.
Python: pre-render + cache pattern
import hashlib
import os

CACHE_DIR = "/var/cache/tts"

def get_or_render(text: str, voice: str = "nova") -> str:
    """Return path to cached audio file, rendering if necessary."""
    # Assumes `client` is the OpenAI client created in the earlier example.
    key = hashlib.sha256(f"{voice}:{text}".encode()).hexdigest()
    path = os.path.join(CACHE_DIR, f"{key}.mp3")
    if os.path.exists(path):
        return path  # cache hit: no API call
    response = client.audio.speech.create(
        model="tts-1-hd", voice=voice, input=text, response_format="mp3"
    )
    os.makedirs(CACHE_DIR, exist_ok=True)  # create the cache directory on first use
    with open(path, "wb") as f:
        f.write(response.content)
    return path

# Usage:
audio_path = get_or_render("Welcome to Team-Connect. How can I help?")
play_audio_file(audio_path)  # your own playback function; served from disk, ~1ms
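The key above hashes voice and text, which is sufficient while the model, format and speed stay fixed. If any of those become parameters, they belong in the key too, otherwise the cache can serve audio rendered with old settings. Writing directly to the final path can also leave a truncated file behind if the process dies mid-write. The following is a minimal sketch of both refinements using only the standard library; the function names `cache_key` and `write_atomic` are hypothetical.

```python
import hashlib
import json
import os
import tempfile


def cache_key(text: str, voice: str, model: str, fmt: str = "mp3", speed: float = 1.0) -> str:
    """Hash every parameter that changes the rendered audio."""
    payload = json.dumps(
        {"text": text, "voice": voice, "model": model, "format": fmt, "speed": speed},
        sort_keys=True,
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def write_atomic(path: str, data: bytes) -> None:
    """Write to a temp file in the same directory, then rename into place."""
    directory = os.path.dirname(path) or "."
    os.makedirs(directory, exist_ok=True)
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(data)
        os.replace(tmp_path, path)  # atomic when source and target share a filesystem
    except BaseException:
        # Remove the partial temp file before re-raising.
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise
```

With this in place, a reader of the cache either sees a complete file or no file at all, so a plain existence check remains a safe cache-hit test.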
Self-hosted TTS
For air-gapped, data-sovereignty or high-volume use cases, open-source TTS has matured significantly:
Python: Piper for fast on-CPU TTS
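# pip install piper-tts
import wave
from piper.voice import PiperVoice

# Path to a downloaded voice model; its .onnx.json config is expected alongside it.
voice = PiperVoice.load("en_GB-alba-medium.onnx")
with wave.open("output.wav", "wb") as wav:
    # piper-tts 1.2.x call; newer releases name this synthesize_wav(), so check your installed version.
    voice.synthesize("Hello from Team-Connect.", wav)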
Python: Kokoro (Apache-licensed, ~80M parameters, near-commercial quality)
# pip install kokoro soundfile
from kokoro import KPipeline
import soundfile as sf

pipeline = KPipeline(lang_code='b')  # 'b' = British English
generator = pipeline(
    "Hello from Team-Connect. How can I help today?",
    voice='bf_emma',
    speed=1.0,
)
# The pipeline yields one (graphemes, phonemes, audio) tuple per text segment.
for i, (graphemes, phonemes, audio) in enumerate(generator):
    sf.write(f"output-{i}.wav", audio, 24000)  # Kokoro outputs 24 kHz audio
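The loop above writes one WAV file per segment that Kokoro yields. When a single file is wanted, the segments can be joined afterwards. Here is a minimal sketch using only the standard-library `wave` module. It assumes the segment files are PCM WAVs that share the same sample rate, channel count and sample width (soundfile writes 16-bit PCM for WAV by default), and the helper name `concat_wavs` is hypothetical.

```python
import wave


def concat_wavs(segment_paths: list[str], out_path: str) -> None:
    """Join PCM WAV files with identical format into one file."""
    if not segment_paths:
        raise ValueError("no segments to join")
    with wave.open(out_path, "wb") as out:
        fmt = None
        for path in segment_paths:
            with wave.open(path, "rb") as seg:
                seg_fmt = (seg.getnchannels(), seg.getsampwidth(), seg.getframerate())
                if fmt is None:
                    # The first segment defines the output format.
                    fmt = seg_fmt
                    out.setnchannels(fmt[0])
                    out.setsampwidth(fmt[1])
                    out.setframerate(fmt[2])
                elif seg_fmt != fmt:
                    raise ValueError(f"{path} has a different audio format")
                out.writeframes(seg.readframes(seg.getnframes()))
```

For example, `concat_wavs(["output-0.wav", "output-1.wav"], "output.wav")` would produce one file containing both segments back to back.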
Other self-hosted options worth knowing
- Piper — small, fast, runs on CPU and even Raspberry Pi. The default for embedded and edge.
- Kokoro — 2024 release, Apache-licensed, ~80M parameters, quality close to commercial providers. The current sweet spot for self-hosting.
- F5-TTS — 2024 diffusion-based model with strong voice cloning from short samples. Heavier but high-quality.
- OuteTTS — uses LLM-style autoregressive token generation, supports voice cloning, MIT licensed.
- Coqui XTTS-v2 — multilingual (17 languages), voice cloning; formerly commercial, now community-maintained.
- Suno Bark — expressive but slow and prone to hallucination on long inputs; better for short emotional bursts than long-form.