Team-Connect Engineering Guide · Updated 3 May 2026

TTS Configuration

Q: What is TTS (text-to-speech)?

TTS (text-to-speech) is the technology that converts written text into spoken audio. Modern TTS uses neural networks - typically transformer or diffusion-based architectures - to synthesise speech that is increasingly indistinguishable from human recordings. TTS is the output half of voice AI: after a language model generates a response, TTS turns that response into audio the user hears. It is the counterpart to ASR (automatic speech recognition), which handles the input direction.

Q: What is streaming TTS?

Streaming TTS starts emitting audio bytes as soon as the first words are synthesised, rather than waiting for the full text to complete. This collapses time-to-first-audio from seconds to 100-300ms, which is essential for voice AI feeling conversational. Streaming TTS providers like Cartesia and Deepgram Aura achieve under 100ms first-byte latency; ElevenLabs streaming runs around 200-400ms. Without streaming TTS, you cannot build a real-time voice AI - users will perceive every response as slow.

Q: What is SSML and do I need it?

SSML (Speech Synthesis Markup Language) is an XML-based standard for controlling how TTS pronounces text. Common SSML elements include break (insert pauses), emphasis (stress words), prosody (change rate, pitch, volume), say-as (interpret text as numbers, dates, telephone numbers, currency), and phoneme (specify exact pronunciation). Most major TTS providers support a subset of SSML; ElevenLabs uses its own simpler annotation. SSML is essential for high-quality TTS in production - without it, names, numbers, dates, addresses and abbreviations are mispronounced regularly.

Q: What latency should I expect from real-time TTS?

For real-time voice AI, you need time-to-first-audio under 200ms - any longer and the conversation feels sluggish. Cartesia Sonic and Deepgram Aura deliver 40-100ms first-byte latency. ElevenLabs streaming runs 200-400ms with their Turbo model. OpenAI TTS streaming is 300-600ms. The metric that matters is not generation speed (real-time factor) but first-byte latency - how fast the first audio chunk reaches the user's ear. After the first chunk, subsequent chunks should arrive faster than playback so the audio never stutters.

Q: Can I run TTS without sending text to the cloud?

Yes. Open-source TTS has improved significantly: Piper (Rhasspy project) is small and runs on CPU, useful for embedded; Kokoro is a 2024 Apache-licensed model rivalling commercial quality at ~80M parameters; F5-TTS is a 2024 diffusion model with strong voice cloning; Coqui TTS (now community-maintained) and OuteTTS are also viable. Self-hosting trades operational complexity for data sovereignty and per-character cost savings at high volumes. For most voice AI products, managed TTS is dramatically simpler and competitive on cost up to millions of characters per month.

A practical engineer's guide to TTS configuration — how text-to-speech providers compare on naturalness and latency, what streaming versus batch actually means in production, the SSML controls that turn robotic delivery into convincing speech, and the latency budgets that make voice AI feel conversational.

Covers ElevenLabs, Cartesia, OpenAI, Deepgram Aura · Real-time + batch · SSML + voice cloning

Jump to a section

01 · Foundations

What is TTS?

Definition, neural vs older approaches, where it sits in voice AI.

02 · Decision

Choosing a TTS Provider

ElevenLabs, Cartesia, OpenAI, Deepgram, Google, Azure compared honestly.

03 · Decision

Streaming vs Batch TTS

Why time-to-first-audio is the only latency number that matters.

04 · Voices

Voice Selection & Cloning

Stock voices, custom voices, voice cloning ethics, brand consistency.

05 · Code

Real-Time TTS Setup

Working WebSocket streaming code with chunk handling and barge-in.

06 · Code

Batch / File TTS Setup

REST API examples and self-hosted Piper/Kokoro code.

07 · Control

SSML & Prosody Control

break, emphasis, prosody, say-as, phoneme — with real examples.

08 · Quality

MOS & Quality Metrics

How TTS quality is actually measured, what's good in 2026.

09 · Architecture

TTS in Voice AI

Latency budgets, barge-in, streaming chunks, audio buffering.

10 · Debug

Troubleshooting

Mispronunciation, robotic delivery, audio glitches, latency spikes.

01 · What is TTS (Text-to-Speech)?

TTS (text-to-speech) — also called speech synthesis — is the technology that converts written text into spoken audio. In voice AI applications it is the output stage: after the LLM generates a response, TTS turns that response into the voice the user actually hears. TTS quality is what makes the difference between a voice AI that feels human and one that feels like a robotic phone tree from 2008.

How TTS has evolved

TTS has gone through three generations, each fundamentally different from the last:

Concatenative TTS (1990s-2010s) - record a human voice actor saying thousands of phonetic units, stitch them together at runtime. Sounded acceptable on common phrases but obviously synthetic on anything novel.
Parametric / HMM TTS (2000s-2015) - model speech as statistical parameters and render audio through a vocoder. Smoother than concatenative but uniformly robotic, in the same spirit as the earlier rule-based formant synthesisers behind the "Stephen Hawking voice".
Neural TTS (2016-now) - deep neural networks (WaveNet, Tacotron, FastSpeech, VITS, recent diffusion-based architectures) learn end-to-end from text to audio waveforms. Quality has crossed the threshold where casual listeners cannot reliably distinguish it from human recordings.

Every credible TTS provider in 2026 uses neural TTS — the older approaches survive only in legacy embedded devices and accessibility software.

Where TTS fits in a voice AI stack

TTS is the final stage of the voice AI pipeline:

Audio capture from phone (SIP/PSTN) or browser (WebRTC).
Audio encoding (Opus for WebRTC; G.711 µ-law or A-law for telephony).
Voice recognition (ASR) — audio → text.
LLM — text → response text.
TTS — response text → audio. This is what this guide covers.
Audio playback back through the same channel.

The latency reality: for voice AI to feel conversational, the entire user-stops-speaking-to-AI-starts-speaking gap should be under 1 second. TTS gets a budget of 100-200ms for time-to-first-audio — the moment the first audio bytes leave your server. After that first chunk arrives, subsequent chunks need to keep up with playback. Get TTS first-byte latency under 200ms and your voice AI will feel snappy; let it drift over 500ms and the conversation feels broken regardless of how good your LLM is.

02 · Choosing a TTS Provider

The TTS market in 2026 has fragmented into specialised providers optimised for different priorities. Quality differences between the top tier are small enough that latency, pricing model and feature support (voice cloning, language coverage, SSML depth) usually drive the decision more than raw naturalness.

Honest provider comparison

Provider	Strongest at	Weakest at	Best for
ElevenLabs	Naturalness, emotional expressiveness, voice cloning quality, large voice library	Cost (premium pricing per character), latency higher than purpose-built voice-AI providers	Premium voice products, audiobooks, brand voices, content creation
Cartesia (Sonic)	Lowest latency in the market (40-90ms first-byte), purpose-built for real-time voice AI	Smaller voice library than ElevenLabs, fewer languages	Real-time voice agents, live translation, conversational AI
Deepgram Aura	Sub-100ms latency, tight integration with Deepgram ASR for end-to-end voice AI	Smaller voice selection than competitors, English-focused	Voice AI products already using Deepgram for ASR
OpenAI TTS (tts-1, tts-1-hd, gpt-4o-mini-tts)	Cheap per character, decent quality, simple API	Less expressive than ElevenLabs, fewer voices, no voice cloning	High-volume utility TTS, summarisation playback, accessibility
Google Cloud TTS	50+ languages, mature platform, GCP integration, Studio voices for premium quality	Latency higher than streaming-first providers, pricing complex across model tiers	Multilingual products, GCP-native pipelines, IVR systems
Azure AI Speech	140+ languages and locales, custom neural voice training, enterprise compliance	SDK quality varies by language, custom voice training requires significant audio data	Microsoft-centric enterprises, multilingual call centres, regulated industries
Amazon Polly	AWS-native, mature, generative voices added in 2024 are competitive on quality	Older neural voices noticeably behind newer providers, slower release cadence	AWS-native pipelines, batch-heavy workloads, established enterprises
Play.ht	Voice cloning, content-creator focus, large voice marketplace	Latency not optimised for real-time, more product than platform	Marketing content, podcast production, voiceover
Resemble AI	Voice cloning quality, on-prem option, deepfake detection bundled	Smaller ecosystem than ElevenLabs	Custom branded voices, regulated voice cloning use cases

The decision framework

Decide along three axes, in this order:

Real-time or pre-rendered? A 200ms first-byte target for real-time voice AI narrows the field to Cartesia, Deepgram Aura and Google Realtime, with ElevenLabs Turbo (200-400ms by this guide's own figures) on the borderline. Anything else is too slow for live conversation.
Premium naturalness or cost-efficiency? ElevenLabs and Cartesia are the premium tier (more expressive, more expensive). OpenAI TTS and Amazon Polly are the cost tier (acceptable quality, fraction of the price). The premium tier is roughly 5-10x more expensive per character.
Voice cloning needed? If yes, ElevenLabs (instant cloning from short samples), Cartesia (instant cloning), Resemble AI, or Play.ht. Most other providers don't support cloning.

What Team-Connect uses, and why: our voice AI agents use ElevenLabs for production calls. The decision drivers were naturalness (the voice has to sound human enough that customers don't immediately disengage), the multi-voice library (different agents can have different voice identities), and the streaming API quality. We accept the higher cost in exchange for the call-completion uplift that genuinely natural voices produce. For high-volume non-interactive use cases (voicemail summaries, bulk notifications) we sometimes drop to OpenAI TTS to manage cost. This kind of "premium-streaming + cheap-batch" split is common in voice AI operations.

03 · Streaming vs Batch TTS

The single most important architectural decision in TTS configuration is streaming versus batch. The difference is bigger than it sounds — it determines whether your voice AI feels conversational or feels like a podcast that takes 3 seconds to start playing.

Streaming TTS

The provider starts emitting audio bytes as soon as the first words are synthesised, rather than waiting for the entire input text to render. The protocol is usually WebSocket or HTTP chunked transfer encoding. Time-to-first-audio drops from "a few seconds" to "under 200ms" with the right provider, which is the difference between a voice AI that feels broken and one that feels conversational.

Time-to-first-audio: 40-300ms depending on provider.
Use cases: live voice AI, real-time voice agents, conversational interfaces, live translation.
Cost: generally same per character as batch, sometimes slightly higher.
Quality: identical to batch in modern systems — the model produces the same waveform either way; streaming just delivers it incrementally.

Batch (non-streaming) TTS

You POST text to an endpoint and receive a complete audio file back when synthesis is done. Time-to-first-audio is the entire generation duration plus network transit; for a 5-second utterance that might be 1-3 seconds total before any audio plays.

Time-to-first-audio: 1-5 seconds typically; depends on text length and provider.
Use cases: pre-rendered prompts (IVR menus, voicemail greetings), audiobook generation, video voiceover, podcast production, accessibility playback.
Cost: sometimes slightly cheaper per character than streaming.
Implementation: dramatically simpler — one HTTP call, one audio file, done.

The trade-off table

Aspect	Streaming	Batch
Protocol	WebSocket or HTTP chunked	REST + binary response
Time to first audio	40-300ms	1-5 seconds
Required for...	Live conversation, voice AI, real-time	Pre-rendered audio, offline use
Quality difference	Same as batch	Reference
Implementation complexity	Higher (chunk handling, buffering, error recovery)	Lower (one HTTP call)
Cost difference	Same or marginally higher per character	Same or marginally lower
Bandwidth pattern	Sustained low-bandwidth	One large response download
Cache friendliness	Hard to cache — each session unique	Easy to cache by text hash

The "prerender + stream" pattern

Many production voice AI systems combine both:

Frequently-used utterances are pre-rendered with batch TTS once and cached: "Hi, you've reached X. How can I help today?", error messages, common confirmations.
Dynamic responses are streamed at runtime as they emerge from the LLM.

This gives instant playback for the common openings and seamless transitions to streamed dynamic content. The user perceives the entire experience as fast because the first 2 seconds play from cache while the dynamic stream is still starting up.
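
The routing decision behind this pattern can be sketched in a few lines. This is a minimal illustration, not a production cache: the function names (utterance_key, route_utterance), the in-memory dictionary and the file path are all hypothetical, and a real system would also include the model name and audio format in the key.

```python
import hashlib
from typing import Dict, Tuple


def utterance_key(voice: str, text: str) -> str:
    """Hash the voice and whitespace-normalised text into a cache key."""
    normalised = " ".join(text.split())
    return hashlib.sha256(f"{voice}:{normalised}".encode("utf-8")).hexdigest()


def route_utterance(voice: str, text: str, prerendered: Dict[str, str]) -> Tuple[str, str]:
    """Return ("cached", audio_path) on a cache hit, else ("stream", text)."""
    path = prerendered.get(utterance_key(voice, text))
    if path is not None:
        return ("cached", path)
    return ("stream", text)


# Hypothetical cache, populated once by a batch TTS job.
GREETING = "Hi, you've reached X. How can I help today?"
prerendered = {utterance_key("nova", GREETING): "/var/cache/tts/greeting.mp3"}

print(route_utterance("nova", GREETING, prerendered))
print(route_utterance("nova", "Your order ships on Tuesday.", prerendered))
```

Keying on the voice as well as the text means a voice change never serves stale audio; anything that misses the cache simply falls through to the streaming path.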

The metric that matters is first-byte latency, not generation speed. Many TTS providers advertise "real-time factor" (RTF) - how fast they generate audio relative to playback. RTF of 0.1 means 1 second of audio generates in 100ms. RTF is largely irrelevant for voice AI; what matters is when the first byte hits the wire. A provider with great RTF but a slow first chunk feels just as broken as a slow provider. Always measure first-byte time-to-audio, not aggregate generation time.
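
To make the distinction concrete, here is a small sketch that measures both numbers over any iterable of raw PCM chunks. It assumes 16kHz 16-bit mono audio; measure_stream and fake_stream are illustrative names, and fake_stream is only a stand-in that sleeps instead of calling a real provider.

```python
import time
from typing import Callable, Dict, Iterable, Iterator


def measure_stream(
    chunks: Iterable[bytes],
    sample_rate: int = 16000,
    bytes_per_sample: int = 2,
    clock: Callable[[], float] = time.perf_counter,
) -> Dict[str, float]:
    """Time a stream of raw mono PCM chunks; return first-byte latency and RTF."""
    start = clock()
    first_byte_at = None
    total_bytes = 0
    for chunk in chunks:
        if first_byte_at is None and chunk:
            first_byte_at = clock()  # moment the first audio bytes arrived
        total_bytes += len(chunk)
    end = clock()
    if first_byte_at is None:
        raise ValueError("stream produced no audio")
    audio_seconds = total_bytes / (sample_rate * bytes_per_sample)
    return {
        "first_byte_ms": (first_byte_at - start) * 1000.0,
        "rtf": (end - start) / audio_seconds,  # generation time / playback time
    }


def fake_stream(first_delay_s: float, gap_s: float, n_chunks: int) -> Iterator[bytes]:
    """Hypothetical stand-in for a provider: sleeps, then yields 100ms chunks."""
    time.sleep(first_delay_s)
    for i in range(n_chunks):
        if i:
            time.sleep(gap_s)
        yield b"\x00" * 3200  # 100ms of 16kHz 16-bit mono silence


# Two simulated providers with similar total generation time.
slow_start = measure_stream(fake_stream(0.10, 0.0, 10))
fast_start = measure_stream(fake_stream(0.01, 0.01, 10))
print("slow start:", slow_start)
print("fast start:", fast_start)
```

Both simulated streams should report a similar RTF, yet their first-byte latency differs by roughly a factor of ten, which is the number a caller would actually feel.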

04 · Voice Selection and Voice Cloning

The voice you pick is one of the most consequential brand decisions in a voice AI product. It is the literal voice of your company — users will hear it more than any other piece of UI you ever build. There are three categories of choice.

Stock voices

Every TTS provider ships a library of pre-built voices: ElevenLabs has hundreds, OpenAI ships 9 named voices, Google has 380+ across languages. Stock voices are the cheapest and fastest to deploy: pick one from the catalogue, reference its ID in your API call, and you're done. Stock voices are typically licensed for commercial use as part of a paid TTS subscription, so separate rights questions rarely arise, but check your plan's terms before launch.

Choosing a stock voice well: pick by listening to long samples (10+ seconds) reading content similar to yours, not the curated 3-second demos. Test how the voice handles your domain vocabulary — product names, place names, technical terms — because that's where stock voices most often fail. Test how it handles long sentences with subordinate clauses, where some voices lose intonation coherence.

Custom voices (fine-tuned)

Some providers (Azure Custom Neural Voice, ElevenLabs Professional Voice Cloning, Resemble) let you train a custom voice on your own recordings. You record an actor or talent reading 30-60 minutes of script content; the provider trains a model that synthesises new content in that voice. The result sounds excellent, but it takes weeks of voice-actor and engineering time and costs thousands to tens of thousands.

Use cases: distinctive brand voices for major companies (Amtrak's announcement voice, Australia's Service NSW), audiobooks with consistent narrator voice, IVR systems where one specific voice is the brand. Overkill for typical voice AI products; reach for stock voices first.

Voice cloning (instant)

The newest and most controversial category. Providers like ElevenLabs and Cartesia can clone a target speaker from as little as 30 seconds of clean audio. The cloned voice can then synthesise any new content in that speaker's voice and style.

Legitimate uses include:

Replicating a CEO or founder's voice for company-wide voicemail greetings.
Voice continuity for content creators (consistent voice across episodes when the creator can't always record).
Accessibility — restoring the voice of someone who has lost the ability to speak, from old recordings.
Voice acting marketplaces where actors license their voices for commercial use.

The ethics of voice cloning are real. Cloning someone's voice without their explicit consent is impersonation, regardless of how casually the technology now allows it. Reputable providers (ElevenLabs, Cartesia, Resemble) require consent attestation when cloning a voice and screen for celebrity or political figure attempts. Voice-likeness legislation is being written: the US Tennessee ELVIS Act (2024), provisions in the EU AI Act, and proposed federal NO FAKES Act all address voice impersonation. For commercial use, only clone voices where you have written consent on file, and never clone the voice of someone the listener might mistake for a real person making a real statement.

Voice consistency across sessions

One subtle issue: most TTS providers do not guarantee acoustic consistency across separate API calls. The same voice ID with the same text can produce slightly different intonation between calls. For most voice AI this is fine — humans don't sound identical between phone calls either — but for use cases like audiobook chapter consistency, audio splicing, or A/B testing, look for "deterministic" or "fixed seed" options if your provider offers them. ElevenLabs has a stability slider; Cartesia exposes generation seeds; OpenAI is mostly deterministic by default.

05 · Setting Up Real-Time TTS (Streaming)

Streaming TTS is the same protocol shape across providers: open a connection, send text, receive audio chunks. The differences are URL, headers, audio format and chunk schema. Here is a working ElevenLabs streaming example — the same pattern translates to Cartesia, Deepgram Aura and OpenAI streaming with minor tweaks.

The shape of a streaming TTS integration

Open a WebSocket (or chunked HTTP request) to the provider with your API key, voice ID and audio format parameters.
Send text chunks as they emerge from your LLM (or the full text in one go).
Receive audio chunks as binary WebSocket messages or HTTP body chunks. Each chunk is typically 20-200ms of audio.
Pipe each audio chunk to your output (speaker, phone leg, browser audio element) as it arrives.
Send the close-stream signal when you're done sending text.

Working JavaScript example (ElevenLabs streaming)

Real-time streaming TTS over WebSocket

const YOUR_API_KEY = "<your ElevenLabs API key>"; // load from your secret store in real code
const VOICE_ID = "21m00Tcm4TlvDq8ikWAM"; // pick from your voice library
const url = `wss://api.elevenlabs.io/v1/text-to-speech/${VOICE_ID}/stream-input`
  + `?model_id=eleven_turbo_v2_5`
  + `&output_format=pcm_16000`;       // raw 16kHz PCM, easy to pipe anywhere

const ws = new WebSocket(url);

ws.onopen = () => {
  // Send config first (must be first message)
  ws.send(JSON.stringify({
    text: " ",                            // single space primes the stream
    voice_settings: { stability: 0.5, similarity_boost: 0.75 },
    xi_api_key: YOUR_API_KEY
  }));

  // Now send the actual text. Send incrementally as it streams from your LLM.
  // End each text chunk with a space so the provider sees a clean word boundary.
  ws.send(JSON.stringify({
    text: "Hello! This is your AI receptionist. How can I help today? "
  }));

  // Final empty text closes the stream cleanly
  ws.send(JSON.stringify({ text: "" }));
};

ws.onmessage = (msg) => {
  const data = JSON.parse(msg.data);
  if (data.audio) {
    // The audio field is base64-encoded raw PCM. atob() returns a binary
    // string, so convert it to real bytes before handing it to playback.
    const audioBytes = Uint8Array.from(atob(data.audio), (c) => c.charCodeAt(0));
    playAudioChunk(audioBytes); // your audio output function (not defined here)
  }
  if (data.isFinal) {
    console.log("TTS stream complete");
    ws.close();
  }
};

ws.onerror = (e) => console.error("TTS error", e);

Key parameters explained

model_id — pick the right model for your latency/quality trade-off. ElevenLabs offers eleven_turbo_v2_5 (fastest, ~250ms first byte) and eleven_multilingual_v2 (higher quality, ~500ms first byte).
output_format — what audio format you want. pcm_16000 is raw 16-bit PCM at 16kHz, easiest to pipe into telephony or browsers. mp3_44100_128 is compressed but harder to start playing instantly.
voice_settings.stability — how consistent the delivery is. Lower (0.3-0.5) = more emotional variation, higher (0.7-0.9) = more consistent across calls. For voice AI, 0.5-0.7 is a good range.
voice_settings.similarity_boost — how strictly to match the original voice. Higher = more like the source recording but can amplify artifacts. 0.75 is a sensible default.
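
When working with a raw PCM output format it helps to know how much audio each chunk carries, for example to size a playback buffer. The sketch below assumes 16kHz 16-bit mono (the pcm_16000 format above) and a base64-encoded audio field as in the streaming messages; the helper names are illustrative and the payload is simulated rather than fetched from a provider.

```python
import base64


def pcm_duration_ms(n_bytes: int, sample_rate: int = 16000,
                    sample_width: int = 2, channels: int = 1) -> float:
    """Duration of a raw PCM buffer in milliseconds."""
    frame_bytes = sample_width * channels
    if n_bytes % frame_bytes:
        raise ValueError("buffer does not contain a whole number of frames")
    return (n_bytes // frame_bytes) * 1000.0 / sample_rate


def decode_chunk(audio_b64: str) -> bytes:
    """Decode the base64 audio field of a streaming message into raw bytes."""
    return base64.b64decode(audio_b64)


# Simulated payload: 3200 bytes of silence, encoded the way a provider would send it.
payload = base64.b64encode(b"\x00" * 3200).decode("ascii")
chunk = decode_chunk(payload)
print(len(chunk), "bytes =", pcm_duration_ms(len(chunk)), "ms")
```

At 16kHz 16-bit mono, one second of audio is 32,000 bytes, so a 20-200ms chunk is roughly 640 to 6,400 bytes.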

Streaming text from an LLM into TTS

The real win of streaming TTS is being able to start synthesising audio while the LLM is still generating text. Forward text to TTS as it arrives from the LLM stream, flushing at sentence boundaries rather than token by token:

LLM → TTS pipelining (sketch; openTTSStream is your own helper)

import OpenAI from "openai";

const openai = new OpenAI(); // reads OPENAI_API_KEY from the environment

// openTTSStream(voiceId) is assumed to wrap the WebSocket example above and
// resolve once the socket is open and the config message has been sent.
async function speakReply(conversation, voiceId) {
  // 1. Open TTS WebSocket immediately when user finishes speaking
  const ttsSocket = await openTTSStream(voiceId);

  // 2. Stream the LLM response, forwarding text to TTS as it arrives
  const llmStream = await openai.chat.completions.create({
    model: "gpt-4o",
    messages: conversation,
    stream: true
  });

  let buffer = "";
  for await (const chunk of llmStream) {
    buffer += chunk.choices[0]?.delta?.content ?? "";

    // Send to TTS at sentence boundaries (clean intonation)
    if (/[.!?]\s$/.test(buffer)) {
      ttsSocket.send(JSON.stringify({ text: buffer }));
      buffer = "";
    }
  }

  // 3. Flush any trailing fragment, then close the TTS stream
  if (buffer) ttsSocket.send(JSON.stringify({ text: buffer + " " }));
  ttsSocket.send(JSON.stringify({ text: "" }));
}

This pattern collapses the perceived latency dramatically. Without it, the user waits for the full LLM response before any audio plays. With it, the wait before the first audio is only ASR endpointing, then LLM time-to-first-token plus the first sentence, then TTS first-byte time. That keeps the gap inside the 1-second budget, because the rest of the LLM generation and the rest of the synthesis overlap with playback instead of adding to the wait.

Sentence-boundary chunking matters: sending text mid-sentence to TTS produces awkward intonation because the model can't predict where the sentence is heading. Split on sentence terminators (. ! ?) at minimum, or on clause boundaries (, and ;) for tighter latency at the cost of slightly more robotic phrasing. Don't split on word boundaries — that produces choppy delivery.
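
The same sentence-boundary rule can be written as a small generator that sits between any token stream and the TTS socket. This is a sketch in Python rather than the JavaScript above; sentences_from_tokens is an illustrative name, and the regex is deliberately simple, so abbreviations such as "Dr. Smith" will still split early.

```python
import re
from typing import Iterable, Iterator

# A terminator only counts once whitespace follows it, so "3.5" is never split.
_SENTENCE_END = re.compile(r"[.!?]\s+")


def sentences_from_tokens(tokens: Iterable[str]) -> Iterator[str]:
    """Buffer streamed tokens and yield one complete sentence at a time."""
    buffer = ""
    for token in tokens:
        buffer += token
        while True:
            match = _SENTENCE_END.search(buffer)
            if match is None:
                break
            yield buffer[:match.end()].strip()
            buffer = buffer[match.end():]
    # Flush whatever is left when the LLM stream ends.
    if buffer.strip():
        yield buffer.strip()


tokens = ["Hel", "lo", "!", " This", " is", " your", " AI", " receptionist",
          ".", " How", " can", " I", " help", "?"]
for sentence in sentences_from_tokens(tokens):
    print(repr(sentence))
```

Swapping the pattern for one that also matches commas and semicolons gives the clause-level variant described above, trading some intonation quality for lower latency.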

06 · Setting Up Batch TTS (File Generation)

Batch TTS is dramatically simpler than streaming — one HTTP call returns one audio file. Use it for any pre-rendered content: IVR prompts, voicemail greetings, podcast generation, video voiceover, accessibility playback, audio notifications.

Batch TTS via REST (OpenAI example)

Python: synchronous batch TTS

from openai import OpenAI
client = OpenAI(api_key="YOUR_API_KEY")

response = client.audio.speech.create(
    model="tts-1-hd",
    voice="nova",
    input="Welcome to Team-Connect. Please hold while I find an agent for you.",
    response_format="mp3",
    speed=1.0
)

with open("greeting.mp3", "wb") as f:
    f.write(response.content)

That's it. The full audio file is in greeting.mp3 after the call returns. Generation typically takes 1-3 seconds for short utterances. For longer text (over a few hundred words), use streaming response mode and pipe to disk:

Python: streaming batch TTS for long text

with client.audio.speech.with_streaming_response.create(
    model="tts-1-hd",
    voice="nova",
    input=long_text,
    response_format="mp3"
) as response:
    response.stream_to_file("output.mp3")

Cache pre-rendered audio aggressively

If you have static prompts that play frequently — greetings, hold messages, error confirmations — render them once with batch TTS, store the audio file, and serve it from disk or CDN forever. Each cached prompt saves a TTS API call per use, which adds up to real money at scale.

Python: pre-render + cache pattern

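import hashlib
import os

from openai import OpenAI

client = OpenAI(api_key="YOUR_API_KEY")
CACHE_DIR = "/var/cache/tts"
MODEL = "tts-1-hd"

def get_or_render(text: str, voice: str = "nova") -> str:
    """Return path to cached audio file, rendering if necessary."""
    # Key on model + voice + text so changing any of them triggers a re-render
    key = hashlib.sha256(f"{MODEL}:{voice}:{text}".encode()).hexdigest()
    path = os.path.join(CACHE_DIR, f"{key}.mp3")
    if os.path.exists(path):
        return path

    response = client.audio.speech.create(
        model=MODEL, voice=voice, input=text, response_format="mp3"
    )
    os.makedirs(CACHE_DIR, exist_ok=True)
    # Write to a temp file and rename, so a crash never leaves a truncated entry
    tmp_path = f"{path}.tmp"
    with open(tmp_path, "wb") as f:
        f.write(response.content)
    os.replace(tmp_path, path)
    return path

# Usage:
audio_path = get_or_render("Welcome to Team-Connect. How can I help?")
play_audio_file(audio_path)  # your playback function; served from disk, ~1ms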

Self-hosted TTS

For air-gapped, data-sovereignty or high-volume use cases, open-source TTS has matured significantly:

Python: Piper for fast on-CPU TTS

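# pip install piper-tts
# Download the voice model and its matching .onnx.json config first.
import wave
from piper.voice import PiperVoice

voice = PiperVoice.load("en_GB-alba-medium.onnx")

with wave.open("output.wav", "wb") as wav:
    # Call shape from the 1.2.x releases; newer piper-tts releases expose
    # the same operation as voice.synthesize_wav(text, wav).
    voice.synthesize("Hello from Team-Connect.", wav)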

Python: Kokoro (Apache-licensed, ~80M parameters, near-commercial quality)

# pip install kokoro
from kokoro import KPipeline
import soundfile as sf

pipeline = KPipeline(lang_code='b')  # 'b' = British English

generator = pipeline(
    "Hello from Team-Connect. How can I help today?",
    voice='bf_emma',
    speed=1.0
)
for i, (graphemes, phonemes, audio) in enumerate(generator):
    sf.write(f"output-{i}.wav", audio, 24000)

Self-hosted options worth knowing

Piper — small, fast, runs on CPU and even Raspberry Pi. The default for embedded and edge.
Kokoro — 2024 release, Apache-licensed, ~80M parameters, quality close to commercial providers. The current sweet spot for self-hosting.
F5-TTS — 2024 diffusion-based model with strong voice cloning from short samples. Heavier but high-quality.
OuteTTS — uses LLM-style autoregressive token generation, supports voice cloning, MIT licensed.
Coqui XTTS-v2 — multilingual, 17 languages, voice cloning, formerly commercial now community-maintained.
Suno Bark — expressive but slow and prone to hallucination on long inputs; better for short emotional bursts than long-form.

07 · SSML and Prosody Control

SSML (Speech Synthesis Markup Language) is an XML-based standard for telling TTS exactly how to pronounce text. Without SSML, any non-trivial production TTS will mispronounce names, numbers, dates, addresses and abbreviations regularly. With SSML, you get precise control over pauses, emphasis, rate, pitch, volume and pronunciation.

The core SSML elements you need

Element	What it does	Example
`<break>`	Insert a pause	`Welcome.<break time="500ms"/> How can I help?`
`<emphasis>`	Stress a word or phrase	`That is <emphasis level="strong">extremely</emphasis> important.`
`<prosody>`	Change rate, pitch, volume	`<prosody rate="slow" pitch="-2st">Read this carefully.</prosody>`
`<say-as>`	Interpret as a specific type	`Your reference is <say-as interpret-as="characters">ABC123</say-as>`
`<phoneme>`	Specify exact pronunciation via IPA	`<phoneme alphabet="ipa" ph="təˈmɑːtəʊ">tomato</phoneme>`
`<sub>`	Speak an alias in place of the written text	`<sub alias="World Health Organisation">WHO</sub>`
`<voice>`	Switch voice mid-utterance	`<voice name="emma">Hello.</voice> <voice name="ben">Hi back.</voice>`
`<lang>`	Switch language for a phrase	`I would like to order <lang xml:lang="fr-FR">croissants</lang>.`
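
Because SSML is XML, a stray ampersand or an unclosed tag will typically cause the provider to reject the whole request. A cheap pre-flight check is sketched below; wrap_ssml and is_well_formed are illustrative helpers, the bare speak root is an assumption (some providers also expect version and namespace attributes), and passing this check says nothing about whether a given provider supports each tag.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape


def wrap_ssml(fragment: str) -> str:
    """Wrap an SSML fragment in a root speak element."""
    return f"<speak>{fragment}</speak>"


def is_well_formed(ssml: str) -> bool:
    """Return True if the document parses as XML."""
    try:
        ET.fromstring(ssml)
    except ET.ParseError:
        return False
    return True


examples = {
    "valid break": wrap_ssml('Welcome.<break time="500ms"/> How can I help?'),
    "unclosed emphasis": wrap_ssml('That is <emphasis level="strong">extremely important.'),
    "raw ampersand": wrap_ssml("Terms & conditions apply."),
    "escaped ampersand": wrap_ssml(escape("Terms & conditions apply.")),
}
for label, doc in examples.items():
    print(f"{label}: {is_well_formed(doc)}")
```

Escaping user-supplied or LLM-generated text before it is placed inside SSML is the part most often forgotten; the markup you add yourself should stay unescaped.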

The most useful SSML patterns in voice AI

Phone numbers and reference codes

Without SSML - synthesiser may mangle

Your reference number is ABC123456.

With SSML — spelt out clearly

Your reference number is
<say-as interpret-as="characters">ABC</say-as>
<say-as interpret-as="digits">123456</say-as>.

Dates and times

Without SSML — pronunciation varies by provider

Your appointment is on 03/05/2026 at 14:30.

With SSML — explicit and consistent

Your appointment is on
<say-as interpret-as="date" format="dmy">03/05/2026</say-as>
at <say-as interpret-as="time" format="hms24">14:30</say-as>.

Custom pronunciation for proper names

Force pronunciation of difficult names

Welcome to
<phoneme alphabet="ipa" ph="tiːm kəˈnɛkt">Team-Connect</phoneme>.
You're speaking to
<phoneme alphabet="ipa" ph="mæθjuː">Mathew</phoneme>.

Pacing and natural pauses

Without SSML — rushed and robotic

I can offer Tuesday at 2pm or Thursday at 4pm. Which works better?

With SSML — humanlike pacing

I can offer Tuesday at two p.m.
<break time="400ms"/>
or Thursday at four p.m.
<break time="600ms"/>
Which works better?

Provider SSML support varies wildly

Google Cloud TTS - the broadest SSML implementation of the group; supports most of the W3C spec including <mark> for time-stamping, plus Google extensions such as <par> for parallel audio.
Azure Speech - full SSML plus their own MSTTS extensions for emotional styles ("cheerful", "sad", "angry", "whispering").
Amazon Polly - broad SSML support plus Polly-specific extensions like <amazon:domain> for news/conversational styles. Tag support varies by voice engine, so check each tag against the engine you use.
OpenAI TTS - does not support SSML. Pacing is inferred from punctuation and natural language. Add commas, full stops and ellipses to influence delivery.
ElevenLabs - supports a subset (<break>, <phoneme>) and uses its own emotional cue system through bracketed annotations like [whispers] and [laughs] in the input text.
Cartesia - minimal SSML; relies on natural language and speed/emotion API parameters.

Always abstract SSML behind your own helpers. If you write SSML inline throughout your application code, switching providers becomes a rewrite. Define helper functions like spell_out(text), pause(ms), say_phone_number(num) that produce provider-appropriate output, and call those from your business logic. The day you switch from ElevenLabs to Google Cloud TTS, you change one module, not a thousand call sites.
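
One possible shape for such a helper layer is sketched below. The Markup class and reference_prompt function are hypothetical names, and the plain-text fallbacks (an ellipsis for a pause, comma-separated characters for spelling out) are assumptions based on the punctuation advice above, not documented provider behaviour.

```python
from dataclasses import dataclass
from xml.sax.saxutils import escape


@dataclass(frozen=True)
class Markup:
    """Renders speech hints for one provider family."""
    ssml: bool  # True for SSML-capable providers, False for plain-text ones

    def text(self, raw: str) -> str:
        # Plain text must be XML-escaped only when it ends up inside SSML.
        return escape(raw) if self.ssml else raw

    def pause(self, ms: int) -> str:
        return f'<break time="{ms}ms"/>' if self.ssml else " ... "

    def spell_out(self, code: str) -> str:
        if self.ssml:
            return f'<say-as interpret-as="characters">{escape(code)}</say-as>'
        return ", ".join(code)  # "ABC" becomes "A, B, C"

    def document(self, *parts: str) -> str:
        body = "".join(parts)
        return f"<speak>{body}</speak>" if self.ssml else body


def reference_prompt(m: Markup, ref: str) -> str:
    """Business logic calls helpers only; it never writes markup itself."""
    return m.document(
        m.text("Your reference is "), m.spell_out(ref), m.text("."),
        m.pause(400), m.text("Anything else?"),
    )


print(reference_prompt(Markup(ssml=True), "ABC123"))
print(reference_prompt(Markup(ssml=False), "ABC123"))
```

The call site is identical for both provider families, so switching provider means constructing a different Markup (or a richer subclass) in one place.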

08 · MOS and TTS Quality Metrics

MOS (Mean Opinion Score) is the gold-standard quality metric for synthesised speech: MOS is to TTS what Word Error Rate is to ASR. It is worth knowing well, because every claim of "best naturalness" is built on it.

The MOS scale

MOS is a 1-to-5 score, with 1 being unintelligible and 5 being indistinguishable from human recordings. Multiple human listeners score sample audio and the scores are averaged.

Score	Quality level	Description
5	Excellent	Indistinguishable from a real human reading the same text
4	Good	Clearly synthetic but pleasant; would not annoy a listener
3	Fair	Synthetic, occasional unnaturalness, intelligible
2	Poor	Robotic, choppy, or noisy; intelligible with effort
1	Bad	Unintelligible or distorted

What 2026 TTS providers actually score

Indicative ranges on standard English benchmark utterances (treat them as approximate, since MOS depends on the test set and the rater pool):

Human reference - 4.5-4.7 (listeners rarely give even real recordings a uniform 5.0, partly because of natural disfluencies)
ElevenLabs Multilingual v2 - 4.4-4.6
Cartesia Sonic 2 - 4.3-4.5
OpenAI TTS-1-HD - 4.2-4.4
Google Studio voices - 4.2-4.4
Azure Neural voices - 4.1-4.3
OpenAI TTS-1 (the cheaper tier) - 3.8-4.1
Concatenative TTS (early 2000s) - 3.5-4.0
Older parametric TTS - 3.0-3.5

The top tier of modern TTS is essentially at the ceiling. You no longer get meaningfully better voice naturalness by paying more — you get features (latency, voice cloning, languages, expressiveness range).

Objective MOS proxies

Real MOS testing requires paying human listeners, which is slow and expensive. For development workflows you want automated proxies that approximate MOS:

UTMOS — neural network trained to predict MOS scores from audio. The de-facto research standard. pip install utmosv2 gives you a single function call.
DNSMOS — Microsoft's perceptual quality model, less correlated with MOS than UTMOS but easier to run.
NISQA — predicts MOS plus noisiness, colouration, discontinuity, loudness on the same 1-5 scale.
Mel-cepstral distortion (MCD) — mathematical distance from a reference. Useful for comparing two synthesised versions of the same content; less useful for absolute quality.
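
Of these proxies, MCD is the only one simple enough to write out directly. The sketch below uses the commonly quoted form, (10 / ln 10) * sqrt(2 * sum of squared coefficient differences), averaged over frames and skipping coefficient 0 (overall energy). It assumes the mel-cepstral frames have already been extracted and time-aligned by some other tool; the function name is illustrative.

```python
import math
from typing import Sequence


def mel_cepstral_distortion(
    ref_frames: Sequence[Sequence[float]],
    syn_frames: Sequence[Sequence[float]],
) -> float:
    """Mean MCD in dB between two aligned sequences of mel-cepstral frames."""
    if not ref_frames or len(ref_frames) != len(syn_frames):
        raise ValueError("need two non-empty, equally long frame sequences")
    scale = 10.0 / math.log(10.0) * math.sqrt(2.0)
    total = 0.0
    for ref, syn in zip(ref_frames, syn_frames):
        if len(ref) != len(syn):
            raise ValueError("frames must have the same number of coefficients")
        # Skip coefficient 0, which mostly encodes loudness rather than timbre.
        squared = sum((r - s) ** 2 for r, s in zip(ref[1:], syn[1:]))
        total += scale * math.sqrt(squared)
    return total / len(ref_frames)


# Tiny made-up frames, purely to exercise the function.
reference = [[0.0, 1.0, 2.0], [0.0, 0.5, 0.5]]
identical = [[9.0, 1.0, 2.0], [3.0, 0.5, 0.5]]   # differs only in coefficient 0
shifted = [[0.0, 1.0, 3.0], [0.0, 0.5, 1.5]]     # one coefficient off by 1.0
print(mel_cepstral_distortion(reference, identical))
print(mel_cepstral_distortion(reference, shifted))
```

Lower is better, and the value is only meaningful relative to another synthesis of the same reference utterance.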

How to actually compare TTS providers

Public benchmarks tell you what providers achieve on standard datasets. They tell you nothing about how a provider performs on your specific use case. Always test on your own content:

Pick 20-50 utterances representative of what your voice AI will actually say. Include hard cases — long sentences, abbreviations, names, numbers, multi-language code-switching.
Generate each utterance with each candidate provider, with comparable settings (voice, model, format).
Have humans rate each one on the MOS scale, blind to provider. 5+ raters per sample minimum.
Average the scores, but look at the distribution, not just the mean. A provider with 4.4 mean MOS but 2.0 worst case is worse for production than one with 4.2 mean but 3.5 worst case.
Specifically check: does the provider mispronounce your domain vocabulary? Names, products, places. This is where providers most often fail and where SSML cannot fully save you.
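
The averaging step can be sketched as follows. This assumes the ratings have been collected into a dictionary keyed by provider, with one list of rater scores per utterance; the function name and the choice to rank by worst case before mean are illustrative, reflecting the point that a bad worst case matters more than a slightly higher average.

```python
from statistics import mean


def summarise_ratings(ratings: dict[str, list[list[int]]]) -> list[dict]:
    """ratings[provider] holds one list of rater scores (1-5) per utterance."""
    rows = []
    for provider, per_utterance in ratings.items():
        utterance_means = [mean(scores) for scores in per_utterance]
        rows.append({
            "provider": provider,
            "mean": round(mean(utterance_means), 2),
            "worst": round(min(utterance_means), 2),
        })
    # Rank by worst-case utterance first, then by overall mean.
    return sorted(rows, key=lambda r: (r["worst"], r["mean"]), reverse=True)
```

With real data you would also want to inspect the lowest-scoring utterances by ear, since the number alone does not say why a sample failed.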

Beyond MOS: the metrics that matter for voice AI. MOS measures naturalness. For voice AI you also care about (a) appropriateness for the task — a warm, friendly voice scores high MOS but might be wrong for legal disclaimers; (b) intelligibility under your audio conditions, especially if calls go through telephony codec compression; (c) consistency across calls so the same agent sounds like the same agent on Tuesday and Thursday; (d) latency, which has zero MOS contribution but matters more than 0.2 MOS to user experience. Optimise for the whole stack, not the single metric.

09 TTS in a Voice AI Pipeline

TTS is the last stage of the real-time voice AI pipeline and the one users hear most directly. To make voice AI feel conversational, every stage has to hit its budget — and the user-perceived latency is dominated by the time between "they finished speaking" and "I can hear the AI starting to speak". Here is how TTS fits in.

The five-stage pipeline (recap)

Stage	Latency budget	What it does
1. Capture & transport	< 50ms	Audio leaves user device, arrives at your edge
2. Voice Activity Detection	< 30ms	Decide whether speech is happening
3. Voice Recognition (ASR)	200-300ms	Audio → text, including endpointing
4. LLM	300-500ms	Text → response text (time-to-first-token)
5. TTS	100-200ms first-byte	Response text → audio

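Total budget: under 1 second from "user stops speaking" to "user hears the AI starting". The upper bounds in the table sum to about 1,080ms, so the stages cannot all run at their worst case in the same turn. TTS gets ~100-200ms for time-to-first-audio. Everything past that runs in parallel with playback: subsequent audio chunks should arrive faster than they play out.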
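
A minimal sketch of checking one turn against these budgets, assuming you already record a latency per stage. The stage keys and the `check_turn` name are illustrative, and treating the stages as strictly sequential is a simplification.

```python
# Upper bounds in milliseconds, copied from the table above.
STAGE_BUDGET_MS = {
    "capture_transport": 50,
    "vad": 30,
    "asr": 300,
    "llm_first_token": 500,
    "tts_first_byte": 200,
}
TOTAL_BUDGET_MS = 1000


def check_turn(measured_ms: dict[str, float]) -> dict:
    """Compare one turn's measured stage latencies with the budgets."""
    over = {s: ms for s, ms in measured_ms.items() if ms > STAGE_BUDGET_MS[s]}
    total = sum(measured_ms.values())
    return {
        "total_ms": total,
        "within_total": total < TOTAL_BUDGET_MS,
        "over_budget": over,
    }
```

A turn can stay inside every per-stage limit and still miss the one-second total, which is why the check reports both.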

The latency-saving moves that matter

Stream TTS, never batch. Batch TTS in a real-time voice AI is an automatic 1-3 second penalty. Always stream.
Pipeline LLM → TTS. Don't wait for the LLM to finish; pipe text into TTS at sentence boundaries as it streams. This is the single biggest perceived-latency optimisation, often saving 1+ second on longer responses.
Pre-render common openers. "Hi, how can I help?" is the same 99% of the time — render it once, play from cache. Saves the entire TTS round-trip on the first turn.
Provider geography matters. If your TTS provider's region is in US-East and your call origin is UK, you add 80-100ms of round-trip just for TTS. Pick a region near your users.
Connection reuse. Opening a fresh WebSocket per turn adds 100-200ms of TCP and TLS handshake plus the WebSocket upgrade. Keep TTS connections open across turns of the same call where the API supports it (most do).
Audio format choice. Raw PCM streams instantly; MP3 needs framing. For real-time voice AI, request raw PCM at the playback sample rate.
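
The "pipeline LLM to TTS" move can be sketched as a small generator that buffers streamed LLM tokens and releases text only at sentence boundaries. This is an illustration under simple assumptions: a sentence ends at ".", "!" or "?" followed by whitespace, and `sentences_from_tokens` is a hypothetical name, not part of any provider SDK.

```python
import re
from typing import Iterable, Iterator

# A sentence boundary: terminal punctuation followed by whitespace.
_SENTENCE_END = re.compile(r"(?<=[.!?])\s+")


def sentences_from_tokens(tokens: Iterable[str]) -> Iterator[str]:
    """Yield complete sentences as soon as the token stream finishes them."""
    buffer = ""
    for token in tokens:
        buffer += token
        parts = _SENTENCE_END.split(buffer)
        # Every part except the last is a finished sentence.
        for sentence in parts[:-1]:
            if sentence.strip():
                yield sentence.strip()
        buffer = parts[-1]
    # Flush whatever is left when the LLM stream ends.
    if buffer.strip():
        yield buffer.strip()
```

Each yielded sentence would be sent to the streaming TTS connection immediately. The naive boundary rule will also split after abbreviations such as "Dr.", so production code usually needs an exception list.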

Barge-in: the hardest TTS problem in voice AI

What happens when the user starts speaking while the AI is still talking? Two design decisions:

Detection — how do you know the user has started talking? Voice activity detection on the user's audio stream while TTS is playing. The challenge: the AI's own TTS audio might leak into the user's microphone (especially on speakerphone) and trigger false positives. Solve with acoustic echo cancellation (AEC).
Action — when you detect user speech during TTS playback, you need to stop TTS immediately. With streaming TTS this means closing the WebSocket or signalling cancel; with already-buffered audio this means flushing the playback buffer; with TTS that has already left your server but not reached the user, you need to suppress those final chunks too. Worst case is half a second of TTS audio still arriving after you stopped wanting it.

Production voice AI in 2026 is universally barge-in enabled. The technical setup: full-duplex audio (TTS playing AND ASR listening simultaneously), good AEC to subtract TTS from microphone input, and tight cancel signalling to your TTS provider. ElevenLabs, Cartesia and Deepgram Aura all expose explicit cancellation APIs — use them.
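
One way to cover all three cases in the "Action" step (provider cancel, buffered audio, and chunks still in flight) is to tag every chunk with a turn id and invalidate that id on barge-in. The sketch below is a simplified, single-threaded illustration; `BargeInController` and the `cancel_tts` callback are hypothetical and stand in for whatever cancel call your TTS client exposes.

```python
from collections import deque
from typing import Callable, Deque


class BargeInController:
    def __init__(self, cancel_tts: Callable[[int], None]) -> None:
        self._cancel_tts = cancel_tts  # hypothetical hook into the TTS client
        self._turn_id = 0
        self.playback: Deque[bytes] = deque()

    def start_turn(self) -> int:
        """Begin a new AI turn and return the id to tag its chunks with."""
        self._turn_id += 1
        return self._turn_id

    def on_tts_chunk(self, turn_id: int, chunk: bytes) -> bool:
        """Queue a chunk for playback; drop it if its turn was cancelled."""
        if turn_id != self._turn_id:
            return False  # late chunk from a cancelled turn
        self.playback.append(chunk)
        return True

    def on_user_speech(self) -> None:
        """Called when VAD (after echo cancellation) detects the user."""
        cancelled = self._turn_id
        self._turn_id += 1            # invalidate chunks still in flight
        self.playback.clear()         # flush audio already buffered
        self._cancel_tts(cancelled)   # tell the provider to stop generating
```

Dropping late chunks by turn id means the worst-case trailing audio never reaches the speaker even if the provider's cancel takes a moment to take effect.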

Audio buffer management

Streaming TTS produces audio chunks faster than they play out (a 300ms chunk arrives in ~50ms). You buffer chunks at the playback edge so playback is smooth even if some chunks are slow. Tuning:

Initial buffer — how much audio to accumulate before starting playback. 100-200ms is typical — enough to absorb network jitter, short enough not to delay perceived start.
Buffer floor — if buffer drains to this level, playback might stutter. Pause or slow playback if approached.
Buffer ceiling — if buffer exceeds this, you're getting too much from the provider; throttle the WebSocket if it supports it.
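
The three thresholds can be sketched as a small playback buffer. The 150ms initial value falls inside the typical range above; the floor and ceiling defaults are placeholder assumptions to tune for your own network conditions, and the class itself is illustrative rather than any library's API.

```python
from collections import deque
from typing import Optional


class PlaybackBuffer:
    def __init__(self, initial_ms: int = 150, floor_ms: int = 40,
                 ceiling_ms: int = 2000) -> None:
        self.initial_ms = initial_ms
        self.floor_ms = floor_ms
        self.ceiling_ms = ceiling_ms
        self._chunks: deque = deque()  # (audio_bytes, duration_ms) pairs
        self.buffered_ms = 0
        self.started = False

    def push(self, audio: bytes, duration_ms: int) -> None:
        """Add a chunk from the provider; start playback once primed."""
        self._chunks.append((audio, duration_ms))
        self.buffered_ms += duration_ms
        if not self.started and self.buffered_ms >= self.initial_ms:
            self.started = True

    def pop(self) -> Optional[bytes]:
        """Return the next chunk to play, or None if playback should wait."""
        if not self.started or not self._chunks:
            return None
        audio, duration_ms = self._chunks.popleft()
        self.buffered_ms -= duration_ms
        return audio

    @property
    def near_underrun(self) -> bool:
        """True when the buffer has drained to the floor."""
        return self.started and self.buffered_ms <= self.floor_ms

    @property
    def should_throttle(self) -> bool:
        """True when the buffer has reached the ceiling."""
        return self.buffered_ms >= self.ceiling_ms
```

The playback loop would call `pop()` on each audio callback and check `near_underrun` and `should_throttle` to decide whether to pause or apply back-pressure.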

The integration mistake that quietly kills voice AI products: not measuring time-to-first-audio in production. Teams obsess over WER and LLM tokens-per-second but forget to instrument TTS first-byte time. Add it as a per-call metric with p50, p95, p99 percentiles. When this drifts above 300ms, your conversation feels broken regardless of how good every other component is — and you won't know unless you measure.
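
A minimal sketch of the percentile report, assuming you log one time-to-first-audio sample in milliseconds per TTS request. It uses the nearest-rank method; alerting on p95 against the 300ms threshold is an assumption, and the function names are illustrative.

```python
import math


def percentile(samples: list[float], pct: int) -> float:
    """Nearest-rank percentile, with pct between 1 and 100."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]


def ttfa_report(samples_ms: list[float], alert_ms: float = 300.0) -> dict:
    """Summarise time-to-first-audio and flag drift past the threshold."""
    report = {f"p{p}": percentile(samples_ms, p) for p in (50, 95, 99)}
    report["alert"] = report["p95"] > alert_ms
    return report
```

In production these numbers would normally come from your metrics system's histogram support; the point of the sketch is only which quantities to track.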

10 Common TTS Issues and Troubleshooting

TTS failures cluster into about seven recurring patterns. Triage in this order.

Specific names or jargon mispronounced

Almost universal. "Anthropic" becomes "an tropic"; "Claude" becomes "cloud"; town names get stress on the wrong syllable. Fix in this order: (1) Try alternative spellings ("Anthropic" -> "Anthropik" usually works). (2) Use SSML <phoneme> with IPA for the difficult words. (3) Use <sub alias="..."> to substitute a phonetic spelling at speak time. (4) For ElevenLabs, fine-tune a voice on examples that include your domain vocabulary. The first two cover 90% of cases.
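
Fixes (2) and (3) are easier to maintain when driven from a pronunciation lexicon rather than hand-edited strings. The sketch below assumes a provider that accepts standard SSML; the lexicon entries, the IPA transcription and the `apply_lexicon` name are illustrative and should be checked by ear against your chosen voice.

```python
import re
from xml.sax.saxutils import escape, quoteattr

# Hypothetical lexicon: word -> (SSML element, value). Values are examples.
LEXICON = {
    "Anthropic": ("phoneme", "ænˈθrɒpɪk"),
    "Claude": ("sub", "Clawd"),
}


def apply_lexicon(text: str, lexicon: dict = LEXICON) -> str:
    """Wrap known words in <phoneme> or <sub> and escape everything else."""

    def render(word: str) -> str:
        kind, value = lexicon[word]
        if kind == "phoneme":
            return (f'<phoneme alphabet="ipa" ph={quoteattr(value)}>'
                    f"{escape(word)}</phoneme>")
        return f"<sub alias={quoteattr(value)}>{escape(word)}</sub>"

    pattern = re.compile(r"\b(" + "|".join(map(re.escape, lexicon)) + r")\b")
    # split() with a capture group alternates plain text and matched words.
    pieces = pattern.split(text)
    body = "".join(render(p) if p in lexicon else escape(p) for p in pieces)
    return f"<speak>{body}</speak>"
```

Escaping the plain-text pieces matters: an unescaped "&" or "<" in LLM output makes the whole SSML document invalid.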

Robotic delivery / unnatural pauses

Usually means you're feeding the TTS un-punctuated text or splitting at the wrong granularity. Add proper punctuation to your LLM output (full stops at sentence ends, commas at clause boundaries, ellipses for hesitation). For streaming, only send sentence-bounded chunks — never split mid-sentence. If your LLM produces long run-on sentences, post-process to break them.

Audio glitches at chunk boundaries

Pop, click or repeated syllable at the join between two streamed chunks. Usually a buffering or sample-rate mismatch issue. Verify the audio format your provider sends matches what you're decoding (PCM 16k vs 24k vs 48k). Check that you're concatenating raw PCM bytes correctly — if the chunks include WAV headers, you're inserting headers mid-stream which breaks decoding. Most TTS providers offer a "headerless" PCM mode for streaming — use that.
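
If a provider cannot be switched to headerless PCM, a defensive fallback is to strip the RIFF/WAVE header from any chunk that carries one before concatenating. The sketch below walks the RIFF sub-chunks to find the start of the sample data instead of assuming a fixed 44-byte header; `strip_wav_header` is an illustrative helper, not a provider API.

```python
import struct


def strip_wav_header(chunk: bytes) -> bytes:
    """Return raw PCM if chunk starts with a RIFF/WAVE header, else chunk."""
    if len(chunk) < 12 or chunk[:4] != b"RIFF" or chunk[8:12] != b"WAVE":
        return chunk
    offset = 12
    while offset + 8 <= len(chunk):
        chunk_id = chunk[offset:offset + 4]
        (size,) = struct.unpack("<I", chunk[offset + 4:offset + 8])
        if chunk_id == b"data":
            # Samples start right after the 8-byte sub-chunk header.
            return chunk[offset + 8:]
        # Skip this sub-chunk; RIFF sub-chunks are padded to even length.
        offset += 8 + size + (size & 1)
    return chunk  # no data sub-chunk found, leave the bytes untouched
```

The declared data size is deliberately ignored, because streaming WAV headers often carry a placeholder length.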

Latency spikes during normal traffic

TTS providers don't guarantee consistent first-byte latency under load. Investigate: are you hitting the per-account concurrency limit? Is your WebSocket pool exhausted? Is there geographic distance you can reduce by switching regions? Use percentile metrics (p50, p95, p99) not averages — voice AI users feel p99 latency, not the mean.

SSML being read out loud

Classic mistake: sending SSML to a provider that doesn't support it (OpenAI TTS, ElevenLabs basic API). The TTS reads the markup literally as "less than break time equals five hundred milliseconds slash greater than". Either use SSML only with supporting providers, or wrap SSML in helper functions that emit empty strings for non-supporting providers and let punctuation handle pacing.
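
The helper-function approach can be sketched as below. The capability table is an assumption for illustration and must be confirmed against each provider's current documentation; the bare `<speak>` wrapper is also a simplification, since some providers require extra attributes on that element.

```python
# Assumed capability table; verify against each provider's current docs.
SUPPORTS_SSML = {"azure": True, "google": True, "polly": True, "openai": False}


def pause(provider: str, ms: int) -> str:
    """Emit an SSML break for supporting providers, otherwise nothing."""
    if SUPPORTS_SSML.get(provider, False):
        return f'<break time="{ms}ms"/>'
    return ""


def render(provider: str, parts: list[str]) -> str:
    """Join text parts, wrapping in <speak> only when SSML is supported."""
    text = " ".join(p for p in parts if p)
    if SUPPORTS_SSML.get(provider, False):
        return f"<speak>{text}</speak>"
    return text
```

For example, `render("openai", ["One moment.", pause("openai", 500), "Here it is."])` produces plain text with no markup, so punctuation alone carries the pacing. Unknown providers default to plain text, which fails safe.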

Voice consistency drifting across calls

Same voice ID, same text, different intonation between calls. Usually a stochastic generation artifact. ElevenLabs: increase the stability setting to 0.7-0.85 for more consistency at the cost of expressiveness. Cartesia: use a fixed seed if available. OpenAI: usually deterministic by default. For audiobook-style use cases where consistency is paramount, batch-render the entire script in one call so the model has full context.

"AI voice" giveaways the user picks up on

Even excellent neural TTS has subtle signatures: slightly too smooth pacing, identical intonation on repeated phrases, no breaths, no filler words. To hide the AI nature: add occasional natural pauses ("<break time='400ms'/>"), include filler tokens ("um", "let me see"), vary phrasing across turns rather than reusing the same template, and mix in pre-recorded human snippets for crucial moments (greeting, sign-off). Conversely: in many voice AI products users prefer the AI to be obviously AI — honest disclosure is a feature, not a bug.

Audit your TTS regularly: set up a daily process that samples 20-50 random utterances from real traffic, plays them back to a human (or runs them through a MOS proxy like UTMOS), and tracks observed quality over time. TTS quality silently drifts as your domain vocabulary expands (new product names, new place names), as providers update their models (sometimes worse on edge cases you relied on), and as your conversational patterns shift. The teams that catch quality regressions early are the ones with this monitoring loop in place.

TTS Configuration FAQs

The questions our voice AI customers ask most often when planning their text-to-speech stack.

What is the best TTS provider in 2026?

There is no single best; the right choice depends on your priority. For maximum naturalness and emotional expressiveness, ElevenLabs leads but costs more. For the lowest latency in voice AI (40-100ms time-to-first-audio), Cartesia and Deepgram Aura are purpose-built for real-time use. For the lowest cost at scale, consider OpenAI TTS-1 or AWS Polly. For maximum language coverage, consider Google Cloud TTS (50+ languages) or Azure Speech (140+). For self-hosting, consider Piper, F5-TTS or Kokoro. Most voice AI products in 2026 use ElevenLabs or Cartesia in production.

What is MOS (Mean Opinion Score) for TTS?

MOS is a subjective quality metric for synthesised speech, scored 1 (bad) to 5 (excellent) by human listeners. A MOS of 4.0+ is considered indistinguishable from human recordings in casual listening. Modern neural TTS providers (ElevenLabs, OpenAI TTS-1-HD, Cartesia Sonic, Google Studio) score 4.2-4.6 MOS on standard benchmarks. Older parametric TTS scored 3.0-3.5; concatenative TTS from the early 2000s scored around 3.5-4.0. MOS is the gold standard for TTS quality but expensive to run; objective proxies like UTMOS and DNSMOS approximate it cheaply for development workflows.

Can I clone a specific person's voice with TTS?

Yes. Voice cloning is a standard feature of providers like ElevenLabs, Resemble AI, Play.ht and Cartesia. Most require 30 seconds to a few minutes of clean audio of the target speaker. Quality has improved dramatically since 2023; in 2026, cloned voices are often indistinguishable from genuine recordings. There are serious ethical and legal concerns: voice cloning without consent is impersonation. Reputable providers require explicit consent attestation, and many jurisdictions are introducing voice-likeness legislation (Tennessee's ELVIS Act in the US, the EU AI Act). For commercial use, only clone voices where you have written consent, and document it.

Can I run TTS without sending text to the cloud?

Yes. Open-source TTS has improved significantly: Piper (Rhasspy project) is small and runs on CPU, useful for embedded; Kokoro is a 2024 Apache-licensed model rivalling commercial quality at ~80M parameters; F5-TTS is a 2024 diffusion model with strong voice cloning; Coqui TTS (now community-maintained) and OuteTTS are also viable. Self-hosting trades operational complexity for data sovereignty and per-character cost savings at high volumes. For most voice AI products, managed TTS is dramatically simpler and competitive on cost up to millions of characters per month.
