Team-Connect Engineering Guide · Updated 3 May 2026

Voice Recognition Setup

A practical engineer's guide to voice recognition setup — how speech-to-text providers compare, what streaming versus batch actually means in production, the latency budgets that make voice AI feel conversational, and the accuracy tuning that turns a 15% Word Error Rate into 5%.

Covers Deepgram, Whisper, Google, Azure, AWS · Real-time + batch · Self-hosted + managed

01 What is Voice Recognition?

Voice recognition — more precisely Automatic Speech Recognition (ASR) — is the technology that converts spoken audio into written text. In voice AI applications it is the front door: every customer utterance has to pass through ASR before an LLM can understand or respond to it. The quality of your ASR sets a hard ceiling on the quality of everything downstream — if ASR mishears "cancel my booking" as "consoled my booking", no amount of clever prompt engineering will recover.

Three terms that get mixed up

  • Voice recognition / ASR / STT — converts spoken words to text. The three terms are used interchangeably; "voice recognition" is the consumer term, "ASR" is academic, "STT" (speech-to-text) is the API/SaaS term.
  • Voice biometrics / speaker recognition — identifies who is speaking, not what they said. Used for authentication. Different problem entirely.
  • Voice synthesis / TTS — the reverse direction: text to spoken audio. Covered in our separate TTS configuration guide.

How modern voice recognition actually works

Until around 2017, ASR systems used a pipeline: an acoustic model (audio → phonemes), a pronunciation dictionary (phonemes → words), and a language model (words → likely sentences). Modern systems (2019 onwards) use end-to-end neural networks — typically transformer-based encoder-decoder architectures — that map audio waveforms directly to text in one model. OpenAI's Whisper, Deepgram's Nova, Google's USM, and Meta's MMS are all variations on this approach. The benefit: dramatically better accuracy, especially on accents, noisy audio and code-switching between languages. The cost: bigger models, more compute, less interpretability.

Where voice recognition fits in a voice AI stack

A typical voice AI call goes through this pipeline, in order:

  1. Audio capture — from a phone (SIP/PSTN), a browser (WebRTC), or a mobile app.
  2. Audio encoding/codec — usually Opus for browser and app audio, or G.711 (often µ-law) for telephony.
  3. Voice Activity Detection (VAD) — decides when speech is happening so silence isn't transcribed.
  4. Voice recognition / ASR — turns the audio into text. This is what this guide covers.
  5. LLM — understands the user's intent and generates a response.
  6. TTS — turns the response back into audio.
  7. Audio playback — sends the audio back through the same channel.
The latency budget reality: for a voice AI to feel conversational, the user-stops-speaking to AI-starts-speaking gap should be under 1 second. Of that, voice recognition typically gets 200-300ms, the LLM gets 300-500ms, and TTS gets 100-200ms. Voice recognition is often the largest controllable variable because endpointing — detecting that the user has actually finished — can add 100-500ms by itself depending on how you tune it.

02 Choosing a Voice Recognition Provider

The voice recognition market in 2026 has converged on a small group of credible providers. Word Error Rates on clean English audio are within 1-2 percentage points across all of them — the differences that matter in production are latency, language coverage, customisation depth, pricing model, and how easy the SDK is to integrate.

Honest provider comparison

Provider | Strongest at | Weakest at | Best for
Deepgram | Real-time latency (sub-300ms), conversational English, voice AI tuning, endpointing | Language coverage (40 languages vs Google's 125), niche-domain accuracy without custom models | Voice AI agents, contact centres, real-time transcription
OpenAI Whisper (managed via OpenAI) | Multilingual coverage, robust on accents, low cost per minute | Higher latency than streaming providers; no native real-time API (Whisper is batch-only without third-party wrappers) | Batch transcription, subtitling, multilingual content
Whisper (self-hosted) | Data sovereignty, predictable cost at scale, no per-minute fees | Operational complexity, real-time requires extra engineering, GPU infrastructure | Air-gapped deployments, healthcare, legal, government
Google Cloud Speech-to-Text | 125+ languages, enterprise SLAs, deep integration with other Google services | Latency higher than Deepgram, pricing complex, model versions confusing (Chirp vs USM vs telephony models) | Multilingual products, GCP-native pipelines
Azure AI Speech | Microsoft 365 / Teams integration, custom speech for fine-tuning, enterprise compliance | SDK quality varies by language, slower release cadence | Microsoft-centric enterprises, regulated industries
Amazon Transcribe | AWS-native, batch quality, good speaker diarisation | Real-time added later than competitors, fewer language options | AWS-native pipelines, recorded call analytics
AssemblyAI | Diarisation, sentiment, content moderation, speaker labels | Higher latency on real-time, smaller language list | Podcast/video transcription with metadata, content analysis
Speechmatics | UK and Commonwealth accents, robust on noisy audio, on-prem option | Smaller ecosystem, less third-party tooling | UK-heavy use cases, broadcast captioning

The decision framework

Rather than benchmark exhaustively, decide along three axes:

  1. Real-time or batch? If the answer is "real-time, conversational latency", you are choosing between Deepgram, AssemblyAI streaming, and Google streaming. Whisper is essentially out unless you wrap it yourself.
  2. Languages? If you need anything beyond English and the major European languages, Google Cloud Speech and Whisper dominate. Deepgram covers 40 languages well; Azure covers 100+.
  3. Hosting? If audio cannot leave your infrastructure (healthcare, legal, certain government), self-hosted Whisper or Speechmatics on-prem are the realistic options. Everyone else is fine with a managed API.
What Team-Connect uses, and why: our voice AI agents use Deepgram for real-time call transcription. The decision drivers were sub-300ms streaming latency, a clean WebSocket API, native endpointing, and per-minute pricing that makes capacity planning predictable. We also use Whisper for offline tasks (call recording transcription, training data preparation) where batch is fine and the per-minute cost difference adds up. This kind of "streaming-managed + batch-self-hosted" split is common in voice AI operations.

03 Streaming vs Batch Recognition

The single biggest architectural decision in voice recognition setup is streaming versus batch. They are different products with different APIs, different latencies, different prices and different accuracy characteristics — and choosing the wrong one breaks your application in ways no amount of tuning can fix.

Streaming (real-time) recognition

Audio flows into the ASR service as it is captured. Interim transcripts come back within hundreds of milliseconds, with final transcripts confirmed when the speaker pauses or finishes. The protocol is almost always WebSocket: you open a connection, stream audio chunks (typically 20-50ms each), and receive transcript JSON messages as they are produced.

  • Latency: 100-300ms interim, 200-500ms final.
  • Use cases: live phone calls, voice AI agents, live captions, dictation, real-time meeting transcription.
  • Cost: typically more expensive per minute than batch.
  • Accuracy: very slightly lower than batch because the model cannot use future context.

Batch recognition

You upload a complete audio file and receive a single final transcript when processing finishes. The protocol is REST: POST audio, poll for completion (or receive a webhook), GET transcript. Most providers process batch faster than real time — a 1-hour file might transcribe in 5-10 minutes.

  • Latency: minutes, not milliseconds.
  • Use cases: recorded call analysis, podcast/video transcription, meeting recap, subtitle generation, training data preparation.
  • Cost: typically 30-50% cheaper per minute than streaming.
  • Accuracy: very slightly higher than streaming because the model sees the entire utterance before deciding.

The trade-off table

Aspect | Streaming | Batch
Protocol | WebSocket (or gRPC) | REST + polling/webhook
Time to first word | ~200ms | 30 seconds to several minutes
Final transcript latency | ~500ms after speaker stops | Roughly 5-10% of audio duration
Required for... | Live conversation, voice AI | Recorded analysis, captions
Per-minute cost | $0.0035-0.025 | $0.0015-0.015
Accuracy difference | Reference | +0.5 to +2% better WER
Implementation complexity | Higher (audio chunking, error recovery, reconnect) | Lower (one HTTP call)
Bandwidth pattern | Sustained low-bandwidth | One large upload

The "use both" pattern

Many production voice AI systems run streaming and batch in parallel:

  • Streaming drives the live interaction — the user gets responses in real time.
  • Batch runs on the recorded call audio after the call ends — producing a higher-quality archival transcript for analytics, compliance and training.

This is more expensive than picking one but gives you live responsiveness AND archive quality. It is what most contact-centre voice AI products do internally even when they do not advertise it.
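
To put a rough number on "more expensive", the per-minute ranges from the trade-off table can be added together. This is a back-of-envelope sketch only: the 100,000-minute workload and the function names are hypothetical, and real pricing depends on provider, tier and volume discounts.

```python
# Per-minute price ranges (USD) copied from the trade-off table above.
STREAMING_PER_MIN = (0.0035, 0.025)
BATCH_PER_MIN = (0.0015, 0.015)


def monthly_cost(minutes: float, price_range: tuple) -> tuple:
    """Return the (low, high) cost for a given number of audio minutes."""
    low, high = price_range
    return (minutes * low, minutes * high)


def use_both_cost(minutes: float) -> tuple:
    """Cost when every call is transcribed live and again in batch afterwards."""
    s_low, s_high = monthly_cost(minutes, STREAMING_PER_MIN)
    b_low, b_high = monthly_cost(minutes, BATCH_PER_MIN)
    return (s_low + b_low, s_high + b_high)


# Hypothetical workload: 100,000 call minutes per month.
low, high = use_both_cost(100_000)
print(f"streaming + batch: ${low:,.0f} to ${high:,.0f} per month")
```

At these list-price ranges the batch pass adds roughly 40-60% on top of the streaming bill, which is the number to weigh against the value of the archival transcript.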

04 Audio Input Requirements

The audio you feed to a speech recognition service has more impact on accuracy than any other variable. Get this layer wrong and no amount of model selection or fine-tuning will save you. Get it right and you will be operating near the model's accuracy ceiling regardless of which provider you chose.

The four parameters that matter

Parameter | What it is | Best for ASR | What goes wrong if you don't
Sample rate | Samples per second of audio | 16 kHz minimum, 16-48 kHz typical | 8 kHz (telephony) caps accuracy by losing high frequencies. Above 48 kHz wastes bandwidth.
Bit depth | Bits per sample | 16-bit PCM | 8-bit is too noisy for modern ASR. 24-bit is wasted detail.
Channels | Mono vs stereo | Mono unless you need speaker separation | Stereo doubles bandwidth without improving accuracy. Use channels only if separating speakers.
Codec / encoding | Compression scheme | WAV (PCM), FLAC, or Opus 24+ kbps | MP3 at any bitrate adds 3-15% to WER. G.711 (telephony) is fine because that's what the input was anyway.

The "good enough" recipe for voice AI

For real-time voice AI processing telephone calls:

  • Source codec: whatever the carrier sends — usually G.711 (µ-law or a-law) at 8 kHz.
  • Resample to: 16 kHz mono PCM 16-bit before sending to the recogniser. Most ASR providers will accept G.711 directly and resample internally, but doing it on your edge gives you control over the upsampling quality.
  • Chunk size: 20ms frames if you control the source, otherwise whatever the codec produces (G.711 is typically 20ms; Opus is configurable).
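
The edge-side conversion described above (G.711 µ-law at 8 kHz in, 16 kHz 16-bit mono PCM out) can be sketched in pure Python. The decoder follows the standard µ-law expansion; the upsampler uses plain linear interpolation, which is an assumption made for brevity. A production path would normally use a proper low-pass or polyphase resampler. All function names here are illustrative.

```python
import struct


def ulaw_byte_to_pcm16(byte: int) -> int:
    """Decode one G.711 mu-law byte to a signed 16-bit linear sample."""
    u = ~byte & 0xFF                  # mu-law bytes are transmitted bit-inverted
    sign = u & 0x80
    exponent = (u >> 4) & 0x07
    mantissa = u & 0x0F
    magnitude = (((mantissa << 3) + 0x84) << exponent) - 0x84
    return -magnitude if sign else magnitude


def upsample_2x_linear(samples: list) -> list:
    """Double the sample rate by inserting the midpoint between neighbours."""
    out = []
    for i, current in enumerate(samples):
        nxt = samples[i + 1] if i + 1 < len(samples) else current
        out.append(current)
        out.append((current + nxt) // 2)
    return out


def ulaw_8k_to_pcm16_16k(payload: bytes) -> bytes:
    """Convert an 8 kHz mu-law payload to 16 kHz little-endian 16-bit PCM."""
    pcm_8k = [ulaw_byte_to_pcm16(b) for b in payload]
    pcm_16k = upsample_2x_linear(pcm_8k)
    return struct.pack(f"<{len(pcm_16k)}h", *pcm_16k)


# One 20ms telephony frame is 160 mu-law bytes; it becomes 640 bytes of 16 kHz PCM.
frame = bytes([0xFF] * 160)           # 0xFF is mu-law silence
print(len(ulaw_8k_to_pcm16_16k(frame)))
```

Upsampling does not restore the frequencies telephony already discarded; it only puts the audio in the format the recogniser expects.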

For real-time voice AI processing browser/app audio:

  • Capture: WebRTC at 48 kHz Opus.
  • Resample to: 16 kHz mono before sending to ASR if your provider doesn't accept Opus directly. Most do.
  • Chunk size: 20-50ms.
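
Chunk sizes in milliseconds translate to byte counts once the sample rate and bit depth are fixed. A small sketch, with hypothetical helper names, that computes the frame size and slices a PCM buffer into fixed-size frames (zero-padding the last one is a design choice here, not a provider requirement):

```python
def frame_size_bytes(sample_rate_hz: int, frame_ms: int,
                     bytes_per_sample: int = 2, channels: int = 1) -> int:
    """Bytes in one frame of uncompressed PCM audio."""
    samples_per_frame = sample_rate_hz * frame_ms // 1000
    return samples_per_frame * bytes_per_sample * channels


def iter_frames(pcm: bytes, frame_bytes: int):
    """Yield fixed-size frames; a trailing partial frame is padded with silence."""
    for start in range(0, len(pcm), frame_bytes):
        frame = pcm[start:start + frame_bytes]
        if len(frame) < frame_bytes:
            frame += b"\x00" * (frame_bytes - len(frame))
        yield frame


# 16 kHz, 16-bit, mono, 20ms frames -> 640 bytes per frame.
size = frame_size_bytes(16_000, 20)
frames = list(iter_frames(b"\x01" * 1000, size))
print(size, len(frames))
```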

What about codec choice on capture?

If you are designing the capture path (not just receiving from a carrier or browser), capture in the highest-fidelity format that fits your bandwidth budget:

  • Local recording for batch transcription: WAV (16 kHz, 16-bit, mono). Largest file but reference accuracy.
  • Real-time over network: Opus at 24-32 kbps. WAV-class accuracy at roughly a tenth of the bandwidth (16 kHz, 16-bit, mono PCM is 256 kbps).
  • Telephony: G.711 is what carriers will give you; resampling helps but cannot recreate frequencies above 4 kHz that telephony already discarded.
  • What to avoid: MP3 at any bitrate, AAC at low bitrates, anything below 16 kHz sample rate, anything that includes loud music or significant background noise.

For the full breakdown of codec choice on speech recognition accuracy, including specific Word Error Rate impact per codec, see our MP3 vs WAV vs Opus for speech recognition comparison.

Voice Activity Detection (VAD)

Before sending audio to ASR, run it through a VAD that detects whether speech is actually happening. Streaming ASR billed per minute will charge you for silence; sending silence also wastes the recogniser's attention and can produce hallucinated transcriptions of background noise. Browser-based vad.js and Silero VAD (Python) are the open-source standards. Most carrier-side voice platforms have VAD built in already.
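
To make the idea concrete, here is a deliberately naive energy-based gate. It is not how Silero VAD works (that uses a trained neural model) and the 500.0 RMS threshold is an arbitrary assumption that would need tuning per microphone and noise floor. It only illustrates where a speech/no-speech decision sits in the path for 16-bit little-endian PCM frames.

```python
import math
import struct


def frame_rms(frame: bytes) -> float:
    """Root-mean-square amplitude of a 16-bit little-endian PCM frame."""
    count = len(frame) // 2
    if count == 0:
        return 0.0
    samples = struct.unpack(f"<{count}h", frame[:count * 2])
    return math.sqrt(sum(s * s for s in samples) / count)


def is_speech(frame: bytes, threshold: float = 500.0) -> bool:
    """Crude gate: treat any frame louder than the threshold as speech."""
    return frame_rms(frame) >= threshold


# 20ms of digital silence at 16 kHz, then 20ms of a 440 Hz tone.
silence = bytes(640)
tone = struct.pack(
    "<320h",
    *[int(8000 * math.sin(2 * math.pi * 440 * n / 16_000)) for n in range(320)],
)
print(is_speech(silence), is_speech(tone))
```

An energy gate passes loud non-speech noise and drops quiet speech, which is why model-based VADs are the usual choice in production.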

The single biggest unforced error: sending stereo audio to mono-trained models. Most consumer microphones produce stereo by default. The model will average the channels (best case) or run on one and ignore the other (worst case), but in either case you have wasted bandwidth and possibly muddied the signal. Always confirm channel layout matches what your provider expects.
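
If the capture device does produce stereo and you do not need speaker separation, one option is to downmix explicitly before sending rather than leaving it to the provider. A minimal sketch assuming interleaved left/right 16-bit little-endian PCM; the function name is illustrative. If the two channels carry different speakers, send them as separate channels instead of averaging.

```python
import struct


def stereo_to_mono_pcm16(stereo: bytes) -> bytes:
    """Average interleaved L/R 16-bit samples into a single mono channel."""
    count = len(stereo) // 2
    samples = struct.unpack(f"<{count}h", stereo[:count * 2])
    # Step through (left, right) pairs; an unpaired trailing sample is dropped.
    mono = [(samples[i] + samples[i + 1]) // 2 for i in range(0, count - 1, 2)]
    return struct.pack(f"<{len(mono)}h", *mono)


stereo = struct.pack("<4h", 100, 300, -200, 0)
print(struct.unpack("<2h", stereo_to_mono_pcm16(stereo)))
```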

05 Setting Up Real-Time Recognition (Streaming)

Streaming voice recognition is essentially the same protocol shape across providers: open a WebSocket, stream audio chunks, receive transcript JSON. The differences are URL, headers, audio format and the exact JSON schema. Here is a working example using Deepgram — the same pattern translates to Google, Azure or AssemblyAI with minor tweaks.

The shape of a streaming integration

  1. Open a WebSocket to the provider with your API key and audio parameters.
  2. Stream raw audio bytes (or encoded frames) as binary WebSocket messages.
  3. Receive transcript messages as JSON. Each message is either interim (might still change) or final (committed).
  4. On user pause, the provider emits an endpoint event — "I think they have finished speaking" — which you use to trigger downstream actions.
  5. Close the WebSocket when the call ends.

Working JavaScript example (Deepgram-style)

Real-time streaming voice recognition over WebSocket

// Open the recogniser WebSocket
const url = new URL('wss://api.deepgram.com/v1/listen');
url.searchParams.set('encoding', 'linear16');
url.searchParams.set('sample_rate', '16000');
url.searchParams.set('channels', '1');
url.searchParams.set('model', 'nova-3');
url.searchParams.set('interim_results', 'true');
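url.searchParams.set('endpointing', '300'); // ms of silence = end-of-utterance
url.searchParams.set('utterance_end_ms', '1000'); // UtteranceEnd messages (handled below) are only sent when this is set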
url.searchParams.set('language', 'en-GB');

const ws = new WebSocket(url.toString(), ['token', YOUR_API_KEY]);

ws.onopen = () => {
  console.log('ASR socket open');
  // Now start streaming audio chunks
};

ws.onmessage = (msg) => {
  const data = JSON.parse(msg.data);
  if (data.type === 'Results') {
    const alt = data.channel.alternatives[0];
    if (data.is_final) {
      handleFinalTranscript(alt.transcript);
    } else {
      handleInterimTranscript(alt.transcript);
    }
  } else if (data.type === 'UtteranceEnd') {
    onUserStoppedSpeaking();
  }
};

ws.onerror = (e) => console.error('ASR error', e);
ws.onclose = () => console.log('ASR socket closed');

// Stream audio frames as they arrive (e.g. from a MediaStream)
function sendAudioFrame(audioBuffer) {
  if (ws.readyState === WebSocket.OPEN) {
    ws.send(audioBuffer); // raw 16-bit PCM bytes
  }
}
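
The same dispatch logic can be written as a pure function on the server side, which makes it easy to unit test without a live socket. A minimal Python sketch, assuming only the message fields used in the JavaScript above (type, is_final, channel.alternatives[0].transcript); the function name and the event labels are made up for illustration.

```python
import json


def classify_message(raw: str) -> tuple:
    """Map one recogniser message to an (event, transcript) pair."""
    data = json.loads(raw)
    msg_type = data.get("type")
    if msg_type == "UtteranceEnd":
        return ("utterance_end", "")
    if msg_type != "Results":
        return ("ignored", "")           # metadata and other message types
    alternatives = data.get("channel", {}).get("alternatives", [])
    transcript = alternatives[0].get("transcript", "") if alternatives else ""
    event = "final" if data.get("is_final") else "interim"
    return (event, transcript)


sample = json.dumps({
    "type": "Results",
    "is_final": True,
    "channel": {"alternatives": [{"transcript": "cancel my booking"}]},
})
print(classify_message(sample))
```

Keeping the parsing separate from the socket handling also gives you one place to guard against feeding interim transcripts to the LLM by accident.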

Key parameters explained

  • encoding — tells the recogniser what bytes you are sending. linear16 is raw 16-bit PCM; alternatives include opus, mulaw (G.711 µ-law), flac, amr-wb.
  • sample_rate — must match the actual audio. Mismatched sample rates produce gibberish transcripts.
  • model — pick the right one for your domain. nova-3 is general-purpose; telephony, medical and meeting variants exist.
  • interim_results — whether you want partial transcripts as the user speaks, or only finals. For voice AI, always true — you can show "thinking" UI based on interim updates.
  • endpointing — how many milliseconds of silence trigger an end-of-utterance event. 300ms is a good default; lower feels snappier but cuts users off, higher feels slower.
  • language — BCP-47 code (en-GB, en-US, es-ES...). Wrong language code is a common silent accuracy killer.

The endpointing trade-off

Endpointing is the largest user-perceptible variable in voice AI. Too aggressive (endpoint after 100ms of silence) and the AI cuts users off mid-sentence whenever they pause to think. Too conservative (endpoint after 1500ms) and the AI feels sluggish and unresponsive. The sweet spot for most voice AI applications is 300-500ms of silence, with smarter providers offering "voice activity detection" features that distinguish thinking pauses from end-of-utterance pauses using semantic cues. Deepgram's utterance_end_ms and Google's "endless utterance" features both attempt this.
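
The basic silence-threshold behaviour is simple enough to sketch. The class below is a hypothetical, client-side illustration of a pure silence-timer endpointer fed by per-frame VAD decisions. It is not how any particular provider implements endpointing, and it has none of the semantic cues mentioned above.

```python
class SilenceEndpointer:
    """Fire once when trailing silence after speech reaches the threshold."""

    def __init__(self, silence_ms: int = 300, frame_ms: int = 20) -> None:
        self.frames_needed = max(1, silence_ms // frame_ms)
        self.silent_frames = 0
        self.heard_speech = False

    def push(self, frame_is_speech: bool) -> bool:
        """Feed one VAD decision; return True when the utterance has ended."""
        if frame_is_speech:
            self.heard_speech = True
            self.silent_frames = 0
            return False
        if not self.heard_speech:
            return False                 # leading silence never triggers an endpoint
        self.silent_frames += 1
        if self.silent_frames >= self.frames_needed:
            self.heard_speech = False    # re-arm for the next utterance
            self.silent_frames = 0
            return True
        return False


# 300ms threshold with 20ms frames means 15 consecutive silent frames.
endpointer = SilenceEndpointer(silence_ms=300, frame_ms=20)
decisions = [True] * 5 + [False] * 15
fired_at = [i for i, d in enumerate(decisions) if endpointer.push(d)]
print(fired_at)
```

Changing silence_ms from 300 to 500 moves the endpoint from the 15th to the 25th silent frame, which is exactly the 200ms of extra perceived latency the trade-off describes.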

The hidden cost of streaming: opening and holding a WebSocket per concurrent call ties up file descriptors and memory on your server. At scale (1000+ concurrent calls), the connection pool itself becomes a capacity constraint. Plan for this with connection pooling, graceful shutdown, and monitoring of WebSocket lifecycle events — transcript quality is irrelevant if the connection silently dies mid-call.

06 Setting Up File-Based Recognition (Batch)

Batch recognition is dramatically simpler than streaming. You POST audio to an endpoint and either get a transcript back synchronously (small files) or receive a job ID and poll/webhook when it's done (large files). Most providers offer both modes; pick async for anything over a few minutes.

Synchronous batch (small files, sub-1-minute)

Python: synchronous batch transcription via REST

import requests

API_KEY = "YOUR_API_KEY"
url = "https://api.deepgram.com/v1/listen"
params = {
    "model": "nova-3",
    "language": "en-GB",
    "punctuate": "true",
    "diarize": "true",       # speaker labels
    "smart_format": "true"   # numbers, dates, money formatted readably
}
headers = {
    "Authorization": f"Token {API_KEY}",
    # requests does not infer Content-Type from a raw file body, so set it explicitly
    "Content-Type": "audio/wav",
}

with open("call_recording.wav", "rb") as f:
    response = requests.post(
        url,
        params=params,
        headers=headers,
        data=f,        # streams the file as the raw request body
        timeout=120,   # seconds; avoid hanging forever on a stalled upload
    )

response.raise_for_status()  # fail loudly on 4xx/5xx instead of parsing an error body
result = response.json()
transcript = result["results"]["channels"][0]["alternatives"][0]["transcript"]
print(transcript)

Asynchronous batch (large files, podcasts, recorded calls)

For files over a minute or two, switch to async. You either pass a callback URL (webhook) or poll a job-status endpoint. The pattern is the same across providers.

Python: async batch with webhook callback

import requests

API_KEY = "YOUR_API_KEY"
url = "https://api.deepgram.com/v1/listen"
params = {
    "model": "nova-3",
    "callback": "https://your-app.example.com/asr-done",
    "punctuate": "true",
    "diarize": "true"
}
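headers = {"Authorization": f"Token {API_KEY}", "Content-Type": "audio/flac"}

with open("3-hour-podcast.flac", "rb") as f:
    response = requests.post(url, params=params, headers=headers, data=f, timeout=600)
response.raise_for_status()  # surface auth or format errors before relying on the callback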

# Returns immediately with a request_id; transcript will POST to your webhook
print(response.json())  # {"request_id": "...", "request_status": "processing"}

# Your webhook handler:
# @app.post("/asr-done")
# def handle_asr_done(request):
#     payload = request.json
#     transcript = payload["results"]["channels"][0]["alternatives"][0]["transcript"]
#     save_to_database(transcript, request_id=payload["metadata"]["request_id"])

Self-hosted Whisper

For air-gapped or data-sovereignty cases, OpenAI Whisper has open weights and runs locally. Whisper-large-v3 needs roughly 10 GB VRAM for real-time inference and produces accuracy competitive with managed providers. Whisper-tiny runs on a Raspberry Pi (slowly).

Python: self-hosted Whisper via the openai-whisper package

import whisper

# Load the model once at startup; takes 5-30 seconds depending on size
model = whisper.load_model("large-v3")  # alternatives: tiny, base, small, medium

# Transcribe a file
result = model.transcribe(
    "call_recording.wav",
    language="en",
    task="transcribe",      # or "translate" for translation to English
    fp16=False,             # set True if running on GPU with half-precision support
    word_timestamps=True    # get per-word timing
)

print(result["text"])
for segment in result["segments"]:
    print(f"[{segment['start']:.2f} - {segment['end']:.2f}] {segment['text']}")

Other self-hosted options worth knowing

  • faster-whisper — CTranslate2-based reimplementation, up to 4x faster than the reference Whisper at the same accuracy, with lower memory use. A common default for production self-hosted Whisper.
  • whisper.cpp — pure C++ port that runs on CPU at usable speed. Good for embedded and edge devices.
  • NVIDIA NeMo / Riva — production-grade self-hosted streaming with a commercial-friendly licence; needs NVIDIA GPUs.
  • Vosk — small, fast, runs on CPU and even mobile. Older architecture (acoustic + language models) so accuracy is lower than Whisper, but operationally simpler.
  • SeamlessM4T — Meta's multilingual model, strong on translation and rare languages.

07 Word Error Rate (WER) Explained

Word Error Rate (WER) is the standard quality metric for speech recognition. It measures how often the transcript disagrees with what was actually said, and you should be familiar with the formula because every claim of "best accuracy" you read online is built on it.

The WER formula

WER counts three kinds of error and divides by the reference word count:

WER = (Substitutions + Deletions + Insertions) / Total reference words

  • Substitution — one word replaced by another. "I went to the store" → "I went to the floor".
  • Deletion — a word in the reference is missing from the hypothesis. "I went to the store" → "I went to store".
  • Insertion — a word in the hypothesis is not in the reference. "I went to the store" → "I went to the the store".

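A WER of 5% means 5 errors per 100 reference words. WER cannot go below 0%, but it can exceed 100%, because insertions are not bounded by the reference length: a hypothesis with enough extra words can contain more errors than the reference has words.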
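
For intuition, the formula can be implemented directly with a word-level edit-distance alignment. This is a teaching sketch with made-up function names, not a replacement for a maintained library: it does no text normalisation (case, punctuation), and when several minimal alignments exist the split between substitutions, deletions and insertions is one valid choice among several.

```python
def wer_counts(reference: str, hypothesis: str) -> tuple:
    """Return (substitutions, deletions, insertions) for a minimal alignment."""
    ref, hyp = reference.split(), hypothesis.split()
    # cost[i][j] = (total_errors, S, D, I) for ref[:i] against hyp[:j]
    cost = [[(0, 0, 0, 0)] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(1, len(ref) + 1):
        cost[i][0] = (i, 0, i, 0)        # everything deleted
    for j in range(1, len(hyp) + 1):
        cost[0][j] = (j, 0, 0, j)        # everything inserted
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            if ref[i - 1] == hyp[j - 1]:
                cost[i][j] = cost[i - 1][j - 1]
                continue
            t, s, d, n = cost[i - 1][j - 1]
            substitution = (t + 1, s + 1, d, n)
            t, s, d, n = cost[i - 1][j]
            deletion = (t + 1, s, d + 1, n)
            t, s, d, n = cost[i][j - 1]
            insertion = (t + 1, s, d, n + 1)
            cost[i][j] = min(substitution, deletion, insertion)
    _, s, d, n = cost[len(ref)][len(hyp)]
    return (s, d, n)


def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (S + D + I) / number of reference words."""
    ref_len = len(reference.split())
    if ref_len == 0:
        raise ValueError("reference must contain at least one word")
    return sum(wer_counts(reference, hypothesis)) / ref_len


# The three examples from the list above.
print(wer_counts("I went to the store", "I went to the floor"))
print(wer_counts("I went to the store", "I went to store"))
print(wer_counts("I went to the store", "I went to the the store"))
```

Each of the three examples has exactly one error against five reference words, so each scores a WER of 20%.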

What "good" WER looks like in 2026

Audio condition | Excellent | Good | Acceptable | Problematic
Clean studio English (read speech) | < 3% | 3-5% | 5-8% | > 8%
Conversational English, low noise | < 5% | 5-8% | 8-12% | > 12%
Telephony (8 kHz) | < 8% | 8-12% | 12-18% | > 18%
Heavy accent or noisy environment | < 12% | 12-18% | 18-25% | > 25%
Multiple speakers, overlap | < 15% | 15-22% | 22-30% | > 30%

For voice AI in production, you want to be in the "Good" column at worst on your typical audio conditions. Below 10% WER is the threshold where the LLM downstream can usually recover from misrecognitions; above 15% you start seeing user-visible failures regularly.

How to actually measure WER on your traffic

Public benchmarks tell you what providers achieve on standard datasets (LibriSpeech, Common Voice, TED-LIUM). They tell you nothing about how a provider performs on your audio. Always run your own benchmark before committing.

  1. Collect 30-100 representative audio clips from your real traffic. Make sure they include the accents, background noise and domain vocabulary you actually encounter.
  2. Have a human transcribe each one accurately — this is your reference.
  3. Run each provider's API on the same clips, with the same settings, and capture each transcript.
  4. Compute WER between each provider's output and the reference using a library like jiwer (Python) or sclite (NIST tooling).
  5. Look at not just the average but the distribution — a provider with 6% mean WER but 20% on 1 in 10 clips is worse for production than one with 8% mean WER and 12% worst case.

Python: computing WER with jiwer

from jiwer import wer

reference = "Hello I'd like to book an appointment for next Tuesday at three"
hypothesis = "Hello I'd like to book a appointment for next Tuesday at three pm"

error_rate = wer(reference, hypothesis)
print(f"WER: {error_rate:.2%}")  # WER: 16.67%

Beware vendor benchmarks: every vendor's website claims best-in-class WER. They are usually quoting on a dataset chosen to flatter their model. The benchmark that matters is yours, on your audio, against your domain vocabulary. Spending one engineering day building this is worth more than a week reading vendor blog posts.
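
Step 5 of the benchmark procedure (look at the distribution, not just the average) takes only a few lines once you have one WER value per clip. The per-clip numbers below are invented purely to mirror the "low mean, bad tail" versus "higher mean, tight tail" scenario; the function name is illustrative.

```python
import statistics


def summarise(per_clip_wer: list) -> dict:
    """Mean, median and worst-case WER across a benchmark set."""
    ordered = sorted(per_clip_wer)
    return {
        "mean": statistics.mean(ordered),
        "median": statistics.median(ordered),
        "worst": ordered[-1],
    }


# Hypothetical per-clip WERs for two providers on the same ten clips.
provider_a = [0.045] * 9 + [0.20]
provider_b = [0.06, 0.07, 0.07, 0.08, 0.08, 0.08, 0.08, 0.09, 0.09, 0.12]

for name, scores in (("A", provider_a), ("B", provider_b)):
    stats = summarise(scores)
    print(f"provider {name}: mean {stats['mean']:.1%}, "
          f"median {stats['median']:.1%}, worst {stats['worst']:.1%}")
```

Provider A wins on the mean and loses badly on the worst clip, which is the pattern a single headline number hides.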

08 Improving Recognition Accuracy

Once you have a baseline WER measured on your traffic, the next question is how to lower it. There are five techniques in roughly descending order of impact — the first two account for 80% of realistic gains.

1. Capture better audio (the highest-leverage fix)

Half the time the answer is not "switch model" but "fix the input". Common audio improvements that move WER 3-8 percentage points:

  • Move from 8 kHz to 16 kHz sample rate (where the signal chain allows).
  • Replace MP3 capture with WAV or Opus.
  • Apply noise reduction at capture time (RNNoise, Krisp SDK).
  • Use a directional microphone or noise-cancelling headset for in-person scenarios.
  • Reduce echo with acoustic echo cancellation (AEC); WebRTC has this built in.

2. Custom vocabulary / keyword boosting

Tell the recogniser the domain-specific words it might hear: product names, place names, customer names, technical jargon, brand names. These are rare or absent in ASR training data, so models tend to mistranscribe them ("Anthropic" → "an tropic", "Claude" → "cloud") unless you provide them.

Deepgram-style keyword boosting

params = {
    "model": "nova-3",
    "keyterm": [
        "Team-Connect",
        "DadLink Technologies",
        "Macclesfield",
        "ADHD Brain Scan",
        "qEEG"
    ]
}

Each provider has its own term for this: Deepgram calls it keyterm or keywords, Google calls it "speech contexts" or "phrase hints", Azure calls it "phrase lists", AssemblyAI calls it "word boost". They all do roughly the same thing: nudge the language model towards recognising those specific words.
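
A list-valued parameter like the one above normally goes on the wire as a repeated query key. The sketch below shows that encoding with the standard library only. Whether a given provider expects repeated keys, a comma-separated value or a JSON body is an assumption to check against its API reference.

```python
from urllib.parse import urlencode

params = {
    "model": "nova-3",
    "keyterm": ["Team-Connect", "DadLink Technologies", "qEEG"],
}

# doseq=True expands each list element into its own key=value pair.
query = urlencode(params, doseq=True)
print(query)
```

The requests library applies the same expansion automatically when a list is passed in params, so the dictionary above can be handed to requests.post unchanged.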

3. Choose a domain-tuned model

Most providers offer pre-trained domain variants:

  • Telephony / contact-centre — tuned for 8 kHz audio and conversational patterns. Significantly better than the default model on phone calls.
  • Medical — tuned for medical terminology. Useful in healthcare voice products.
  • Legal — tuned for legal terminology and dictation patterns.
  • Meeting — tuned for multi-speaker conversations and meeting jargon.
  • Video / broadcast — tuned for produced media with music and effects.

Switching from a general model to a telephony model on phone-call audio typically lowers WER by 2-5 percentage points at no extra cost.

4. Custom fine-tuning

Some providers let you fine-tune a model on your own labelled audio (Azure Custom Speech, Google Adaptation, Deepgram custom models, self-hosted Whisper LoRA). This is high effort — you need labelled audio in the thousands of utterances range — but can move WER 5-15 percentage points on heavily domain-specific traffic. Worth doing if vocabulary boosting and audio improvements have plateaued and you still need more accuracy.

5. Post-processing

After the ASR has done its best, fix the residue with deterministic logic:

  • Spell-correct against your customer database. If the recogniser produces a name not in your CRM, find the closest fuzzy match.
  • Number / date / phone formatting — turn "oh seven seven seven five..." into "07775..." with regex.
  • Domain term canonicalisation — "team connect" / "team-connect" / "team-connect dot co dot uk" all become the same canonical entity.
  • LLM-based correction — for the highest-stakes outputs, run the transcript through a small LLM with a prompt like "fix any obvious transcription errors based on context" before downstream use.
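
Two of the deterministic fixes above, sketched with the standard library. The name list, the 0.8 similarity cutoff and the three-digit minimum run are all illustrative assumptions; a real system would match against its own CRM and tune the thresholds on real transcripts.

```python
import difflib

SPOKEN_DIGITS = {
    "oh": "0", "zero": "0", "one": "1", "two": "2", "three": "3", "four": "4",
    "five": "5", "six": "6", "seven": "7", "eight": "8", "nine": "9",
}


def closest_known_name(heard: str, known_names: list, cutoff: float = 0.8):
    """Return the best fuzzy match from known_names, or None if nothing is close."""
    by_lower = {name.lower(): name for name in known_names}
    matches = difflib.get_close_matches(heard.lower(), list(by_lower), n=1, cutoff=cutoff)
    return by_lower[matches[0]] if matches else None


def digits_from_speech(text: str, min_run: int = 3) -> str:
    """Collapse runs of spoken digits into numerals; short runs stay as words."""
    out, run = [], []

    def flush() -> None:
        if len(run) >= min_run:
            out.append("".join(SPOKEN_DIGITS[w.lower()] for w in run))
        else:
            out.extend(run)              # e.g. "one question" is left alone
        run.clear()

    for word in text.split():
        if word.lower() in SPOKEN_DIGITS:
            run.append(word)
        else:
            flush()
            out.append(word)
    flush()
    return " ".join(out)


print(closest_known_name("Macclesfeld", ["Macclesfield", "Manchester"]))
print(digits_from_speech("call me on oh seven seven seven five please"))
```

The minimum run length is there so that ordinary uses of "one" or "oh" in a sentence are not rewritten as numerals.
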
Stop optimising at the right point: WER below 5% on your real traffic means you have probably hit the ceiling of what model improvements can give you. Beyond that point, accuracy gains come from application logic — better prompt design, retrieval grounding, conversation repair flows ("did you say Thursday or Tuesday?") — not from squeezing the recogniser further. Diminishing returns set in fast.

09 Voice Recognition in a Voice AI Pipeline

Voice recognition is one component in a five-stage real-time pipeline. To make voice AI feel conversational, every stage has to hit its latency budget and they all have to compose without queueing on each other. Here is the pipeline as it actually runs in production.

The five-stage pipeline

Stage | Latency budget | What it does | Failure mode
1. Capture & transport | < 50ms | Audio leaves user device, arrives at your edge | Network jitter, packet loss in WebRTC path
2. Voice Activity Detection | < 30ms | Decide whether speech is happening | Cuts off start of user speech, or wastes ASR on silence
3. Voice Recognition | 200-300ms | Audio → text, including endpointing | High WER, slow endpointing
4. LLM | 300-500ms | Text → response text | Cold-start latency, retrieval time
5. TTS | 100-200ms first-byte | Response text → audio | Slow first audio packet, robotic delivery

Total budget: under 1 second from "user stops speaking" to "user hears the AI starting to respond". Above 1.5 seconds the conversation feels broken; users start re-speaking, talking over the AI, or hanging up.
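
Adding up the per-stage budgets is a useful sanity check. The sketch below copies the ranges from the table and treats the "< N ms" rows as starting from 0 (an assumption); the dictionary keys and function names are illustrative.

```python
# Stage budgets in milliseconds as (low, high), from the table above.
# A lower bound of 0 for the "< N ms" rows is an assumption.
BUDGET_MS = {
    "capture_transport": (0, 50),
    "vad": (0, 30),
    "asr": (200, 300),
    "llm": (300, 500),
    "tts_first_byte": (100, 200),
}


def total_range(budget: dict) -> tuple:
    """Sum the low and high ends of every stage budget."""
    low = sum(lo for lo, _ in budget.values())
    high = sum(hi for _, hi in budget.values())
    return (low, high)


def stages_over_budget(measured_ms: dict, budget: dict) -> list:
    """Names of stages whose measured latency exceeds the top of their budget."""
    return [stage for stage, value in measured_ms.items() if value > budget[stage][1]]


low, high = total_range(BUDGET_MS)
print(f"pipeline budget: {low}-{high} ms")

# Hypothetical measurements from one call.
measured = {"capture_transport": 40, "vad": 20, "asr": 420, "llm": 450, "tts_first_byte": 150}
print(stages_over_budget(measured, BUDGET_MS))
```

The ceilings sum to 1080ms, slightly over the 1 second target, so the budget only holds if at least one stage runs below the top of its range.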

Where ASR setup choices affect the budget

  • Endpointing time — the silence threshold that triggers "user is done" is the largest controllable variable. 300ms is a good default; consider 200ms for short-question domains and 500ms for storytelling-style users.
  • Streaming vs batch — batch is a non-starter for real-time. Streaming is mandatory.
  • Provider geography — if your ASR provider's region is in US-East and your calls originate in the UK, you are adding 80-100ms of network round-trip time to every ASR exchange. Pick a provider region near your users.
  • Connection reuse — opening a fresh WebSocket per call adds 100-200ms of TLS+handshake. For high-volume systems, pool connections or use HTTP/2 streams that multiplex.
  • Pre-emption with interim results — you can sometimes start the LLM call on the interim transcript while the final is still being confirmed, hiding 100-200ms of latency. This requires careful design to avoid sending the LLM a half-formed sentence.

The "barge-in" problem

What happens when the user starts speaking while the AI is talking? Two options:

  • Barge-in enabled — the user's voice immediately interrupts AI playback, the AI stops mid-sentence and listens. Feels natural but requires careful echo cancellation so the AI doesn't trigger its own ASR with its own TTS output.
  • No barge-in — the AI finishes its sentence regardless. Simpler to build but feels rude in long-winded responses.

Production voice AI in 2026 is universally barge-in enabled. The technical requirement is full-duplex audio (you are sending TTS to the user AND streaming user audio to ASR simultaneously) plus very good acoustic echo cancellation to subtract the TTS signal from what the user's mic picks up. Most modern voice AI platforms (including Team-Connect's) handle this transparently — barge-in just works once the AEC is correctly configured.

The frequent-flier mistake: testing a voice AI pipeline by reading a script from text-to-speech back into your own ASR. Self-talk has clean audio, no background noise, predictable cadence and zero accents. Your real users have none of those things. Always validate latency and accuracy on real or near-real audio — phone calls from a noisy office, mobile-on-the-train, accented English — before declaring victory.

10 Common Voice Recognition Issues and Troubleshooting

Voice recognition failures cluster into about seven recurring patterns. Triage in this order.

WER is much higher than the vendor benchmarks claim

Almost always means the audio you are sending is not what the model was trained on. Check sample rate (this often uncovers an 8 kHz vs 16 kHz mismatch), channel count (stereo being sent to a mono model), and codec (MP3 capture is a silent killer). Then run a known-good test file from the vendor's own benchmark suite: if accuracy on that is fine and accuracy on yours is bad, the gap is your input.

Specific names or jargon constantly mistranscribed

Add them to keyword boost / phrase hints / custom vocabulary. This is the single most cost-effective accuracy fix — minutes of engineering work for several percentage points of WER on the words your users actually care about.

Latency spikes during normal traffic

Streaming ASR providers do not guarantee consistent latency under load. Investigate: are you hitting the provider's concurrent-connection or request-rate limit? Is your WebSocket pool exhausted? Is there geographic distance you can reduce by switching regions? Use percentile metrics (p50, p95, p99), not averages: voice AI users feel the p99 latency, not the mean.
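To illustrate why the mean hides the problem, here is a small sketch that computes nearest-rank percentiles over a list of latency samples. The sample data is invented for the example (98 fast turns and 2 slow ones); in practice the values would come from your own request logs.

```python
import math

def percentile(samples, pct):
    """Nearest-rank percentile: the smallest sample with at least pct% of values at or below it."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]

def latency_summary(samples_ms):
    """Report the mean alongside the percentiles users actually feel."""
    return {
        "mean": sum(samples_ms) / len(samples_ms),
        "p50": percentile(samples_ms, 50),
        "p95": percentile(samples_ms, 95),
        "p99": percentile(samples_ms, 99),
    }

# Hypothetical data: 98 turns at 200 ms and 2 turns at 2000 ms.
samples = [200] * 98 + [2000] * 2
summary = latency_summary(samples)
print(summary)
```

With this data the mean is 236 ms, which looks healthy, while the p99 is 2000 ms. An alert on the mean would never fire; an alert on p99 would.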

Accent failures (regional / non-native English)

Most providers ship default models trained on US English with secondary support for UK English. Switch to the locale-specific model (en-GB, en-IN, en-AU) where available; for non-native English, "international" or "world" models often work better than US-default. If your users are heavily accented and you hit a quality floor, Whisper-large is often the best off-the-shelf choice for accented English because of its broad training data.

Hallucinated transcripts during silence

Some models, Whisper especially, hallucinate plausible-sounding sentences when fed silence or background music. Mitigations: pre-filter audio with VAD so silence never reaches the recogniser; lower Whisper's no_speech_threshold (default 0.6) so that more segments are treated as silence; or post-filter the output using the model's own per-segment confidence scores (avg_logprob and no_speech_prob in Whisper's output).
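The post-filter option can be sketched as follows. It assumes segments shaped like the ones in Whisper's transcription result, each carrying `text`, `avg_logprob` and `no_speech_prob`. The cut-off values are assumptions borrowed from Whisper's decoding defaults, not tuned recommendations, and the function name is hypothetical.

```python
def filter_segments(segments, max_no_speech_prob=0.6, min_avg_logprob=-1.0):
    """Keep only segments the model considers likely speech and decoded with reasonable confidence."""
    kept = []
    for seg in segments:
        if seg["no_speech_prob"] > max_no_speech_prob:
            continue  # the model itself thinks this stretch is probably silence
        if seg["avg_logprob"] < min_avg_logprob:
            continue  # very low average token log-probability: likely a hallucination
        kept.append(seg)
    return kept

# Hand-written example segments, not real model output.
example = [
    {"text": "I'd like to cancel my booking.", "avg_logprob": -0.25, "no_speech_prob": 0.02},
    {"text": "Thanks for watching!", "avg_logprob": -0.40, "no_speech_prob": 0.91},
    {"text": "Please subscribe.", "avg_logprob": -1.60, "no_speech_prob": 0.30},
]
clean_text = " ".join(seg["text"] for seg in filter_segments(example))
```

This version drops a segment when either signal looks bad, which is stricter than filtering on both together. Tune the thresholds against a labelled sample of your own silent and noisy audio before relying on them.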

The transcript looks fine but the LLM acts confused

Two common causes. First, formatting: ASR may emit "ok ok hi how are you um yeah ok" which technically is what was said but is hard for an LLM to parse. Enable smart-formatting and disambiguation features (Deepgram's smart_format, Google's enable_automatic_punctuation). Second, partial transcripts: if you accidentally feed the LLM an interim transcript instead of a final, you may be sending half-finished sentences. Audit the boundary between ASR and LLM carefully.
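One way to make the ASR-to-LLM boundary explicit is a small gate that forwards only finalised transcripts. The event shape below (a dict with `transcript` and `is_final` keys) is an assumption for illustration; map it onto whatever fields your provider's streaming messages actually contain.

```python
def final_transcripts(events):
    """Yield only finalised, non-empty transcripts from a stream of ASR events."""
    for event in events:
        if not event.get("is_final", False):
            continue  # interim result: useful for UI or barge-in, not for the LLM
        text = event.get("transcript", "").strip()
        if text:
            yield text

# Hypothetical event stream for a single utterance.
events = [
    {"transcript": "can I", "is_final": False},
    {"transcript": "can I move my", "is_final": False},
    {"transcript": "Can I move my appointment to Friday?", "is_final": True},
    {"transcript": "", "is_final": True},
]
llm_inputs = list(final_transcripts(events))
```

Routing every transcript through a single function like this gives you one place to log exactly what the LLM received, which makes the audit much easier.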

Connection drops / silent failure during long calls

WebSocket connections to ASR providers are usually capped at 1-2 hours of continuous use, though the limit varies by provider and some cap a single streaming session at only a few minutes, so check the documented limit for yours. For longer calls (long support sessions, recorded webinar transcription) you need to handle reconnect cleanly: open a new socket before the old one expires, drain the old transcript, switch to the new socket. Most provider SDKs do this for you; raw WebSocket integrations have to implement it manually.
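The timing side of that reconnect can be sketched as a schedule: given a session limit and an overlap window, compute when each replacement socket should be opened. The function name and the 30-second overlap are assumptions for illustration; the actual socket handling depends on your provider and client library.

```python
def rotation_times(call_duration_s, max_session_s, overlap_s):
    """Return the offsets (seconds into the call) at which to open a replacement socket.

    Each new socket is opened overlap_s before the current one would expire,
    leaving time to drain the old transcript before switching over.
    """
    if not 0 <= overlap_s < max_session_s:
        raise ValueError("overlap must be non-negative and shorter than the session limit")
    step = max_session_s - overlap_s
    times = []
    t = step
    while t < call_duration_s:
        times.append(t)
        t += step
    return times

# Hypothetical 3-hour call against a 1-hour session limit with a 30-second overlap.
schedule = rotation_times(call_duration_s=3 * 3600, max_session_s=3600, overlap_s=30)
```

For this example the schedule is `[3570, 7140, 10710]`: each socket is replaced 30 seconds before it reaches the one-hour limit. A call shorter than the limit yields an empty schedule.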

Audit your transcripts regularly: set up a daily process that samples 10-50 random transcripts from real traffic, has a human spot-check them, and tracks observed WER over time. ASR quality silently drifts as your user base changes (new accents), as providers update their models (sometimes worse on edge cases), and as your traffic patterns shift (new device types, new locations). The teams that catch quality regressions early are the ones with this monitoring loop in place.

Voice Recognition Setup FAQs

The questions our voice AI customers ask most often when planning their speech recognition stack.

What is voice recognition?

Voice recognition (more precisely Automatic Speech Recognition, or ASR) is the technology that converts spoken audio into written text. Modern systems use neural networks — typically transformer-based encoder-decoder models — to map audio waveforms directly to text without separate acoustic and language model stages. ASR is distinct from voice biometrics (which identifies who is speaking) and from voice synthesis (text-to-speech, the reverse direction). For voice AI applications, ASR is the front door: every customer utterance has to pass through it before an LLM can understand or respond.

What is the best voice recognition provider in 2026?

There is no single best provider — the right choice depends on your use case. For real-time voice AI with sub-300ms latency, Deepgram and AssemblyAI lead. For maximum language coverage, Google Cloud Speech supports 125+ languages. For self-hosted or air-gapped deployments, OpenAI Whisper (open weights) is the default. For tight Microsoft 365 integration, Azure Speech. For AWS-native pipelines, Amazon Transcribe. Word Error Rates on clean English are within 1-2 percentage points across all leading providers; differences in latency, language support, customisation and pricing matter more in practice.

What is Word Error Rate (WER)?

Word Error Rate is the standard quality metric for speech recognition. It is (substitutions + deletions + insertions) divided by the total number of words in the reference transcript, expressed as a percentage. A WER of 5% means roughly 5 errors for every 100 words in the reference; because insertions count as errors, WER can exceed 100%. State-of-the-art systems achieve 4-7% WER on clean conversational English, 10-15% on accented or noisy audio, and 20-30% on heavily accented speech or low-resource languages. WER below 10% is generally good enough for voice AI; below 5% is excellent.
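The definition above can be sketched as a word-level edit distance. This is a minimal version that lowercases and splits on whitespace; real evaluations usually also normalise punctuation and number formatting first, which this sketch does not attempt.

```python
def word_error_rate(reference, hypothesis):
    """WER = (substitutions + deletions + insertions) / words in the reference."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference transcript is empty")
    # prev[j] holds the edit distance between the reference words seen so far and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion: reference word missing from hypothesis
                curr[j - 1] + 1,     # insertion: extra word in hypothesis
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1] / len(ref)

wer = word_error_rate("cancel my booking", "consoled my booking")
print(f"{wer:.1%}")
```

One substituted word out of three reference words gives a WER of one third. The same function can back a recurring transcript audit: score each human-corrected sample against the ASR output and track the average over time.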

What is the difference between streaming and batch speech recognition?

Streaming (real-time) recognition processes audio as it arrives, returning interim and final transcripts within hundreds of milliseconds. It is essential for voice AI conversations where the user expects an immediate reply. Batch recognition processes a complete audio file end-to-end, returning a single final transcript when done; it is faster overall, more accurate (the model can use full context), and cheaper per minute, but unsuitable for live interaction. Choose streaming for live calls and voice agents; choose batch for transcribing recorded calls, podcasts, meetings or video files.

What audio format is best for speech recognition?

For maximum accuracy, capture in WAV (uncompressed PCM) at 16 kHz sample rate, 16-bit, mono. For real-time streaming where bandwidth matters, Opus at 24-32 kbps is the modern best choice: it gives accuracy within 1.5% of WAV at a fraction of the bitrate. Avoid MP3 if accuracy matters; it discards spectral detail that ASR models rely on. Telephone audio at 8 kHz works but caps accuracy because the model loses high-frequency information. For deeper detail on codec choice, see our audio codecs comparison.
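To put a number on the bandwidth difference, this short sketch computes the bitrate of uncompressed PCM at the recommended settings and compares it with the Opus range quoted above. The helper name is hypothetical, and container overhead is ignored.

```python
def pcm_bitrate_kbps(sample_rate_hz, bits_per_sample, channels):
    """Raw PCM bitrate in kilobits per second (1 kbps = 1000 bits/s)."""
    return sample_rate_hz * bits_per_sample * channels / 1000

wav_kbps = pcm_bitrate_kbps(16000, 16, 1)  # 256.0 kbps for 16 kHz, 16-bit, mono

for opus_kbps in (24, 32):
    share = opus_kbps / wav_kbps
    print(f"Opus at {opus_kbps} kbps uses {share:.1%} of the PCM bandwidth")
```

At 16 kHz, 16-bit, mono, PCM needs 256 kbps, so Opus at 24-32 kbps is roughly an eight- to eleven-fold reduction.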

How do I improve speech recognition accuracy?

Five techniques, in rough order of impact: (1) Capture better audio: 16 kHz minimum, mono, low background noise. (2) Provide a custom vocabulary or keyword boost listing domain-specific terms (product names, place names, technical jargon). (3) Choose a domain-tuned model if your provider offers one (medical, legal, contact-centre variants exist). (4) For supported providers, fine-tune on your own labelled audio. (5) Add post-processing: spell-correction against your customer database, regex normalisation of numbers and times. The first two account for 80% of realistic accuracy gains.

What latency should I expect from real-time speech recognition?

Streaming ASR providers in 2026 deliver interim results within 100-300ms and finalised results within 200-500ms after the user stops speaking. For a conversational voice AI, target an end-to-end response (user finishes speaking to TTS audio playing back) under 1 second; this gives ASR a budget of roughly 200-300ms, the LLM 300-500ms, and TTS 100-200ms. Anything over 1.5 seconds feels broken to users. Endpointing — the decision of when the user has finished speaking — is the largest controllable variable.
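The budget arithmetic above can be checked with a few lines. The per-stage ranges are the ones quoted in this answer; the measured values are invented for the example, and the helper names are hypothetical.

```python
# Per-stage budget in milliseconds as (low, high), taken from the figures above.
BUDGET_MS = {"asr": (200, 300), "llm": (300, 500), "tts": (100, 200)}

def total_range(budget):
    """Sum the low and high ends of every stage to get the end-to-end range."""
    low = sum(lo for lo, _ in budget.values())
    high = sum(hi for _, hi in budget.values())
    return low, high

def over_budget(measured_ms, budget=BUDGET_MS):
    """Return the stages whose measured latency exceeds the top of their budget."""
    return {stage: ms for stage, ms in measured_ms.items() if ms > budget[stage][1]}

end_to_end = total_range(BUDGET_MS)  # (600, 1000)

# Hypothetical measurements for one conversational turn.
measured = {"asr": 280, "llm": 650, "tts": 150}
offenders = over_budget(measured)
```

The stage budgets sum to 600-1000 ms, which is how the one-second target is reached. In the invented measurement only the LLM stage is over budget, so that is where the tuning effort should go first.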

Can I run voice recognition without sending audio to the cloud?

Yes. OpenAI Whisper has open weights and runs on a local GPU or even on CPU (slower). Whisper-large-v3 needs roughly 10 GB of VRAM for real-time inference; Whisper-tiny runs on a Raspberry Pi at acceptable accuracy. Other self-hostable options include NVIDIA NeMo and Riva, and Vosk. Self-hosting trades operational complexity for data sovereignty: it makes sense for healthcare, legal, government and regulated finance use cases where audio cannot leave your infrastructure, but for most consumer voice AI a managed provider is dramatically simpler and competitive on cost up to high volumes.

Continue Reading

Voice recognition is one stage of the voice AI pipeline. To go deeper into the others:

  • Audio Codecs Explained
  • TTS Configuration
  • SIP Protocol Basics
  • WebRTC Integration
  • µ-law Encoding
  • AI Receptionist