🎵 Audio Technology Deep Dive

Audio Codecs Explained

Master audio codecs for voice AI applications. Complete guide covering compression, quality, performance, and implementation for speech recognition and synthesis systems.

📚
Audio Codec Fundamentals

What Are Audio Codecs?

An audio codec (coder-decoder) is software or hardware that encodes and decodes digital audio data. Most codecs compress audio to save storage space and bandwidth while keeping quality acceptable for the intended use.

🧠 Key Codec Concepts
  • Compression Ratio: How much the file size is reduced
  • Bitrate: Amount of data processed per second (kbps)
  • Sample Rate: Number of audio samples per second (Hz)
  • Bit Depth: Number of bits per sample (8, 16, 24, 32-bit)
  • Latency: Delay introduced by encoding/decoding process
  • Quality: How closely the output matches the original
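The relationship between sample rate, bit depth, bitrate, and compression ratio can be checked with a little arithmetic. The sketch below assumes 1 kbps = 1000 bits per second; the function names are illustrative and not part of any library.

```python
def pcm_bitrate_kbps(sample_rate_hz: int, bit_depth: int, channels: int) -> float:
    """Bitrate of uncompressed PCM in kbps (1 kbps = 1000 bit/s)."""
    return sample_rate_hz * bit_depth * channels / 1000


def compression_ratio(original_kbps: float, encoded_kbps: float) -> float:
    """How many times smaller the encoded stream is than the original."""
    return original_kbps / encoded_kbps


cd_kbps = pcm_bitrate_kbps(44_100, 16, 2)      # CD audio: 44.1 kHz, 16-bit, stereo
speech_kbps = pcm_bitrate_kbps(16_000, 16, 1)  # common speech-recognition input

print(cd_kbps)                                    # 1411.2
print(speech_kbps)                                # 256.0
print(round(compression_ratio(cd_kbps, 128), 1))  # 11.0
```

A 128 kbps lossy stream of CD audio is therefore about an 11:1 reduction, which is where the commonly quoted "10:1" figures come from.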

Lossless vs. Lossy Compression

Aspect | Lossless | Lossy
Quality | Perfect reconstruction | Some quality loss
File Size | Larger (about 1.5:1 to 3:1 compression, content-dependent) | Much smaller (10:1 to 20:1)
Use Cases | Archival, professional audio | Streaming, mobile, storage
Processing Power | Low to moderate | Moderate to high
Examples | FLAC, ALAC (WAV usually holds uncompressed PCM) | MP3, AAC, Opus, Vorbis

Codec Selection for Voice AI

When choosing codecs for voice AI applications, consider these factors:

⚖️ Selection Criteria
  • Recognition Accuracy: Heavy lossy compression and narrowband (8 kHz) audio can reduce speech recognition accuracy
  • Latency Requirements: Real-time applications need low-latency codecs
  • Bandwidth Constraints: Mobile/streaming apps may need high compression
  • Device Compatibility: Ensure broad device support
  • Processing Power: Consider encoding/decoding CPU requirements
  • License Costs: Some codecs require licensing fees

🔄
Lossless Audio Codecs

Perfect Quality Preservation

Lossless codecs compress audio without any quality loss: the decoded audio is bit-for-bit identical to the original. This makes them well suited to archival and professional applications.
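"Bit-for-bit identical" is a property that can be tested directly: compress, decompress, and compare the bytes. The sketch below uses the standard library's zlib as a stand-in for a real audio codec (an assumption made so the example runs without extra dependencies; FLAC uses audio-specific prediction and typically compresses real recordings far better than a general-purpose compressor).

```python
import array
import math
import zlib

sample_rate = 16_000

# One second of a 440 Hz tone as 16-bit signed PCM samples
samples = array.array(
    "h",
    (int(12_000 * math.sin(2 * math.pi * 440 * n / sample_rate))
     for n in range(sample_rate)),
)
raw = samples.tobytes()

packed = zlib.compress(raw, level=9)  # lossless compression (stand-in for FLAC)
restored = zlib.decompress(packed)

# The defining property of lossless coding: the round trip changes nothing
print(f"{len(raw)} bytes -> {len(packed)} bytes, identical: {restored == raw}")
```

The same check applies to a real codec: decode the FLAC file back to PCM and compare the samples (or a hash of them) against the source.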

WAV (PCM)

Uncompressed / Lossless Container
1411 kbps (CD Quality)
1:1 Compression
  • Universal compatibility
  • No compression artifacts
  • Professional standard
  • Supports various bit depths
  • Real-time processing
Professional Recording · Reference Audio · Voice AI Training · Large File Size

FLAC

Free Lossless Audio Codec
700-900 kbps Average
2:1 Compression
  • Open source and royalty-free
  • Excellent compression efficiency
  • Supports up to 32 bits per sample and sample rates well beyond 192 kHz
  • Metadata support
  • Error detection
Audio Archival · High-Quality Storage · Voice Analysis · Open Source

ALAC

Apple Lossless Audio Codec
600-800 kbps Average
About 2:1 Compression
  • Apple ecosystem integration
  • Good compression ratio
  • Fast encoding/decoding
  • iTunes/Apple Music support
  • Hardware acceleration on Apple devices
Apple Devices · iTunes Library · Primarily Apple Ecosystem
Working with Lossless Codecs in Python
import soundfile as sf
import librosa
import numpy as np
from pathlib import Path


class LosslessAudioHandler:
    def __init__(self):
        self.supported_formats = {
            'wav': {'extension': '.wav', 'codec': 'PCM'},
            'flac': {'extension': '.flac', 'codec': 'FLAC'},
            'aiff': {'extension': '.aiff', 'codec': 'PCM'}
        }

    def load_audio(self, file_path, target_sr=None):
        """Load audio file with format detection"""
        try:
            # Load audio with librosa (handles most formats, returns float32)
            audio_data, sample_rate = librosa.load(file_path, sr=target_sr, mono=False)

            # Get file info
            info = sf.info(file_path)

            return {
                'audio': audio_data,
                'sample_rate': sample_rate,
                'channels': info.channels,
                'frames': info.frames,
                'duration': info.duration,
                'format': info.format,
                'subtype': info.subtype
            }
        except Exception as e:
            return {'error': str(e)}

    def convert_to_wav(self, input_file, output_file, sample_rate=16000, bit_depth=16):
        """Convert any audio format to WAV"""
        try:
            # Load the audio
            audio_data, sr = librosa.load(input_file, sr=None, mono=False)

            # Resample if needed (resampling is not a lossless operation)
            if sr != sample_rate:
                audio_data = librosa.resample(audio_data, orig_sr=sr, target_sr=sample_rate)

            # Determine subtype based on bit depth
            subtype_map = {16: 'PCM_16', 24: 'PCM_24', 32: 'PCM_32'}
            subtype = subtype_map.get(bit_depth, 'PCM_16')

            # Save as WAV (soundfile expects frames x channels)
            sf.write(output_file,
                     audio_data.T if audio_data.ndim > 1 else audio_data,
                     sample_rate, subtype=subtype)

            return {'success': True, 'output': output_file}
        except Exception as e:
            return {'error': str(e)}

    def convert_to_flac(self, input_file, output_file, compression_level=5):
        """Convert audio to FLAC format without altering sample values"""
        try:
            # Read integer samples directly so no float conversion touches the data
            info = sf.info(input_file)
            audio_data, sample_rate = sf.read(input_file, dtype='int32', always_2d=True)

            # Keep the source bit depth where FLAC supports it
            subtype = info.subtype if info.subtype in ('PCM_16', 'PCM_24') else 'PCM_16'

            # NOTE: compression_level is not applied here; this call uses
            # libsndfile's default FLAC compression setting.
            sf.write(output_file, audio_data, sample_rate, format='FLAC', subtype=subtype)

            return {'success': True, 'output': output_file}
        except Exception as e:
            return {'error': str(e)}

    def analyze_quality(self, original_file, processed_file):
        """Compare original and processed audio quality"""
        try:
            # Load both files
            orig_audio, orig_sr = librosa.load(original_file, sr=None)
            proc_audio, proc_sr = librosa.load(processed_file, sr=None)

            # Ensure same sample rate
            if orig_sr != proc_sr:
                proc_audio = librosa.resample(proc_audio, orig_sr=proc_sr, target_sr=orig_sr)

            # Ensure same length
            min_length = min(len(orig_audio), len(proc_audio))
            orig_audio = orig_audio[:min_length]
            proc_audio = proc_audio[:min_length]

            # Calculate metrics
            mse = np.mean((orig_audio - proc_audio) ** 2)

            # Signal-to-Noise Ratio
            signal_power = np.mean(orig_audio ** 2)
            noise_power = mse
            snr = 10 * np.log10(signal_power / (noise_power + 1e-10))

            # Peak Signal-to-Noise Ratio
            max_possible = np.max(np.abs(orig_audio)) ** 2
            psnr = 10 * np.log10(max_possible / (mse + 1e-10))

            return {
                'mse': float(mse),
                'snr_db': float(snr),
                'psnr_db': float(psnr),
                'identical': bool(mse < 1e-10)  # Essentially zero difference
            }
        except Exception as e:
            return {'error': str(e)}

    def get_file_info(self, file_path):
        """Get detailed information about audio file"""
        try:
            info = sf.info(file_path)
            file_size = Path(file_path).stat().st_size

            # Calculate average bitrate from file size
            duration = info.duration
            bitrate = (file_size * 8) / (duration * 1000) if duration > 0 else 0  # kbps

            return {
                'filename': Path(file_path).name,
                'format': info.format,
                'subtype': info.subtype,
                'channels': info.channels,
                'sample_rate': info.samplerate,
                'frames': info.frames,
                'duration': duration,
                'file_size_mb': file_size / (1024 * 1024),
                'bitrate_kbps': bitrate,
                'bit_depth': self._get_bit_depth(info.subtype)
            }
        except Exception as e:
            return {'error': str(e)}

    def _get_bit_depth(self, subtype):
        """Extract bit depth from subtype"""
        if '16' in subtype:
            return 16
        elif '24' in subtype:
            return 24
        elif '32' in subtype:
            return 32
        elif '8' in subtype:
            return 8
        else:
            return 'Unknown'


# Usage examples
if __name__ == "__main__":
    handler = LosslessAudioHandler()

    # Load and analyze audio
    audio_info = handler.load_audio("input.wav")
    print("Audio Info:", audio_info)

    # Convert to different lossless formats
    wav_result = handler.convert_to_wav("input.mp3", "output.wav", sample_rate=44100)
    print("WAV Conversion:", wav_result)

    flac_result = handler.convert_to_flac("input.wav", "output.flac")
    print("FLAC Conversion:", flac_result)

    # Analyze quality (should be identical for lossless)
    quality = handler.analyze_quality("original.wav", "converted.flac")
    print("Quality Analysis:", quality)

    # Get detailed file information
    file_info = handler.get_file_info("audio.flac")
    print("File Info:", file_info)

📉
Lossy Audio Codecs

Efficient Compression with Quality Tradeoffs

Lossy codecs achieve high compression ratios by removing audio information that's considered less important to human perception. Modern lossy codecs provide excellent quality at reasonable bitrates.

MP3 (MPEG-1 Layer 3)

Lossy / Legacy Standard
128-320 kbps Range
10:1 Compression
  • Universal compatibility
  • Mature, stable format
  • Hardware support everywhere
  • Low CPU requirements
  • Predictable file sizes
Legacy Systems · Broad Compatibility · Outdated Quality · Patents Expired (2017)

AAC

Advanced Audio Coding
128-256 kbps Optimal
12:1 Compression
  • Better quality than MP3
  • Efficient at low bitrates
  • Multiple variants (LC, HE, etc.)
  • Apple ecosystem standard
  • Streaming optimized
Streaming Audio · Mobile Apps · Apple Devices · YouTube/Spotify

Ogg Vorbis

Open Source Lossy
128-500 kbps Range
Variable Bitrate (VBR Optimized)
  • Completely open source
  • No licensing fees
  • Better quality than MP3
  • Variable bitrate support
  • Supports up to 255 channels
Open Source Projects · Gaming Audio · Spotify · Limited Mobile Support

Modern Lossy Codec Comparison

Codec | Quality (128kbps) | Compression Efficiency | Latency | CPU Usage | License
MP3 | Good | ⭐⭐⭐ | Low | Very Low | Patents expired (2017)
AAC-LC | Very Good | ⭐⭐⭐⭐ | Low | Low | Patent encumbered
Opus | Excellent | ⭐⭐⭐⭐⭐ | Ultra Low | Medium | Royalty-free
Ogg Vorbis | Very Good | ⭐⭐⭐⭐ | Medium | Medium | Royalty-free, open source
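For constant-bitrate encoding, the output size follows directly from bitrate and duration, which is useful for sanity-checking converter output. A minimal sketch, assuming 1 kbps = 1000 bits per second and ignoring container overhead and metadata (the function name is illustrative):

```python
def estimated_size_bytes(bitrate_kbps: float, duration_s: float) -> int:
    """Approximate stream size: bits per second / 8 * seconds."""
    return round(bitrate_kbps * 1000 / 8 * duration_s)


# One minute of audio at a few representative bitrates
for name, kbps in [("PCM 16-bit/44.1 kHz stereo", 1411.2), ("MP3", 128), ("Opus", 64)]:
    size_mb = estimated_size_bytes(kbps, 60) / 1_000_000
    print(f"{name}: {size_mb:.2f} MB per minute")
```

Variable-bitrate files will deviate from this estimate, so treat it as a rough check rather than an exact prediction.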
Lossy Codec Implementation with FFmpeg
import subprocess
import os
import json
from pathlib import Path


class LossyCodecConverter:
    def __init__(self):
        self.ffmpeg_path = "ffmpeg"  # Assumes ffmpeg is in PATH

        # Codec configurations for different use cases
        self.presets = {
            'voice_low': {
                'mp3': ['-c:a', 'libmp3lame', '-b:a', '64k', '-ar', '16000', '-ac', '1'],
                'aac': ['-c:a', 'aac', '-b:a', '64k', '-ar', '16000', '-ac', '1'],
                'opus': ['-c:a', 'libopus', '-b:a', '32k', '-ar', '16000', '-ac', '1']
            },
            'voice_high': {
                'mp3': ['-c:a', 'libmp3lame', '-b:a', '128k', '-ar', '22050', '-ac', '1'],
                'aac': ['-c:a', 'aac', '-b:a', '96k', '-ar', '22050', '-ac', '1'],
                'opus': ['-c:a', 'libopus', '-b:a', '64k', '-ar', '24000', '-ac', '1']
            },
            'music_standard': {
                'mp3': ['-c:a', 'libmp3lame', '-b:a', '192k', '-ar', '44100'],
                'aac': ['-c:a', 'aac', '-b:a', '128k', '-ar', '44100'],
                'opus': ['-c:a', 'libopus', '-b:a', '128k', '-ar', '48000']
            },
            'music_high': {
                'mp3': ['-c:a', 'libmp3lame', '-b:a', '320k', '-ar', '44100'],
                'aac': ['-c:a', 'aac', '-b:a', '256k', '-ar', '44100'],
                'opus': ['-c:a', 'libopus', '-b:a', '192k', '-ar', '48000']
            }
        }

        # File extension used for each codec's output
        self.extensions = {'mp3': '.mp3', 'aac': '.m4a', 'opus': '.opus'}

    def convert(self, input_file, output_file, codec, preset='voice_high', custom_args=None):
        """Convert audio using lossy codec"""
        try:
            if custom_args:
                codec_args = custom_args
            else:
                codec_args = self.presets.get(preset, {}).get(codec, [])

            if not codec_args:
                return {'error': f'No preset found for {codec} with {preset}'}

            # Build FFmpeg command
            cmd = [
                self.ffmpeg_path,
                '-i', input_file,
                '-y',  # Overwrite output file
                *codec_args,
                output_file
            ]

            # Execute conversion
            subprocess.run(cmd, capture_output=True, text=True, check=True)

            return {
                'success': True,
                'output_file': output_file,
                'command': ' '.join(cmd)
            }
        except subprocess.CalledProcessError as e:
            return {
                'error': f'FFmpeg error: {e.stderr}',
                'command': ' '.join(cmd)
            }
        except Exception as e:
            return {'error': str(e)}

    def batch_convert(self, input_file, output_dir, codecs=('mp3', 'aac', 'opus'),
                      preset='voice_high'):
        """Convert to multiple formats for comparison"""
        results = {}
        output_path = Path(output_dir)
        output_path.mkdir(parents=True, exist_ok=True)
        input_stem = Path(input_file).stem

        for codec in codecs:
            ext = self.extensions.get(codec, f'.{codec}')
            output_file = output_path / f"{input_stem}{ext}"

            result = self.convert(input_file, str(output_file), codec, preset)
            results[codec] = result

            if result.get('success'):
                # Add file size info
                file_size = os.path.getsize(output_file)
                results[codec]['file_size_mb'] = file_size / (1024 * 1024)

        return results

    def quality_test(self, input_file, output_dir, bitrates=(64, 128, 192, 256, 320)):
        """Encode at different bitrates so the results can be compared"""
        results = {}
        output_path = Path(output_dir)
        output_path.mkdir(parents=True, exist_ok=True)
        input_stem = Path(input_file).stem

        for bitrate in bitrates:
            for codec in ['mp3', 'aac', 'opus']:
                # Custom arguments for specific bitrate
                if codec == 'mp3':
                    args = ['-c:a', 'libmp3lame', '-b:a', f'{bitrate}k']
                elif codec == 'aac':
                    args = ['-c:a', 'aac', '-b:a', f'{bitrate}k']
                else:
                    # Opus tops out at about 510 kbps
                    opus_bitrate = min(bitrate, 510)
                    args = ['-c:a', 'libopus', '-b:a', f'{opus_bitrate}k']
                ext = self.extensions[codec]

                output_file = output_path / f"{input_stem}_{codec}_{bitrate}k{ext}"
                result = self.convert(input_file, str(output_file), codec, custom_args=args)

                if result.get('success'):
                    file_size = os.path.getsize(output_file)
                    results[f"{codec}_{bitrate}k"] = {
                        'codec': codec,
                        'bitrate': bitrate,
                        'file_size_mb': file_size / (1024 * 1024),
                        'output_file': str(output_file)
                    }

        return results

    def analyze_compression(self, original_file, compressed_files):
        """Analyze compression efficiency (file size and bitrate)"""
        try:
            # Get original file info
            original_size = os.path.getsize(original_file)
            original_info = self._get_audio_info(original_file)

            analysis = {
                'original': {
                    'file_size_mb': original_size / (1024 * 1024),
                    'duration': original_info.get('duration', 0),
                    'bitrate_kbps': original_info.get('bit_rate', 0) / 1000
                },
                'compressed': {}
            }

            for name, file_path in compressed_files.items():
                if os.path.exists(file_path):
                    compressed_size = os.path.getsize(file_path)
                    compressed_info = self._get_audio_info(file_path)

                    compression_ratio = original_size / compressed_size
                    space_saving = ((original_size - compressed_size) / original_size) * 100

                    analysis['compressed'][name] = {
                        'file_size_mb': compressed_size / (1024 * 1024),
                        'compression_ratio': round(compression_ratio, 2),
                        'space_saving_percent': round(space_saving, 1),
                        'bitrate_kbps': compressed_info.get('bit_rate', 0) / 1000
                    }

            return analysis
        except Exception as e:
            return {'error': str(e)}

    def _get_audio_info(self, file_path):
        """Get audio file information using ffprobe"""
        try:
            cmd = [
                'ffprobe', '-v', 'quiet',
                '-print_format', 'json',
                '-show_format', '-show_streams',
                file_path
            ]
            result = subprocess.run(cmd, capture_output=True, text=True, check=True)
            data = json.loads(result.stdout)
            fmt = data.get('format', {})

            # Extract audio stream info
            audio_stream = None
            for stream in data.get('streams', []):
                if stream.get('codec_type') == 'audio':
                    audio_stream = stream
                    break

            if audio_stream:
                # Some containers report the bitrate only at the format level
                bit_rate = audio_stream.get('bit_rate') or fmt.get('bit_rate') or 0
                return {
                    'codec': audio_stream.get('codec_name'),
                    'sample_rate': int(audio_stream.get('sample_rate', 0)),
                    'channels': audio_stream.get('channels', 1),
                    'bit_rate': int(bit_rate),
                    'duration': float(fmt.get('duration', 0))
                }
            return {}
        except Exception as e:
            return {'error': str(e)}


# Usage example
if __name__ == "__main__":
    converter = LossyCodecConverter()

    # Convert to multiple formats
    results = converter.batch_convert(
        'input.wav', 'output/',
        codecs=['mp3', 'aac', 'opus'],
        preset='voice_high'
    )

    print("Batch conversion results:")
    for codec, result in results.items():
        if result.get('success'):
            print(f"  {codec}: {result['file_size_mb']:.2f} MB")
        else:
            print(f"  {codec}: Failed - {result.get('error')}")

    # Quality test at different bitrates
    quality_results = converter.quality_test('input.wav', 'quality_test/')
    print(f"\nQuality test completed: {len(quality_results)} files generated")

    # Analyze compression using the files written by quality_test()
    compressed_files = {
        'mp3_128k': 'quality_test/input_mp3_128k.mp3',
        'aac_128k': 'quality_test/input_aac_128k.m4a',
        'opus_128k': 'quality_test/input_opus_128k.opus'
    }
    analysis = converter.analyze_compression('input.wav', compressed_files)
    print(f"\nCompression Analysis: {analysis}")

🎙️
Voice-Optimized Codecs

Specialized for Speech Applications

Voice-optimized codecs are specifically designed for speech rather than music. They achieve excellent compression for voice while maintaining intelligibility and compatibility with telephony systems.

G.711 (μ-law/A-law)

ITU-T Standard / Telephony
64 kbps Fixed
8 kHz Sample Rate
  • Universal telephony support
  • Very low latency
  • Hardware implementations
  • PSTN compatibility
  • Simple encoding/decoding
VoIP Systems · PBX Integration · Legacy Support · Narrowband Audio
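G.711's 64 kbps comes from 8,000 samples per second at 8 bits each, and its "simple encoding" is logarithmic companding: quiet samples get finer quantization steps than loud ones. The sketch below implements the continuous μ-law formula with μ = 255. It is an illustration of the idea, not the bit-exact segmented encoding that the G.711 standard specifies, and the function names are made up for this example.

```python
import math

MU = 255  # companding constant used by mu-law


def mulaw_encode(x: float) -> int:
    """Compress a sample in [-1, 1] to an 8-bit code (0..255)."""
    x = max(-1.0, min(1.0, x))
    y = math.copysign(math.log1p(MU * abs(x)) / math.log1p(MU), x)
    return int(round((y + 1) / 2 * 255))


def mulaw_decode(code: int) -> float:
    """Expand an 8-bit code back to an approximate sample in [-1, 1]."""
    y = code / 255 * 2 - 1
    return math.copysign(math.expm1(abs(y) * math.log1p(MU)) / MU, y)


for sample in (0.01, 0.1, 0.5, 1.0):
    restored = mulaw_decode(mulaw_encode(sample))
    print(f"{sample:>5} -> code {mulaw_encode(sample):3d} -> {restored:.4f}")

print("bitrate:", 8000 * 8 // 1000, "kbps")
```

Quiet samples such as 0.01 survive the round trip with a much smaller absolute error than a uniform 8-bit quantizer (step size 2/255) would give, which is why 8 bits are enough for intelligible telephone speech.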

G.729

Low Bitrate Voice Codec
8 kbps
8:1 vs G.711
  • Excellent compression for voice
  • Bandwidth efficient
  • Good voice quality
  • Widely supported in VoIP
  • Frame-based processing
Satellite Links · Low Bandwidth · Mobile Networks · Patents Expired (2017)

AMR-NB/WB

Adaptive Multi-Rate
4.75-12.2 kbps (NB), 6.6-23.85 kbps (WB)
Adaptive Bitrate
  • Adaptive bitrate based on conditions
  • Excellent for mobile networks
  • Error resilience
  • Wideband version available
  • 3GPP standard
Mobile Voice · GSM/UMTS · VoLTE · Poor Networks
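Adaptive multi-rate means the encoder switches between a fixed set of bitrate modes as conditions change. The mode tables below are the standard AMR-NB and AMR-WB rates; the selection policy is a deliberately simple assumption (pick the highest mode that fits a bandwidth budget). In real networks the mode is driven by link-quality signalling, not by a helper like this.

```python
AMR_NB_MODES_KBPS = (4.75, 5.15, 5.90, 6.70, 7.40, 7.95, 10.2, 12.2)
AMR_WB_MODES_KBPS = (6.60, 8.85, 12.65, 14.25, 15.85, 18.25, 19.85, 23.05, 23.85)


def pick_mode(available_kbps: float, modes: tuple) -> float:
    """Hypothetical policy: highest mode within budget, else the lowest mode."""
    candidates = [m for m in modes if m <= available_kbps]
    return max(candidates) if candidates else min(modes)


for budget in (3.0, 8.0, 13.0, 30.0):
    print(f"{budget:>5} kbps available -> "
          f"NB {pick_mode(budget, AMR_NB_MODES_KBPS)} kbps, "
          f"WB {pick_mode(budget, AMR_WB_MODES_KBPS)} kbps")
```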

Skype SILK

VoIP Optimized
6-40 kbps Range
Variable Complexity
  • Optimized for internet voice
  • Packet loss resilience
  • Low delay operation
  • Wideband and super-wideband
  • Used in Opus codec
Skype Calls · VoIP Applications · Part of Opus · Real-time Communication

Voice Codec Selection Guide

🎯 Choose Based on Your Needs
  • Traditional Telephony: G.711 (μ-law in North America and Japan, A-law in Europe and most other regions)
  • Bandwidth Limited: G.729 or AMR-NB for maximum compression
  • Mobile Applications: AMR-NB/WB for adaptive quality
  • VoIP/Internet: Opus (includes SILK) for best quality and efficiency
  • Speech Recognition: Prefer lossless or high-bitrate codecs
  • Real-time Communication: Low-latency codecs (G.711, Opus)
Voice Codec Processing for Speech Recognition
import numpy as np
import librosa
import soundfile as sf
from scipy import signal
import webrtcvad


class VoiceCodecProcessor:
    def __init__(self):
        self.vad = webrtcvad.Vad(2)  # Voice Activity Detection, mode 2 (moderate)

        # Codec-specific preprocessing settings
        self.codec_settings = {
            'g711_mulaw': {
                'sample_rate': 8000,
                'bit_depth': 8,
                'preprocess': 'telephone_band'
            },
            'g729': {
                'sample_rate': 8000,
                'frame_size': 80,  # 10ms frames
                'preprocess': 'speech_enhancement'
            },
            'amr_nb': {
                'sample_rate': 8000,
                'adaptive_rate': True,
                'preprocess': 'noise_reduction'
            },
            'opus_voice': {
                'sample_rate': 16000,
                'frame_size': 320,  # 20ms frames at 16kHz
                'preprocess': 'minimal'
            }
        }

    def optimize_for_speech_recognition(self, audio_data, sample_rate, target_codec='opus_voice'):
        """Optimize audio for speech recognition based on codec characteristics"""
        # Get codec settings
        codec_config = self.codec_settings.get(target_codec, self.codec_settings['opus_voice'])
        target_sr = codec_config['sample_rate']

        # Resample if needed
        if sample_rate != target_sr:
            audio_data = librosa.resample(audio_data, orig_sr=sample_rate, target_sr=target_sr)
            sample_rate = target_sr

        # Apply preprocessing based on codec
        preprocess_type = codec_config['preprocess']
        if preprocess_type == 'telephone_band':
            # Bandpass filter for telephone frequency range (300-3400 Hz)
            audio_data = self._apply_telephone_filter(audio_data, sample_rate)
        elif preprocess_type == 'speech_enhancement':
            # Enhanced preprocessing for low-bitrate codecs
            audio_data = self._speech_enhancement(audio_data, sample_rate)
        elif preprocess_type == 'noise_reduction':
            # Noise reduction suitable for adaptive codecs
            audio_data = self._noise_reduction(audio_data, sample_rate)

        # Voice Activity Detection and silence removal
        audio_data = self._remove_silence(audio_data, sample_rate)

        # Normalize audio level
        audio_data = self._normalize_audio(audio_data)

        return audio_data, sample_rate

    def _apply_telephone_filter(self, audio_data, sample_rate):
        """Apply telephone bandpass filter (300-3400 Hz)"""
        nyquist = sample_rate / 2
        low = 300 / nyquist
        high = 3400 / nyquist
        b, a = signal.butter(4, [low, high], btype='band')
        return signal.filtfilt(b, a, audio_data)

    def _speech_enhancement(self, audio_data, sample_rate):
        """Enhanced preprocessing for speech clarity"""
        # High-pass filter to remove low-frequency noise
        nyquist = sample_rate / 2
        high_pass_freq = 80 / nyquist
        b, a = signal.butter(2, high_pass_freq, btype='high')
        audio_data = signal.filtfilt(b, a, audio_data)

        # Gentle compression to even out dynamics
        audio_data = self._soft_compression(audio_data)

        # Pre-emphasis filter (boosts high frequencies, common in speech processing)
        # H(z) = 1 - 0.95 * z^(-1)
        audio_data = signal.lfilter([1, -0.95], [1], audio_data)

        return audio_data

    def _noise_reduction(self, audio_data, sample_rate):
        """Basic noise reduction using spectral subtraction"""
        # Frame-based processing
        frame_length = 2048
        hop_length = 512

        # STFT
        D = librosa.stft(audio_data, n_fft=frame_length, hop_length=hop_length)
        magnitude = np.abs(D)
        phase = np.angle(D)

        # Estimate the noise spectrum from the first 0.5 seconds
        # (assumed to be silence/noise), averaged per frequency bin
        noise_duration = min(int(0.5 * sample_rate), len(audio_data) // 4)
        noise_frames = max(1, noise_duration // hop_length)
        noise_spectrum = np.mean(magnitude[:, :noise_frames], axis=1, keepdims=True)

        # Spectral subtraction
        alpha = 2.0  # Over-subtraction factor
        clean_magnitude = magnitude - alpha * noise_spectrum

        # Keep a spectral floor so bins never go negative
        clean_magnitude = np.maximum(clean_magnitude, 0.1 * magnitude)

        # Reconstruct signal at the original length
        clean_D = clean_magnitude * np.exp(1j * phase)
        clean_audio = librosa.istft(clean_D, hop_length=hop_length, length=len(audio_data))

        return clean_audio

    def _soft_compression(self, audio_data, threshold=0.5, ratio=3.0):
        """Apply simple compression to samples above the threshold"""
        abs_audio = np.abs(audio_data)

        # Find samples above threshold
        above_threshold = abs_audio > threshold

        # Apply compression
        compressed = audio_data.copy()
        compressed[above_threshold] = np.sign(audio_data[above_threshold]) * (
            threshold + (abs_audio[above_threshold] - threshold) / ratio
        )

        return compressed

    def _remove_silence(self, audio_data, sample_rate, frame_duration_ms=30):
        """Remove silence using Voice Activity Detection"""
        # webrtcvad accepts only 8/16/32/48 kHz audio in 10, 20, or 30 ms frames
        if sample_rate not in (8000, 16000, 32000, 48000):
            return audio_data
        if frame_duration_ms not in (10, 20, 30):
            frame_duration_ms = 20  # Default to 20ms
        frame_length = int(sample_rate * frame_duration_ms / 1000)

        # Convert to 16-bit PCM for VAD, clipping to avoid integer overflow
        audio_16bit = (np.clip(audio_data, -1.0, 1.0) * 32767).astype(np.int16)

        frames = []
        speech_frames = []

        # Process in frames
        for i in range(0, len(audio_16bit) - frame_length + 1, frame_length):
            frame = audio_16bit[i:i + frame_length]
            try:
                is_speech = self.vad.is_speech(frame.tobytes(), sample_rate)
            except Exception:
                # If VAD fails, assume it's speech
                is_speech = True
            frames.append(frame)
            speech_frames.append(is_speech)

        # Keep only speech frames + small buffer around speech
        buffer_frames = 2  # Keep 2 frames before/after speech

        # Find speech segments
        speech_segments = []
        in_speech = False
        segment_start = 0
        last_end = 0

        for i, is_speech in enumerate(speech_frames):
            if is_speech and not in_speech:
                # Never start before the previous segment's end (avoids duplicated frames)
                segment_start = max(last_end, i - buffer_frames)
                in_speech = True
            elif not is_speech and in_speech:
                last_end = min(len(frames), i + buffer_frames)
                speech_segments.append((segment_start, last_end))
                in_speech = False

        # Handle case where speech continues to the end
        if in_speech:
            speech_segments.append((segment_start, len(frames)))

        # Reconstruct audio from the non-empty speech segments
        speech_audio = [np.concatenate(frames[start:end])
                        for start, end in speech_segments if end > start]

        if speech_audio:
            result_audio = np.concatenate(speech_audio).astype(np.float32) / 32767
        else:
            # If no speech detected, return original (might be a detection error)
            result_audio = audio_data

        return result_audio

    def _normalize_audio(self, audio_data, target_rms=0.1):
        """Normalize audio to target RMS level"""
        # Calculate current RMS
        current_rms = np.sqrt(np.mean(audio_data ** 2))

        if current_rms > 0:
            # Apply normalization with peak limiting
            normalized = audio_data * (target_rms / current_rms)

            # Ensure we don't clip
            max_val = np.max(np.abs(normalized))
            if max_val > 0.95:
                normalized = normalized * (0.95 / max_val)

            return normalized
        else:
            return audio_data

    def analyze_codec_suitability(self, audio_file, codecs=('g711_mulaw', 'g729', 'opus_voice')):
        """Analyze which codec would work best for the given audio"""
        # Load audio
        audio_data, sample_rate = librosa.load(audio_file, sr=None)

        results = {}

        for codec in codecs:
            # Process audio for this codec
            processed_audio, processed_sr = self.optimize_for_speech_recognition(
                audio_data.copy(), sample_rate, codec
            )

            # Analyze characteristics
            analysis = {
                'codec': codec,
                'target_sample_rate': processed_sr,
                'duration': len(processed_audio) / processed_sr,
                'rms_level': np.sqrt(np.mean(processed_audio ** 2)),
                'peak_level': np.max(np.abs(processed_audio)),
                'dynamic_range': self._calculate_dynamic_range(processed_audio),
                'spectral_centroid': np.mean(
                    librosa.feature.spectral_centroid(y=processed_audio, sr=processed_sr)),
                'zero_crossing_rate': np.mean(
                    librosa.feature.zero_crossing_rate(processed_audio))
            }

            # Estimate quality retention (heuristic score, not a measured metric)
            analysis['quality_score'] = self._estimate_quality_score(analysis)

            results[codec] = analysis

        return results

    def _calculate_dynamic_range(self, audio_data):
        """Calculate peak-to-RMS ratio (crest factor) in dB as a dynamic range proxy"""
        rms = np.sqrt(np.mean(audio_data ** 2))
        peak = np.max(np.abs(audio_data))

        if rms > 0:
            return 20 * np.log10(peak / rms)
        else:
            return 0

    def _estimate_quality_score(self, analysis):
        """Estimate codec quality score based on audio characteristics"""
        score = 100  # Start with perfect score

        # Penalize based on various factors
        codec = analysis['codec']

        # Sample rate penalties
        if analysis['target_sample_rate'] < 16000:
            score -= 20  # Significant penalty for narrow band

        # Dynamic range penalties
        dr = analysis['dynamic_range']
        if dr < 20:
            score -= 15  # Low dynamic range
        elif dr > 60:
            score -= 10  # Too high dynamic range might cause issues

        # Level penalties
        if analysis['rms_level'] < 0.01:
            score -= 25  # Too quiet
        elif analysis['rms_level'] > 0.5:
            score -= 15  # Too loud

        # Codec-specific adjustments
        if codec == 'g711_mulaw':
            score -= 10  # Inherent quality limitation
        elif codec == 'g729':
            score -= 20  # High compression penalty
        elif codec == 'opus_voice':
            score += 10  # Modern codec bonus

        return max(0, min(100, score))


# Usage example
if __name__ == "__main__":
    processor = VoiceCodecProcessor()

    # Analyze codec suitability
    analysis = processor.analyze_codec_suitability('speech_sample.wav')

    print("Codec Suitability Analysis:")
    for codec, results in analysis.items():
        print(f"\n{codec.upper()}:")
        print(f"  Quality Score: {results['quality_score']:.1f}/100")
        print(f"  Sample Rate: {results['target_sample_rate']} Hz")
        print(f"  RMS Level: {results['rms_level']:.3f}")
        print(f"  Dynamic Range: {results['dynamic_range']:.1f} dB")

    # Process audio for specific codec
    audio_data, sr = librosa.load('speech_sample.wav')
    optimized_audio, optimized_sr = processor.optimize_for_speech_recognition(
        audio_data, sr, 'opus_voice'
    )

    # Save optimized audio
    sf.write('optimized_speech.wav', optimized_audio, optimized_sr)

📡
Streaming & Real-Time Codecs

Optimized for Real-Time Communication

Streaming codecs prioritize low latency and network resilience over maximum compression efficiency. They're designed for real-time applications like VoIP, video conferencing, and live streaming.

Opus

Modern Universal Codec
6-510 kbps Range
2.5-60 ms Frame Sizes (about 5-66.5 ms Algorithmic Delay)
  • Ultra-low latency capability
  • Excellent quality at all bitrates
  • Adaptive to network conditions
  • Royalty-free and open standard
  • Combines SILK and CELT technologies
WebRTC · VoIP · Discord · Real-time Gaming
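Opus trades latency against efficiency mainly through its frame size, and FFmpeg's libopus encoder exposes that along with a voice-tuned mode. The helper below only builds a command line and does not run it. The option names (`-application`, `-frame_duration`, `-fec`, `-packet_loss`) are the libopus encoder options as I understand them; confirm them against `ffmpeg -h encoder=libopus` for your build. The function name and defaults are hypothetical.

```python
def opus_voice_command(src, dst, bitrate_kbps=24, frame_ms=20, expected_loss_pct=0):
    """Build (but do not run) an FFmpeg command for voice-tuned Opus encoding."""
    if frame_ms not in (2.5, 5, 10, 20, 40, 60):
        raise ValueError("Opus frame duration must be 2.5, 5, 10, 20, 40, or 60 ms")

    cmd = [
        "ffmpeg", "-y", "-i", src,
        "-c:a", "libopus",
        "-b:a", f"{bitrate_kbps}k",
        "-ac", "1",                        # mono is typical for voice
        "-application", "voip",            # tune the encoder for speech
        "-frame_duration", str(frame_ms),  # smaller frames mean lower latency
    ]
    if expected_loss_pct > 0:
        # In-band forward error correction for lossy networks
        cmd += ["-fec", "1", "-packet_loss", str(expected_loss_pct)]
    cmd.append(dst)
    return cmd


print(" ".join(opus_voice_command("speech.wav", "speech.opus", frame_ms=10,
                                  expected_loss_pct=5)))
```

Shorter frames reduce delay but cost some compression efficiency, so 20 ms is a common default for calls and 10 ms or below is reserved for latency-critical uses.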

G.722

Wideband Audio Codec
64 kbps
16 kHz Sample Rate
  • Wideband audio (50Hz-7kHz)
  • Better than G.711 quality
  • Same bitrate as G.711
  • Low computational complexity
  • HD Voice standard
HD Voice · Conference Systems · VoIP Upgrade · Business Phones

Speex

Legacy VoIP Codec
2.15-44.2 kbps Range
Variable Quality Modes
  • Designed specifically for speech
  • Multiple quality modes
  • Packet loss resilience
  • Echo cancellation support
  • Open source implementation
Legacy VoIP · Mumble · Superseded by Opus · Open Source

Real-Time Codec Considerations

Latency vs Quality Tradeoff
  • Ultra-Low Latency (< 10ms): Essential for gaming, live music
  • Low Latency (10-40ms): Good for VoIP, video calls
  • Medium Latency (40-100ms): Acceptable for most applications
  • High Latency (> 100ms): Noticeable delay, avoid for real-time
Codec | Min Latency | Packet Loss Resilience | Bandwidth Adaptation | CPU Usage
Opus | 5ms | Excellent | Yes | Low-Medium
G.711 | 0.125ms | Poor | No | Very Low
G.722 | 1ms | Poor | No | Low
G.729 | 15ms | Good | No | Medium
Speex | 30ms | Good | Limited | Medium
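Codec bitrate is only part of the network cost in real-time use: each packet also carries IP, UDP, and RTP headers, so small frames raise the effective bitrate. The sketch below assumes IPv4 with no header compression (20 + 8 + 12 = 40 bytes of headers per packet) and classifies a latency figure using the tiers listed above; the function names are illustrative.

```python
HEADER_BYTES = 40  # IPv4 (20) + UDP (8) + RTP (12), no header compression


def payload_bytes(bitrate_kbps: float, frame_ms: float) -> int:
    """Encoded audio bytes carried by one packet of the given duration."""
    return round(bitrate_kbps * 1000 / 8 * frame_ms / 1000)


def on_wire_kbps(bitrate_kbps: float, frame_ms: float) -> float:
    """Bitrate including per-packet header overhead."""
    return (payload_bytes(bitrate_kbps, frame_ms) + HEADER_BYTES) * 8 / frame_ms


def latency_tier(latency_ms: float) -> str:
    """Map a latency figure to the tiers used in this section."""
    if latency_ms < 10:
        return "ultra-low"
    if latency_ms <= 40:
        return "low"
    if latency_ms <= 100:
        return "medium"
    return "high"


for name, kbps in [("G.711", 64), ("G.729", 8), ("Opus", 32)]:
    print(f"{name}: {payload_bytes(kbps, 20)} B payload per 20 ms, "
          f"{on_wire_kbps(kbps, 20):.1f} kbps on the wire")

print(latency_tier(5), latency_tier(26.5), latency_tier(150))
```

With 20 ms packets, G.729's 8 kbps stream costs about 24 kbps on the wire, so header overhead can outweigh the codec bitrate at the low end.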

💡
Codec Selection Best Practices

Decision Framework

Choosing the right audio codec depends on your specific requirements. Use this decision tree to guide your selection:

🎯 Codec Selection Decision Tree
  1. Primary Use Case:
    • Speech Recognition → Lossless (WAV/FLAC) or high-bitrate lossy
    • Real-time Communication → Opus, G.711, G.722
    • Music/Entertainment → AAC, Opus, MP3
    • Archival/Professional → FLAC, WAV
  2. Quality Requirements:
    • Perfect Quality → Lossless codecs only
    • High Quality → AAC 256k+, Opus 128k+
    • Good Quality → AAC 128k, Opus 64k, MP3 192k
    • Acceptable Quality → Voice codecs, MP3 128k
  3. Latency Requirements:
    • Ultra-Low (< 10ms) → G.711, Opus with low latency settings
    • Low (10-40ms) → Opus, G.722
    • Normal (> 40ms) → Any codec acceptable
  4. Bandwidth Constraints:
    • Very Limited → G.729, AMR-NB, Opus at low bitrates
    • Limited → Opus, HE-AAC, MP3
    • Unlimited → Any codec, prefer quality

Implementation Tips

Codec Selection Helper
class CodecSelector:
    def __init__(self):
        self.codec_profiles = {
            'speech_recognition': {
                'recommended': ['wav', 'flac', 'opus_high'],
                'acceptable': ['aac_256', 'opus_128'],
                'avoid': ['mp3_128', 'g729', 'amr']
            },
            'real_time_voice': {
                'recommended': ['opus_voice', 'g711', 'g722'],
                'acceptable': ['speex', 'g729'],
                'avoid': ['aac', 'mp3', 'flac']
            },
            'mobile_streaming': {
                'recommended': ['aac_he', 'opus_mobile', 'amr_wb'],
                'acceptable': ['aac_lc', 'mp3_vbr'],
                'avoid': ['wav', 'flac', 'g711']
            },
            'archival': {
                'recommended': ['flac', 'wav', 'alac'],
                'acceptable': ['aac_256'],
                'avoid': ['mp3', 'opus', 'lossy_codecs']
            }
        }

        # Codecs missing from this table fall back to 'medium' defaults when scored
        self.codec_specs = {
            'wav': {'type': 'lossless', 'latency': 'ultra_low', 'bandwidth': 'high'},
            'flac': {'type': 'lossless', 'latency': 'low', 'bandwidth': 'medium_high'},
            'opus_voice': {'type': 'lossy', 'latency': 'ultra_low', 'bandwidth': 'low'},
            'opus_high': {'type': 'lossy', 'latency': 'low', 'bandwidth': 'medium'},
            'aac_256': {'type': 'lossy', 'latency': 'medium', 'bandwidth': 'medium'},
            'g711': {'type': 'lossy', 'latency': 'ultra_low', 'bandwidth': 'medium'},
            'g729': {'type': 'lossy', 'latency': 'low', 'bandwidth': 'very_low'},
            'mp3_320': {'type': 'lossy', 'latency': 'medium', 'bandwidth': 'medium'}
        }

    def recommend_codec(self, use_case, quality_priority='medium',
                        latency_requirement='medium', bandwidth_limit='medium'):
        """
        Recommend codec based on requirements

        Args:
            use_case: 'speech_recognition', 'real_time_voice', 'mobile_streaming', 'archival'
            quality_priority: 'low', 'medium', 'high', 'maximum'
            latency_requirement: 'low', 'medium', 'high' (high = strict low latency)
            bandwidth_limit: 'very_low', 'low', 'medium', 'high', 'unlimited'
        """
        recommendations = []

        # Get base recommendations for use case
        profile = self.codec_profiles.get(use_case, self.codec_profiles['real_time_voice'])
        candidates = profile['recommended'] + profile['acceptable']

        # Score each candidate against the requirements
        for codec in candidates:
            codec_info = self.codec_specs.get(codec, {})
            score = 100  # Start with perfect score

            # Quality scoring
            if quality_priority == 'maximum' and codec_info.get('type') != 'lossless':
                score -= 30
            elif quality_priority == 'high' and 'low' in codec:
                score -= 20
            elif quality_priority == 'low' and codec_info.get('type') == 'lossless':
                score -= 15  # Overkill for low quality needs

            # Latency scoring
            codec_latency = codec_info.get('latency', 'medium')
            if latency_requirement == 'high':  # Need low latency
                if codec_latency == 'ultra_low':
                    score += 20
                elif codec_latency == 'low':
                    score += 10
                elif codec_latency == 'medium':
                    score -= 20
                else:
                    score -= 40

            # Bandwidth scoring
            codec_bandwidth = codec_info.get('bandwidth', 'medium')
            bandwidth_penalties = {
                'very_low': {'high': -50, 'medium_high': -40, 'medium': -20},
                'low': {'high': -30, 'medium_high': -20, 'medium': -10},
                'medium': {'high': -10},
                'high': {},
                'unlimited': {}
            }
            penalty = bandwidth_penalties.get(bandwidth_limit, {}).get(codec_bandwidth, 0)
            score += penalty

            # Avoid codecs that are explicitly not recommended
            if codec in profile.get('avoid', []):
                score -= 50

            # Clamp so latency bonuses cannot push the score above 100
            score = max(0, min(100, score))

            recommendations.append({
                'codec': codec,
                'score': score,
                'rationale': self._generate_rationale(codec, codec_info, score)
            })

        # Sort by score and return top recommendations
        recommendations.sort(key=lambda x: x['score'], reverse=True)
        return recommendations[:3]  # Top 3 recommendations

    def _generate_rationale(self, codec, codec_info, score):
        """Generate human-readable rationale for codec recommendation"""
        reasons = []

        if codec_info.get('type') == 'lossless':
            reasons.append("perfect audio quality")

        if codec_info.get('latency') == 'ultra_low':
            reasons.append("minimal latency")
        elif codec_info.get('latency') == 'low':
            reasons.append("low latency")

        if codec_info.get('bandwidth') == 'low':
            reasons.append("efficient bandwidth usage")
        elif codec_info.get('bandwidth') == 'very_low':
            reasons.append("very low bandwidth requirements")

        if score >= 90:
            quality = "Excellent choice"
        elif score >= 70:
            quality = "Good choice"
        elif score >= 50:
            quality = "Acceptable choice"
        else:
            quality = "Not recommended"

        return f"{quality}: {', '.join(reasons) if reasons else 'meets basic requirements'}"


# Usage example
if __name__ == "__main__":
    selector = CodecSelector()

    # Example: Voice AI application
    recommendations = selector.recommend_codec(
        use_case='speech_recognition',
        quality_priority='high',
        latency_requirement='medium',
        bandwidth_limit='medium'
    )

    print("Codec Recommendations for Speech Recognition:")
    for i, rec in enumerate(recommendations, 1):
        print(f"{i}. {rec['codec'].upper()}")
        print(f"   Score: {rec['score']}/100")
        print(f"   Rationale: {rec['rationale']}")
        print()

    # Example: Real-time communication
    realtime_recs = selector.recommend_codec(
        use_case='real_time_voice',
        quality_priority='medium',
        latency_requirement='high',
        bandwidth_limit='low'
    )

    print("Codec Recommendations for Real-time Voice:")
    for i, rec in enumerate(realtime_recs, 1):
        print(f"{i}. {rec['codec'].upper()}")
        print(f"   Score: {rec['score']}/100")
        print(f"   Rationale: {rec['rationale']}")
        print()

Quality vs Compression Summary

Use Case | Primary Codec | Alternative | Avoid | Notes
Voice AI Training | FLAC/WAV | AAC 256k+ | MP3, G.729 | Quality critical for model training
Real-time VoIP | Opus | G.711, G.722 | AAC, MP3 | Latency is priority
Mobile Voice Apps | Opus, AMR-WB | HE-AAC | Lossless codecs | Bandwidth efficiency needed
Podcast/Streaming | AAC | Opus, MP3 | Voice-only codecs | Balance quality and size
Telephony Integration | G.711 | G.729 | Modern codecs | Legacy compatibility required