Team-Connect Engineering Guide · Updated 3 May 2026

WebRTC Integration

Q: What is the difference between SFU and MCU in WebRTC?

An SFU (Selective Forwarding Unit) is a server that receives one media stream from each participant and forwards copies to the others without decoding or mixing them. An MCU (Multipoint Control Unit) decodes all incoming streams, mixes them into a single composite stream, and sends one mixed stream to each participant. SFUs are cheaper to run and let clients render their own UI; MCUs use less client bandwidth but are CPU-expensive on the server. For calls with 3-50 participants an SFU is almost always the right answer; beyond 50 the trade-offs get interesting and some systems use a hybrid.

Q: Is WebRTC encrypted by default?

Yes. WebRTC mandates encryption. All media flows over SRTP keyed by DTLS (DTLS-SRTP), and the keys are generated fresh for each session via a DTLS handshake performed inside the media path. There is no way to disable encryption in standards-compliant WebRTC implementations. The signalling channel you choose, however, is your responsibility to secure. Separately, getUserMedia requires HTTPS (or localhost) on every modern browser, so the page hosting WebRTC code must be served securely or the browser will not expose the capture API at all.

Q: How do I integrate WebRTC into my application?

At a minimum: (1) call getUserMedia to capture the local camera/mic stream; (2) create an RTCPeerConnection and add the local stream to it; (3) call createOffer, setLocalDescription, and send the resulting SDP to the remote peer via your signalling channel; (4) on the remote side, setRemoteDescription with the offer, createAnswer, setLocalDescription, and send the answer back; (5) trickle ICE candidates between peers as they are gathered; (6) attach the remote stream to a video element when it arrives via the ontrack event. Add HTTPS, a STUN server, and ideally a TURN server for production, and you have a working WebRTC integration.

A practical engineer's guide to WebRTC integration — how RTCPeerConnection actually works, why ICE/STUN/TURN are mandatory, the signalling decisions you have to make yourself, and where WebRTC fits next to SIP in 2026.

W3C + IETF standard · Built into every modern browser · Includes SFU vs MCU architectures

Jump to a section

01 · Foundations

What is WebRTC?

Definition, history, where it sits in the real-time stack.

02 · API

The WebRTC JavaScript API

getUserMedia, RTCPeerConnection, RTCDataChannel with real code.

03 · Signalling

WebRTC Signalling

Why WebRTC has none, and what to use instead.

04 · Networking

ICE, STUN & TURN

NAT traversal explained, with the realistic 80/20 split.

05 · Negotiation

Media Negotiation

Offer/answer with SDP, codec selection, trickle ICE.

06 · Security

WebRTC Security

DTLS-SRTP mandatory, HTTPS, getUserMedia consent.

07 · Scaling

P2P, SFU and MCU

When mesh works, when you need a server, what the trade-offs are.

08 · Compare

WebRTC vs SIP

When to use each, how to bridge them in production.

09 · Debug

Troubleshooting

getUserMedia denied, ICE failed, one-way audio, codec mismatch.

10 · FAQ

WebRTC FAQs

The questions our voice AI customers ask most often.

01 · What is WebRTC?

WebRTC (Web Real-Time Communications) is a free, open standard that lets browsers and mobile apps send audio, video and arbitrary data peer-to-peer with no plugins required. It is built into every modern browser since 2017, baked into iOS and Android via libwebrtc, and forms the audio/video plumbing of products as varied as Google Meet, Discord, Microsoft Teams, Slack huddles, Zoom (in part), Twitter Spaces, Houseparty (RIP), and a long list of customer-support tools, video doorbells and game streaming services.

Why WebRTC exists

Before WebRTC, putting voice or video in a browser meant Flash, Silverlight, or a custom plugin. Each had different security models, codecs, and platform support, and none worked on mobile. WebRTC was a deliberate effort by Google, Mozilla and the IETF (chartered in 2011, first stable in 2013, fully standardised in 2021 as W3C Recommendation) to put a complete real-time stack inside the browser as a JavaScript API: capture, encode, encrypt, transmit, receive, decrypt, render — all of it — with no install step.

What WebRTC actually gives you

Capture — the getUserMedia() API to access the camera, microphone and screen.
Codecs — Opus for audio, VP8/VP9/H.264/AV1 for video, all negotiated automatically.
Transport — SRTP for media (encrypted, packet-loss-resilient, jitter-buffered).
NAT traversal — ICE/STUN/TURN built in, mandatory, no manual configuration.
Encryption — DTLS-SRTP mandatory; you cannot disable it.
Data channels — RTCDataChannel for arbitrary peer-to-peer data over SCTP/DTLS.

What WebRTC explicitly does NOT give you

Signalling — how the two peers find each other and exchange the offer/answer SDP. WebRTC says "use whatever you want": WebSocket, HTTP polling, MQTT, Firebase, SIP-over-WebSocket. Pick one.
Authentication or identity — that is your application's job.
Multi-party media routing — for more than 2-3 participants you need server-side help (an SFU or MCU). WebRTC alone is point-to-point.
Recording or storage — you can capture the local streams to disk, but anything cloud-side requires extra infrastructure.

The right mental model: WebRTC is the media plane. Everything else — signalling, auth, presence, multi-party routing, recording — is your application's responsibility. This is the opposite of SIP, which standardises signalling and is silent on browser implementation. The two protocols complement each other rather than competing.

02 · The WebRTC JavaScript API

The entire WebRTC API for a basic peer-to-peer call boils down to three objects: MediaStream (what you capture), RTCPeerConnection (the call), and optionally RTCDataChannel (arbitrary data). Here is a minimum-viable WebRTC integration as actual JavaScript.

Step 1: Capture local media with getUserMedia

Request camera and microphone access

// Returns a Promise that resolves to a MediaStream
const localStream = await navigator.mediaDevices.getUserMedia({
  audio: true,
  video: { width: { ideal: 1280 }, height: { ideal: 720 } }
});

// Show the local preview
document.getElementById('localVideo').srcObject = localStream;

The browser will prompt the user for permission. If they deny, the Promise rejects with NotAllowedError. If there is no camera, NotFoundError. getUserMedia only works on HTTPS (or localhost for development); on plain HTTP the API does not exist on the page at all.
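The rejection cases above can be turned into user-facing hints. The following is a minimal sketch, not a definitive implementation: describeMediaError and startCapture are hypothetical helper names, the message wording is illustrative, and mediaDevices is passed in as a parameter (in a browser you would pass navigator.mediaDevices) so the logic can be exercised outside a page.

```javascript
// Hypothetical helper: maps a getUserMedia rejection to a hint for the user.
// The error names are standard DOMException names; the wording is illustrative.
function describeMediaError(err) {
  switch (err && err.name) {
    case 'NotAllowedError':
      return 'Permission was denied. Allow camera and microphone access in the browser settings.';
    case 'NotFoundError':
      return 'No camera or microphone was found on this device.';
    case 'NotReadableError':
      return 'The camera or microphone could not be read. Another application may be using it.';
    case 'OverconstrainedError':
      return 'No device satisfies the requested constraints.';
    default:
      return 'Could not start capture: ' + (err && err.message ? err.message : 'unknown error');
  }
}

// Hypothetical wrapper: resolves to { stream, error } instead of throwing.
// In a browser, pass navigator.mediaDevices as the first argument.
async function startCapture(mediaDevices, constraints) {
  try {
    const stream = await mediaDevices.getUserMedia(constraints);
    return { stream, error: null };
  } catch (err) {
    return { stream: null, error: describeMediaError(err) };
  }
}
```

In a page this might be called as startCapture(navigator.mediaDevices, { audio: true, video: true }), with the returned error string shown next to the call button.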

Step 2: Create the peer connection

Set up RTCPeerConnection with ICE servers

const peerConnection = new RTCPeerConnection({
  iceServers: [
    { urls: 'stun:stun.l.google.com:19302' },
    {
      urls: 'turn:turn.example.com:3478',
      username: 'webrtc',
      credential: 'YOUR_TURN_PASSWORD'
    }
  ]
});

// Add every track from the local stream to the connection
localStream.getTracks().forEach(track => {
  peerConnection.addTrack(track, localStream);
});

// Handle remote tracks when they arrive
peerConnection.ontrack = (event) => {
  document.getElementById('remoteVideo').srcObject = event.streams[0];
};

// Handle ICE candidates as they are discovered (trickle ICE)
peerConnection.onicecandidate = (event) => {
  if (event.candidate) {
    signalingChannel.send({ type: 'candidate', candidate: event.candidate });
  }
};

Step 3: The offer/answer dance via your signalling channel

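Caller sends the offer; both sides handle incoming messages

// signalingChannel is your own transport (for example a WebSocket wrapper).
// Its send() and onMessage are application-defined, not part of WebRTC.

// CALLER: create the offer and send it
const offer = await peerConnection.createOffer();
await peerConnection.setLocalDescription(offer);
signalingChannel.send({ type: 'offer', sdp: offer.sdp });

// BOTH SIDES: react to messages from the remote peer
signalingChannel.onMessage = async (msg) => {
  if (msg.type === 'offer') {
    // CALLEE: apply the offer, then create and send the answer
    await peerConnection.setRemoteDescription({ type: 'offer', sdp: msg.sdp });
    const answer = await peerConnection.createAnswer();
    await peerConnection.setLocalDescription(answer);
    signalingChannel.send({ type: 'answer', sdp: answer.sdp });
  } else if (msg.type === 'answer') {
    // CALLER: apply the callee's answer
    await peerConnection.setRemoteDescription({ type: 'answer', sdp: msg.sdp });
  } else if (msg.type === 'candidate') {
    // EITHER SIDE: add an ICE candidate trickled by the remote peer
    await peerConnection.addIceCandidate(msg.candidate);
  }
};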

That is essentially the whole thing for a 1:1 call. Once setLocalDescription and setRemoteDescription have both been called and ICE has gathered enough candidates, media starts flowing automatically. The browser handles encryption, packet loss recovery, jitter buffering and codec negotiation under the hood.

Optional: data channel for arbitrary data

Send arbitrary text or binary data peer-to-peer

// Caller creates the data channel BEFORE creating the offer
const channel = peerConnection.createDataChannel('chat', { ordered: true });
channel.onopen = () => channel.send('Hello!');
channel.onmessage = (e) => console.log('Received:', e.data);

// Callee receives it via the ondatachannel event
peerConnection.ondatachannel = (event) => {
  const channel = event.channel;
  channel.onmessage = (e) => console.log('Received:', e.data);
};

RTCDataChannel runs over SCTP-over-DTLS, so it inherits the same encryption as media. It supports both reliable+ordered (TCP-like) and unreliable+unordered (UDP-like) modes. Use cases: P2P file transfer, multiplayer game state, screen-sharing annotations, low-latency telemetry.
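The two delivery modes are selected through the options passed to createDataChannel(). A small sketch, assuming a hypothetical helper named dataChannelOptions and made-up mode names ('reliable', 'lossy', 'timed'); ordered, maxRetransmits and maxPacketLifeTime are the standard option fields, and the last two cannot be combined.

```javascript
// Hypothetical helper: returns createDataChannel() options for a delivery mode.
// The mode names are assumptions made for this example.
function dataChannelOptions(mode, lifetimeMs = 250) {
  switch (mode) {
    case 'reliable':
      // Ordered with unlimited retransmission (the default behaviour).
      return { ordered: true };
    case 'lossy':
      // Unordered, never retransmitted: late data is simply dropped.
      return { ordered: false, maxRetransmits: 0 };
    case 'timed':
      // Unordered, retransmitted only while the message is younger than lifetimeMs.
      return { ordered: false, maxPacketLifeTime: lifetimeMs };
    default:
      throw new Error('Unknown data channel mode: ' + mode);
  }
}

// Browser usage (not executed here):
// const telemetry = peerConnection.createDataChannel('telemetry', dataChannelOptions('lossy'));
```

File transfer would normally use 'reliable', while game state or telemetry that is stale after a few hundred milliseconds is a better fit for 'lossy' or 'timed'.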

03 · WebRTC Signalling: You Bring Your Own

Of all the surprising things about WebRTC, this is the most confusing for engineers coming from SIP: WebRTC has no signalling protocol. The W3C and IETF deliberately punted on it. They standardised the media plane and left signalling entirely to applications.

Why no built-in signalling?

The reasoning was that signalling needs are wildly different across applications. A Slack-style team chat needs presence, message history, and per-user routing. A click-to-call button needs only a one-shot session setup. A video doorbell needs push notifications to wake the recipient device. A peer-to-peer game needs lobby logic. No single signalling protocol could fit all of these without forcing trade-offs, so the standards committees left the field open.

The realistic options in 2026

Signalling channel	What it actually is	When to use it
WebSocket + JSON	A persistent bidirectional channel, your own message schema	Most new applications. Fits naturally with browser-side WebRTC.
SIP-over-WebSocket (RFC 7118)	Standard SIP messages tunnelled inside WebSocket frames	Bridging WebRTC clients into existing SIP infrastructure.
Long-poll HTTP	Repeated HTTP requests with hold-open semantics	Behind aggressive corporate proxies that block WebSocket.
MQTT	Pub/sub messaging over a broker	IoT scenarios where MQTT is already in the stack.
Firebase / Pusher / Ably	Hosted real-time messaging services	Skipping signalling infrastructure entirely; pay per session.
Matrix / XMPP	Federated messaging protocols	Federated communications where ownership matters.

What your signalling channel actually carries

For a 1:1 call you need to exchange three kinds of message:

Session description (SDP) offer and answer — the codec and media negotiation, identical to SIP's offer/answer.
ICE candidates — possible network addresses each side can use, sent as they are discovered (trickle ICE).
Application messages — "hang up", "mute", "user X has joined the room", and whatever else your app needs.

That is a tiny amount of structured data. A typical signalling implementation is 50-200 lines of server code plus matching client code, which is why most teams build their own rather than adopt SIP.
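To give a sense of how little server logic is involved, here is a sketch of the relay at the heart of a signalling server. It is transport-agnostic on purpose: createRoom, the message type names and the from field are assumptions made for this example, and each member's send function would in practice wrap a WebSocket or similar connection.

```javascript
// Message types this illustrative relay accepts (an assumed schema).
const ALLOWED_TYPES = new Set(['offer', 'answer', 'candidate', 'hangup']);

// Hypothetical in-memory room: forwards each valid message to every other member.
function createRoom() {
  const members = new Map(); // peerId -> send(messageString)
  return {
    join(peerId, send) { members.set(peerId, send); },
    leave(peerId) { members.delete(peerId); },
    // Returns how many peers the message was delivered to.
    relay(fromId, rawMessage) {
      let msg;
      try {
        msg = JSON.parse(rawMessage);
      } catch (err) {
        return 0; // not JSON: drop it
      }
      if (!msg || !ALLOWED_TYPES.has(msg.type)) return 0;
      // Stamp the sender so the receiver knows who the message came from.
      const out = JSON.stringify({ ...msg, from: fromId });
      let delivered = 0;
      for (const [peerId, send] of members) {
        if (peerId !== fromId) {
          send(out);
          delivered += 1;
        }
      }
      return delivered;
    }
  };
}
```

A real server adds authentication and room lookup around this, but the media-related part of signalling rarely needs more than a relay of this shape.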

The most common signalling mistake: treating the exchange as a strict, lock-step sequence. Messages do need to be delivered, and an answer only makes sense after its offer, but ICE candidates are asynchronous by design: with trickle ICE they can arrive before, between or after the SDP messages. Handle each message as it arrives, and hold any candidate that shows up before setRemoteDescription() has completed, because addIceCandidate() rejects when there is no remote description to attach it to.
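Because trickled candidates can reach a peer before the offer or answer has been applied, one common approach is to hold them in a small queue and flush it once the remote description is set. A minimal sketch, assuming a hypothetical createCandidateBuffer wrapper; pc only needs the standard setRemoteDescription() and addIceCandidate() methods.

```javascript
// Hypothetical wrapper: holds remote ICE candidates until the remote
// description has been applied, then adds them in arrival order.
function createCandidateBuffer(pc) {
  const pending = [];
  let remoteSet = false;
  return {
    // Apply the remote offer or answer, then flush anything that arrived early.
    async setRemote(description) {
      await pc.setRemoteDescription(description);
      remoteSet = true;
      while (pending.length > 0) {
        await pc.addIceCandidate(pending.shift());
      }
    },
    // Add the candidate now if possible, otherwise queue it.
    async addCandidate(candidate) {
      if (remoteSet) {
        await pc.addIceCandidate(candidate);
      } else {
        pending.push(candidate);
      }
    },
    pendingCount() {
      return pending.length;
    }
  };
}
```

In a message handler, 'offer' and 'answer' messages would go through setRemote() and 'candidate' messages through addCandidate(), so arrival order no longer matters.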

04 · NAT Traversal: ICE, STUN and TURN

Most internet endpoints sit behind NAT (Network Address Translation), which maps a private IP (192.168.x.x, 10.x.x.x) to a shared public IP. Two NATed peers cannot send each other media directly without help — they don't know their own public addresses, and even if they did, the firewall might drop unsolicited inbound traffic. WebRTC solves this with three protocols layered together.

STUN: "What IP do you see me on?"

STUN (Session Traversal Utilities for NAT, RFC 8489) is a small protocol where a client asks a public STUN server "what source IP and port did you receive my packet from?" The server replies with what it saw, which is the client's public-facing NAT mapping. The client now knows a public address it can advertise. STUN is cheap and stateless, and Google operates a free public server (stun:stun.l.google.com:19302) that is very widely used. Roughly 80-90% of WebRTC sessions complete using only STUN-discovered addresses.

TURN: "Just relay everything for me"

TURN (Traversal Using Relays around NAT, RFC 8656) is the fallback for the 10-20% of sessions where STUN isn't enough — usually when one or both peers are behind a "symmetric NAT" or aggressive corporate firewall. A TURN server allocates a public address on behalf of the client and relays all media through itself. It works in essentially every network environment because it looks like a regular outbound TCP/UDP connection. The downside: TURN servers carry the actual media traffic, which costs bandwidth (and money) proportional to call volume.

ICE: the algorithm that uses both

ICE (Interactive Connectivity Establishment, RFC 8445) is the algorithm that combines STUN, TURN and direct local addresses into a working connection. Each peer:

Gathers candidates: every local IP, every STUN-discovered public address, every TURN-allocated relay address.
Sends candidates to the remote peer via signalling (trickle ICE = send each as discovered, don't wait).
Pairs each local candidate with each remote candidate and tries connectivity checks on each pair.
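Media can start flowing on the first pair that passes its connectivity check.
If a higher-priority pair passes later, ICE can select it and move media over. This is ordinary candidate selection, not an ICE restart: a restart discards the current candidates and gathers fresh ones, for example after a network change.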
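The pairing step can be made concrete with the priority formulas from RFC 8445. The sketch below is a simplification under stated assumptions: it uses the RFC's recommended type preferences, a single local preference and component, and it pairs every local candidate with every remote one (a real agent also matches component and address family and prunes redundant pairs). The function names are hypothetical.

```javascript
// Type preferences recommended by RFC 8445.
const TYPE_PREFERENCE = { host: 126, prflx: 110, srflx: 100, relay: 0 };

// priority = 2^24 * type preference + 2^8 * local preference + (256 - component ID)
function candidatePriority(type, localPreference = 65535, componentId = 1) {
  return TYPE_PREFERENCE[type] * 2 ** 24 + localPreference * 2 ** 8 + (256 - componentId);
}

// pair priority = 2^32 * MIN(G, D) + 2 * MAX(G, D) + (G > D ? 1 : 0)
// G is the controlling agent's candidate priority, D the controlled agent's.
// BigInt is used because the result does not fit in 53 bits.
function pairPriority(controllingPriority, controlledPriority) {
  const g = BigInt(controllingPriority);
  const d = BigInt(controlledPriority);
  const min = g < d ? g : d;
  const max = g > d ? g : d;
  return (1n << 32n) * min + 2n * max + (g > d ? 1n : 0n);
}

// Pair every local candidate with every remote one, highest priority first.
function buildChecklist(localCandidates, remoteCandidates, weAreControlling = true) {
  const pairs = [];
  for (const local of localCandidates) {
    for (const remote of remoteCandidates) {
      const lp = candidatePriority(local.type);
      const rp = candidatePriority(remote.type);
      const priority = weAreControlling ? pairPriority(lp, rp) : pairPriority(rp, lp);
      pairs.push({ local, remote, priority });
    }
  }
  pairs.sort((a, b) => (a.priority < b.priority ? 1 : a.priority > b.priority ? -1 : 0));
  return pairs;
}
```

The effect of the formula is that direct (host and server-reflexive) pairs are checked before relayed ones, which is why TURN only ends up carrying media when nothing cheaper works.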

STUN vs TURN at a glance

Aspect	STUN	TURN
Role	Discovery only	Media relay
Carries media?	No	Yes — all of it
Bandwidth cost to operator	Negligible (a few packets per session)	Significant (full session bandwidth, both directions)
Share of sessions	~80-90% connect with STUN alone	~10-20% need a relay
Public free option	Yes — Google, Cloudflare, etc.	No — you must run or pay for one
Authentication	None needed	Required — otherwise anyone can relay through you

You need a TURN server in production. A WebRTC application running with only STUN will work for roughly 80-90% of users and silently fail for the remaining 10-20%, who happen to include most enterprise users behind corporate firewalls. Run your own TURN (coturn is the de-facto open-source server) or use a hosted service (Twilio, Xirsys, Cloudflare Calls). Skipping TURN to save money is a recurring root cause of "WebRTC works in dev but fails in production for some users".

05 · Media Negotiation: Offer/Answer with SDP

WebRTC reuses the same offer/answer model that SIP uses, with the same SDP (Session Description Protocol) body format. If you already understand SIP/SDP this section is mostly familiar — the differences are at the edges.

The flow in WebRTC terms

Caller: createOffer() → produces an SDP describing what the caller wants to send and receive.
Caller: setLocalDescription(offer) → commits the offer locally.
Caller sends the SDP to the callee via signalling.
Callee: setRemoteDescription(offer) → tells WebRTC about the offer.
Callee: createAnswer() → produces an SDP picking from the offer's options.
Callee: setLocalDescription(answer) → commits the answer.
Callee sends the answer SDP back via signalling.
Caller: setRemoteDescription(answer) → locks in the agreed parameters.

What a WebRTC SDP offer actually looks like

A typical WebRTC offer (truncated for readability)

v=0
o=- 8623892631876489173 2 IN IP4 127.0.0.1
s=-
t=0 0
a=group:BUNDLE 0 1
a=msid-semantic: WMS local-stream

m=audio 9 UDP/TLS/RTP/SAVPF 111 103 9 0 8
c=IN IP4 0.0.0.0
a=rtcp:9 IN IP4 0.0.0.0
a=ice-ufrag:F7gI
a=ice-pwd:x9cml/YzichV2+XlhiMu8g
a=fingerprint:sha-256 4A:AD:B9:B1:3F:82:18:3B:54:02:12:DF:3E:5D:49:6B:19:E5:7C:AB
a=setup:actpass
a=mid:0
a=sendrecv
a=rtpmap:111 opus/48000/2
a=fmtp:111 minptime=10;useinbandfec=1
a=rtcp-fb:111 transport-cc

m=video 9 UDP/TLS/RTP/SAVPF 96 97 98 99 100 101
c=IN IP4 0.0.0.0
a=rtpmap:96 VP8/90000
a=rtpmap:97 VP9/90000
a=rtpmap:99 H264/90000
a=rtpmap:100 AV1/90000
a=fmtp:99 profile-level-id=42e01f;level-asymmetry-allowed=1
a=rtcp-fb:96 nack
a=rtcp-fb:96 nack pli
a=rtcp-fb:96 ccm fir
a=rtcp-fb:96 transport-cc

Compared to a SIP-side SDP, the differences are concentrated in three places:

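Profile is UDP/TLS/RTP/SAVPF, not RTP/AVP. SAVPF is the secure audio/video profile with feedback: the S means the media is SRTP, so encryption is mandatory, and the F means RTCP feedback messages are supported (NACK, PLI, FIR for video error recovery). The UDP/TLS prefix signals that the SRTP keys come from a DTLS handshake.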
DTLS fingerprint instead of SDES keys. The a=fingerprint line lets the peer verify the DTLS handshake. Keys are exchanged in-band on the media port via DTLS, not in SDP.
BUNDLE. The a=group:BUNDLE 0 1 line says all media streams will share a single ICE/DTLS connection rather than each getting its own. Saves a port and an ICE round-trip.
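When debugging negotiation it often helps to see which codecs and which profile each media section actually offers. The following sketch reads only the m= and a=rtpmap lines; listCodecs is a hypothetical helper, not a full SDP parser, and it ignores fmtp and rtcp-fb attributes.

```javascript
// Hypothetical helper: lists the profile and codecs offered per media section.
// Reads only m= and a=rtpmap lines; everything else is skipped.
function listCodecs(sdp) {
  const sections = [];
  let current = null;
  for (const line of sdp.split(/\r?\n/)) {
    if (line.startsWith('m=')) {
      // m=<kind> <port> <profile> <payload types...>
      const [kind, , profile] = line.slice(2).split(' ');
      current = { kind, profile, codecs: [] };
      sections.push(current);
    } else if (current && line.startsWith('a=rtpmap:')) {
      // a=rtpmap:<payload type> <name>/<clock rate>[/<channels>]
      const match = /^a=rtpmap:(\d+) ([^/]+)\/(\d+)/.exec(line);
      if (match) {
        current.codecs.push({
          payloadType: Number(match[1]),
          name: match[2],
          clockRate: Number(match[3])
        });
      }
    }
  }
  return sections;
}
```

Calling listCodecs(peerConnection.localDescription.sdp) in a browser console gives a quick view of what was offered; a section whose profile is not UDP/TLS/RTP/SAVPF did not come from a WebRTC endpoint.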

Trickle ICE: candidates flow asynchronously

The browser starts gathering ICE candidates once setLocalDescription() has been called. As each candidate is gathered, the onicecandidate event fires; you forward each one through signalling, and the remote peer adds it via addIceCandidate(). This means the SDP exchange and the ICE exchange happen in parallel, which gets the call up faster than waiting for all candidates before sending the offer.

You will sometimes see non-trickle implementations where the SDP isn't sent until ICE gathering completes. This is simpler to implement but slower and incompatible with some servers. For new code, always trickle.

06 · WebRTC Security

WebRTC was designed in 2011 with security baked in, not bolted on. There is no plaintext WebRTC. There is no opt-out from encryption. The browser will refuse to grant camera or microphone access on insecure pages. This is unusual for a real-time protocol and worth understanding because it shapes what you can and cannot do.

Mandatory encryption: DTLS-SRTP

All WebRTC media flows over SRTP, with keys generated by a DTLS handshake performed inside the media path itself. The DTLS handshake produces a master secret that both peers derive SRTP keys from. The fingerprint of each peer's DTLS certificate is published in SDP, so peers verify they are talking to the entity the signalling channel claimed.

This means: even if your signalling channel is compromised, an attacker cannot decrypt the media without also being a man-in-the-middle on the DTLS handshake. As long as the SDP fingerprints survive transit unmodified, the media is safe. Compromising the signalling alone gets the attacker session metadata (who called whom, when) but not call content.

HTTPS required for getUserMedia

Every modern browser refuses to expose navigator.mediaDevices.getUserMedia on plain HTTP pages. You must serve over HTTPS or your WebRTC code will throw TypeError: navigator.mediaDevices is undefined. The one exception is http://localhost, which the browser treats as a "secure context" for development convenience.

Practical implication: if you are deploying a WebRTC app, the entire page must be served over HTTPS, including any iframes containing the WebRTC code. There is no graceful degradation.
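Since there is no graceful degradation, it is worth checking for a secure context before showing any call UI. A minimal sketch: captureSupport and its reason strings are hypothetical, while isSecureContext and navigator.mediaDevices are standard browser properties. The window object is passed in so the check can be exercised with a stub.

```javascript
// Hypothetical pre-flight check: can this page ask for camera/microphone at all?
// Pass the global window object in a browser.
function captureSupport(win) {
  if (!win.isSecureContext) {
    // Plain HTTP (other than localhost): the capture API is not exposed.
    return { ok: false, reason: 'insecure-context' };
  }
  const mediaDevices = win.navigator && win.navigator.mediaDevices;
  if (!mediaDevices || typeof mediaDevices.getUserMedia !== 'function') {
    return { ok: false, reason: 'unsupported' };
  }
  return { ok: true, reason: null };
}

// Browser usage (not executed here):
// const support = captureSupport(window);
// if (!support.ok) showBanner(support.reason); // showBanner is a placeholder
```

Checking isSecureContext first lets the page tell "served over HTTP" apart from "browser has no capture API", which are fixed by different people.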

User consent for capture

The first call to getUserMedia() on a given origin triggers a browser permission prompt. The user can grant access, deny it, or grant for this session only. Browsers remember this decision per origin (across pages, but not across origins), and most browsers display a persistent indicator (camera light, address-bar icon) while capture is active. There is no way for a WebRTC application to capture media silently — this is enforced at the browser level.

What WebRTC does NOT secure

Signalling — that is your responsibility. Use WSS (WebSocket over TLS), HTTPS, or whatever secure transport fits your signalling protocol.
TURN credentials — TURN passwords sent in JavaScript can be inspected by any user. Use short-lived (TTL) TURN credentials issued per-session by your back-end.
Application authentication — "is this user actually who they say they are" is your job, not WebRTC's.
IP address privacy — WebRTC will reveal the user's local IP addresses by default during ICE gathering. This is a known privacy concern; mitigations include using only TURN-relay candidates (iceTransportPolicy: 'relay') at the cost of forcing all media through your TURN server.
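For the short-lived TURN credentials mentioned in the list above, one widely used scheme is the shared-secret mode supported by coturn: the username carries an expiry timestamp and the password is an HMAC over that username. The sketch below is server-side Node.js and assumes your TURN server is configured for that mode with the same secret; issueTurnCredentials and its parameters are hypothetical names.

```javascript
const crypto = require('crypto');

// Hypothetical back-end helper: issues time-limited TURN credentials.
// Assumes the TURN server uses coturn-style shared-secret authentication,
// where username = "<unix expiry>:<user id>" and
// credential = base64(HMAC-SHA1(sharedSecret, username)).
function issueTurnCredentials(sharedSecret, userId, ttlSeconds = 3600, nowMs = Date.now()) {
  const expiry = Math.floor(nowMs / 1000) + ttlSeconds;
  const username = `${expiry}:${userId}`;
  const credential = crypto
    .createHmac('sha1', sharedSecret)
    .update(username)
    .digest('base64');
  return { username, credential, ttl: ttlSeconds };
}
```

The client receives username and credential from your API just before the call and places them in the iceServers entry for the TURN URL. A leaked pair stops working once the expiry passes, and the long-term secret never reaches the browser.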

Identity verification: WebRTC's optional identity assertion mechanism (setIdentityProvider() and peerIdentity on RTCPeerConnection, backed by an external identity provider) lets you tie a peer's DTLS fingerprint to a verified identity (e.g. a Google or Mozilla account). Almost no-one uses this in 2026, and browser support is limited. Most production deployments rely on the application-level signalling auth instead, which works fine but does mean a compromised signalling server can impersonate participants.

07 · WebRTC Architectures: P2P, SFU and MCU

WebRTC is point-to-point at the protocol level, but real-world group calls need server-side help. There are three patterns, each with sharp trade-offs.

P2P / Mesh

Each participant connects directly to every other participant. For N participants you need N×(N-1)/2 connections globally and (N-1) connections per client. Simple to build — no media server needed — and the call quality is excellent because there is no extra hop. Falls apart above 4-5 participants because each client has to encode and upload its stream multiple times in parallel, which kills upload bandwidth and CPU on consumer devices.

SFU (Selective Forwarding Unit)

A server in the middle that receives one stream from each participant and forwards copies to the others. The server does not decode or mix the media; it forwards RTP packets as they arrive. In a standard deployment the SFU terminates DTLS-SRTP on each leg, so packets are decrypted and re-encrypted per hop but never transcoded. Each client uploads once and downloads once per other participant. Scales comfortably to 25-50 participants, with simulcast (multiple resolution layers per upload) extending that further. Most modern video platforms (Google Meet, Daily, Jitsi, Zoom in part, Discord) use SFUs.

SFU advantages: clients render their own UI (every video tile is a separate stream), bandwidth efficient, server CPU light. Disadvantages: each client downloads many streams (heavy on receive bandwidth), debug complexity, simulcast configuration is non-trivial.

MCU (Multipoint Control Unit)

A server that decodes every incoming stream, mixes them into a single composite stream (e.g. tiled video, mixed audio), and sends one stream to each participant. Each client uploads once and downloads once total — a fixed cost regardless of participant count. Scales to hundreds of participants because the per-client load is constant. Used by traditional conferencing systems and webinar platforms.

MCU advantages: lowest client bandwidth, identical view for everyone, works on weak clients. Disadvantages: server CPU is enormous (full transcode per session), all clients see the same layout, latency higher than SFU because of the mix step.

The trade-off table

Aspect	P2P / Mesh	SFU	MCU
Server cost	None	Low — routes packets	High — transcodes everything
Client upload bandwidth	(N-1) × stream	1 × stream (with simulcast: 2-3×)	1 × stream
Client download bandwidth	(N-1) × stream	(N-1) × stream	1 × stream
Practical max participants	4-5	25-50	200+
UI flexibility	Per-client	Per-client	Server decides layout
Latency overhead	None	~5-10ms	~50-100ms
Recording difficulty	Hard — no central point	Easy — tap server	Trivial — output is mixed
Open-source options	Just WebRTC	mediasoup, Jitsi, Janus, LiveKit	FreeSWITCH, Janus (mixing plugins)
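The connection and stream counts in the table follow directly from the participant count. A small worked sketch, assuming one stream per participant and no simulcast; topologyLoad is a hypothetical helper name.

```javascript
// Hypothetical helper: connection and stream counts for N participants,
// assuming one stream per participant and no simulcast.
function topologyLoad(topology, participants) {
  const n = participants;
  if (!Number.isInteger(n) || n < 2) {
    throw new Error('participants must be an integer of at least 2');
  }
  switch (topology) {
    case 'mesh':
      // Every pair of peers has its own connection.
      return { totalConnections: (n * (n - 1)) / 2, uploadsPerClient: n - 1, downloadsPerClient: n - 1 };
    case 'sfu':
      // One connection per client to the server; one download per other participant.
      return { totalConnections: n, uploadsPerClient: 1, downloadsPerClient: n - 1 };
    case 'mcu':
      // One connection per client; a single mixed stream comes back.
      return { totalConnections: n, uploadsPerClient: 1, downloadsPerClient: 1 };
    default:
      throw new Error('Unknown topology: ' + topology);
  }
}
```

For example, topologyLoad('mesh', 5) gives 10 connections overall and 4 uploads per client, while topologyLoad('sfu', 5) keeps the upload at 1. The upload column is what makes mesh impractical beyond a handful of participants.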

Which one for your use case

1:1 calling — P2P. Anything else is over-engineering.
Small team meetings (3-25 people) — SFU. The default modern choice.
Webinars and large rooms (50+ people) — SFU with simulcast, or MCU if clients are weak.
Voice AI / call centre integrations — SFU at the edge, then bridge media into your AI pipeline. The SFU pattern is also what Team-Connect uses to multiplex multiple AI agents into a single call without forcing the customer's phone to handle multiple streams.
Live streaming (one-to-many) — technically possible with SFU, but at scale you should look at HLS or LL-HLS instead. WebRTC makes sense up to maybe 1000 viewers; beyond that it gets uneconomical.

08 · WebRTC vs SIP (and How to Bridge Them)

SIP and WebRTC are not competitors — they solve different parts of the same problem and are routinely used together. The real-world pattern in 2026 is "WebRTC at the browser, SIP at the carrier, gateway in the middle".

Side-by-side comparison

Feature	SIP	WebRTC
Standardised since	2002 (RFC 3261)	2011 chartered, 2021 W3C Recommendation
Where it runs	Phones, PBXs, gateways, carriers	Browsers, mobile apps, embedded SDKs
Signalling	Defined by the protocol itself	Application's responsibility
Default transport	UDP/TCP/TLS port 5060/5061	UDP/TCP, ephemeral ports, BUNDLEd
NAT traversal	External (SBC, manual config)	Built-in (ICE, STUN, TURN mandatory)
Encryption	Optional (SIPS, SRTP)	Mandatory (DTLS-SRTP)
Codec negotiation	SDP offer/answer	SDP offer/answer (same model)
PSTN interoperability	Native	Requires a SIP gateway
Browser support	None natively (needs SIP-over-WebSocket gateway)	Universal in modern browsers
Best at	Carrier-grade, legacy phones, dedicated devices	Browser-native, mobile apps, modern UX

How a SIP-WebRTC gateway actually works

A SIP-WebRTC gateway is a B2BUA (Back-to-Back User Agent) that terminates one protocol on each side. When a browser-based caller wants to reach a phone number:

Browser sends an SDP offer over WebRTC to the gateway. SDP profile is UDP/TLS/RTP/SAVPF, candidates are ICE, encryption is DTLS-SRTP.
Gateway translates the SDP into SIP-flavoured form: RTP/AVP profile (or RTP/SAVP for SRTP), removes ICE candidates, swaps DTLS fingerprint for SDES keys.
Gateway sends the translated SDP as a SIP INVITE to the carrier's SIP trunk.
Phone rings. When answered, the gateway receives 200 OK with the phone's SDP, translates it back, and sends it as a WebRTC answer to the browser.
Media flows: browser↔gateway is DTLS-SRTP, gateway↔carrier is plain RTP or SRTP. The gateway transcodes if codecs don't match (Opus from browser, G.711 to carrier is the common case).

Open-source gateways worth knowing

FreeSWITCH — the original SIP/WebRTC SBC. Production-grade, large community, Lua/JavaScript scripting.
Asterisk with chan_pjsip and res_http_websocket — popular for smaller deployments, dialplan-driven.
Janus Gateway — modular WebRTC server with a SIP plugin. Good for application-specific gateways.
Kamailio — SIP proxy with WebSocket support. Pairs with RTPEngine for media.
Drachtio — Node.js SIP framework, modern API, good for custom gateway logic.

The pragmatic 2026 architecture: if you are building a voice AI product, WebRTC at the browser/app, SIP at the PSTN edge, and a B2BUA gateway in the middle bridging the two. Add an SFU for any multi-party scenarios. Keep your AI back-end on a clean WebSocket protocol — do not try to make your LLM speak SIP. Translate at the edges, keep the middle simple.

09 · Common WebRTC Issues and Troubleshooting

WebRTC failures usually fall into one of seven recurring patterns. Triage in this order.

"navigator.mediaDevices is undefined"

The page is on plain HTTP. Browsers do not expose getUserMedia on insecure origins. Move the page to HTTPS (or to http://localhost for dev) and the API will appear. There is no workaround; this is enforced at the browser level.

getUserMedia rejected with NotAllowedError

The user denied permission, or the browser blocked the prompt because of a previous denial. Check the address-bar permissions UI. If permission was previously denied for the origin, the user has to manually unblock it — a fresh getUserMedia call won't re-prompt. Always handle this error gracefully and explain to the user how to unblock.

ICE failed (signalling completes but media never starts)

The most common production WebRTC failure. ICE could not find a candidate pair that works for both sides. Almost always means TURN is missing or misconfigured. Check the iceConnectionState events: a transition to failed means no candidate pair succeeded. Verify your TURN server is reachable from real client networks (test from a guest WiFi, not just your office), the credentials are valid, and you are listening on both UDP and TCP (corporate firewalls often block UDP entirely).

One-way audio or video

Connection establishes, but only one direction works. Common causes: the callee adds its local tracks only after createAnswer() has run, so the answer is negotiated as receive-only; a=sendonly or a=recvonly appears in the SDP when a=sendrecv was expected; or one side's NAT allows outbound traffic but blocks inbound on the chosen candidate. Diagnosis: check peerConnection.getStats() for the inbound and outbound RTP streams. If one direction shows zero packets received, the path in that direction is broken regardless of what the other direction is doing.

Codec mismatch / black video

Connection up, audio works, video shows but is black or frozen. Causes: the H.264 profile your encoder produced isn't supported by the decoder (check the profile-level-id in SDP); hardware acceleration is failing; or the receiver's RTCRtpReceiver is dropping frames. Quick test: switch the codec to VP8 (which has the widest support) and see if the issue persists.
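One way to run the VP8 test is to reorder the codec list handed to setCodecPreferences() before negotiating. The reordering itself is plain array work; preferCodec is a hypothetical helper, while RTCRtpReceiver.getCapabilities() and RTCRtpTransceiver.setCodecPreferences() are standard APIs whose availability varies somewhat between browser versions.

```javascript
// Hypothetical helper: moves every codec entry with the given MIME type to the
// front, keeping the relative order of everything else.
function preferCodec(codecs, mimeType) {
  const wanted = mimeType.toLowerCase();
  const preferred = codecs.filter((c) => c.mimeType.toLowerCase() === wanted);
  const rest = codecs.filter((c) => c.mimeType.toLowerCase() !== wanted);
  return [...preferred, ...rest];
}

// Browser usage (not executed here). This must run before createOffer() or
// createAnswer(), and assumes a video transceiver already exists:
// const { codecs } = RTCRtpReceiver.getCapabilities('video');
// const transceiver = peerConnection.getTransceivers()
//   .find((t) => t.receiver.track && t.receiver.track.kind === 'video');
// transceiver.setCodecPreferences(preferCodec(codecs, 'video/VP8'));
```

If the black video disappears with VP8 first in the list, the problem is most likely in the H.264 path (profile mismatch or hardware decoding) rather than in the network.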

"Works in Chrome, broken in Safari"

Safari's WebRTC implementation has historically been the strictest about edge cases. Common Safari-specific gotchas: Safari requires a user gesture before getUserMedia in some contexts, Safari is more aggressive about closing data channels on tab backgrounding, and Safari's H.264 support has profile-level limits Chrome doesn't enforce. Use adapter.js (the widely used open-source shim) to normalise browser differences; it saves a surprising amount of debugging time.

Connection drops after a few minutes

Usually a NAT binding timing out. Most NATs drop UDP mappings after 30-60 seconds of idle. WebRTC sends keep-alives over the active ICE candidate pair, but if those keep-alives are blocked or the pair fails, the connection silently dies. Mitigations: use RTCPeerConnection.restartIce() on detection of a degraded connection, or pin to a TURN-relay candidate (iceTransportPolicy: 'relay') which has stable timing through your own server.
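The restartIce() mitigation can be wired to the connection state. The sketch below is one possible policy, not a recommendation of specific timings: watchIce, the 5-second grace period and the injectable timer functions are assumptions made for this example. restartIce() only flags that a restart is needed; the application still has to renegotiate, typically from its negotiationneeded handler.

```javascript
// Hypothetical watchdog: restarts ICE when the connection fails, or when it
// stays 'disconnected' for longer than graceMs. Timer functions are injectable
// so the policy can be tested without waiting.
function watchIce(pc, { graceMs = 5000, setTimer = setTimeout, clearTimer = clearTimeout } = {}) {
  let timer = null;
  const cancel = () => {
    if (timer !== null) {
      clearTimer(timer);
      timer = null;
    }
  };
  pc.oniceconnectionstatechange = () => {
    const state = pc.iceConnectionState;
    if (state === 'failed') {
      cancel();
      pc.restartIce();
    } else if (state === 'disconnected') {
      // Often transient: give the connection a moment to recover by itself.
      if (timer === null) {
        timer = setTimer(() => {
          timer = null;
          if (pc.iceConnectionState === 'disconnected') pc.restartIce();
        }, graceMs);
      }
    } else if (state === 'connected' || state === 'completed') {
      cancel();
    }
  };
  return cancel; // call this when the call ends
}
```

A short grace period avoids renegotiating on every brief Wi-Fi hiccup, while a transition to 'failed' triggers the restart immediately.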

Always check getStats(): peerConnection.getStats() returns detailed per-stream telemetry, including packet loss, jitter, round-trip time, the codec actually in use and the selected candidate pair. Most WebRTC mysteries can be solved by inspecting these stats during the failing call. chrome://webrtc-internals/ in Chrome and about:webrtc in Firefox give you live stats dashboards for free.
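As a starting point for reading the stats, the sketch below totals packets sent and received per media kind, which is usually enough to spot a one-way path. summariseRtp is a hypothetical helper; it relies on the standard 'inbound-rtp' and 'outbound-rtp' entries and their kind, packetsSent and packetsReceived fields, and accepts anything with a Map-style forEach, which the report returned by getStats() provides.

```javascript
// Hypothetical helper: totals RTP packet counters per media kind from a stats
// report (any object with a Map-style forEach).
function summariseRtp(report) {
  const summary = {
    audio: { sent: 0, received: 0 },
    video: { sent: 0, received: 0 }
  };
  report.forEach((stat) => {
    const bucket = summary[stat.kind];
    if (!bucket) return; // not an audio/video RTP entry
    if (stat.type === 'outbound-rtp') bucket.sent += stat.packetsSent || 0;
    if (stat.type === 'inbound-rtp') bucket.received += stat.packetsReceived || 0;
  });
  return summary;
}

// Browser usage (not executed here):
// const summary = summariseRtp(await peerConnection.getStats());
// console.table(summary);
```

Sampling this twice a few seconds apart is more telling than a single reading: a counter that stays at zero, or stops growing, points at the direction that is broken.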

WebRTC Integration FAQs

The questions our voice AI customers ask most often when building browser-based or mobile-app real-time features.

What is WebRTC in simple terms?

WebRTC (Web Real-Time Communication) is a free, open standard that lets browsers and mobile apps send audio, video and arbitrary data peer-to-peer with no plugins required. It has been built into every major browser since 2017. WebRTC handles the hard parts of real-time communications (codec negotiation, NAT traversal, encryption, packet loss recovery) and exposes them through a small JavaScript API: getUserMedia for capture, RTCPeerConnection for the call, RTCDataChannel for arbitrary data.

What is the difference between WebRTC and SIP?

SIP is a signalling protocol used between dedicated VoIP infrastructure — phones, PBXs, gateways, carriers. WebRTC is a complete browser-side real-time stack with no defined signalling protocol. SIP requires SIP infrastructure to work. WebRTC works in any browser with a few lines of JavaScript and a signalling channel of your choice (typically WebSocket carrying JSON). Most modern voice platforms run both: SIP between servers and the PSTN, WebRTC for browser endpoints, with a gateway translating between them.

Why does WebRTC need STUN and TURN?

Most internet endpoints sit behind NAT (Network Address Translation), which maps a private IP to a public IP. For two NATed peers to send each other media directly, they have to discover their public addresses (STUN) and sometimes route through a relay if direct connection fails (TURN). STUN is cheap — it just answers "what IP do you see me on?" TURN is expensive because it carries all the media traffic. ICE is the algorithm that uses STUN and TURN together to find the best path. Roughly 10-20% of WebRTC sessions need TURN; the rest connect directly via STUN-discovered addresses.
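
One way to see which path ICE found is to read the typ field of a candidate line. The sketch below classifies a candidate as host, srflx (discovered via STUN), prflx or relay (allocated on a TURN server); candidateType and usesTurn are hypothetical helper names, and the sample address in the usage note is from a documentation range.

```typescript
type CandidateType = "host" | "srflx" | "prflx" | "relay";

// Extracts the candidate type from an ICE candidate line, or null if absent.
//   host  = address of a local interface
//   srflx = public address discovered through STUN
//   prflx = address learned during connectivity checks
//   relay = address allocated on a TURN server
function candidateType(line: string): CandidateType | null {
  const match = / typ (host|srflx|prflx|relay)( |$)/.exec(line);
  return match ? (match[1] as CandidateType) : null;
}

// True when media on this candidate is carried by a TURN relay.
function usesTurn(line: string): boolean {
  return candidateType(line) === "relay";
}
```

For example, candidateType("candidate:1 1 udp 1677729535 203.0.113.7 46154 typ srflx raddr 192.168.1.5 rport 46154") yields "srflx", meaning the address was learned from a STUN server and no relay is involved.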

Does WebRTC have built-in signalling?

No — and this is by design. WebRTC standardised the media plane (codecs, encryption, NAT traversal) but deliberately left signalling to the application. You need to build or choose a signalling channel yourself: most apps use WebSocket carrying JSON messages; some use SIP-over-WebSocket (RFC 7118); others use long-poll HTTP, MQTT, or platform messaging APIs like Firebase Cloud Messaging. The advantage is flexibility — you fit the signalling to your app's authentication, presence and routing needs.
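
As an illustration of the WebSocket-plus-JSON approach, here is one possible message format. Nothing in it is standardised: the kind field, the four message types and the helper names encodeSignal and decodeSignal are all assumptions made for this sketch.

```typescript
// One possible set of signalling messages; the shape is an application choice.
type SignalMessage =
  | { kind: "offer"; sdp: string }
  | { kind: "answer"; sdp: string }
  | { kind: "candidate"; candidate: string; sdpMid: string | null }
  | { kind: "bye" };

// Serialises a message for sending over the signalling channel.
function encodeSignal(msg: SignalMessage): string {
  return JSON.stringify(msg);
}

// Parses and validates an incoming message; returns null for anything malformed.
function decodeSignal(raw: string): SignalMessage | null {
  let data: any;
  try {
    data = JSON.parse(raw);
  } catch (err) {
    return null;
  }
  if (data === null || typeof data !== "object") return null;

  if (data.kind === "offer" && typeof data.sdp === "string") {
    return { kind: "offer", sdp: data.sdp };
  }
  if (data.kind === "answer" && typeof data.sdp === "string") {
    return { kind: "answer", sdp: data.sdp };
  }
  if (data.kind === "candidate" && typeof data.candidate === "string") {
    const sdpMid = typeof data.sdpMid === "string" ? data.sdpMid : null;
    return { kind: "candidate", candidate: data.candidate, sdpMid: sdpMid };
  }
  if (data.kind === "bye") return { kind: "bye" };
  return null;
}
```

Validating on receipt matters because the signalling channel is the one part of the stack WebRTC does not protect for you; a real deployment would also attach authentication and room or session identifiers.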

What is the difference between SFU and MCU in WebRTC?

An SFU (Selective Forwarding Unit) is a server that receives one media stream from each participant and forwards copies to the others without decoding or mixing them. An MCU (Multipoint Control Unit) decodes all incoming streams, mixes them into a single composite stream, and sends one mixed stream to each participant. SFUs are cheaper and let clients render their own UI; MCUs use less client bandwidth but are CPU-expensive. For 2-50 participant calls SFU is almost always the right answer; beyond 50 the trade-offs get interesting and some systems use a hybrid.
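
The trade-off can be made concrete with some simple counting. The sketch below is an idealised model that assumes every participant sends exactly one stream and ignores simulcast and selective subscription; streamLoad is a hypothetical name.

```typescript
type Topology = "sfu" | "mcu";

interface StreamLoad {
  uplinkPerClient: number;   // streams each client sends
  downlinkPerClient: number; // streams each client receives
  serverDecodes: number;     // streams the server must decode
}

// Idealised stream counts for a call with the given number of participants.
function streamLoad(topology: Topology, participants: number): StreamLoad {
  if (topology === "sfu") {
    // The SFU forwards every other participant's stream untouched.
    return { uplinkPerClient: 1, downlinkPerClient: participants - 1, serverDecodes: 0 };
  }
  // The MCU decodes every incoming stream and returns one mixed stream.
  return { uplinkPerClient: 1, downlinkPerClient: 1, serverDecodes: participants };
}
```

For a 10-person call this gives 9 incoming streams per client and no server-side decoding with an SFU, against 1 incoming stream per client and 10 server-side decodes with an MCU, which is where the bandwidth-versus-CPU trade-off comes from.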

Is WebRTC encrypted by default?

Yes. WebRTC mandates encryption. All media flows over SRTP keyed by DTLS (DTLS-SRTP), and the keys are generated fresh for each session via a DTLS handshake performed inside the media path. There is no way to disable encryption in standards-compliant WebRTC implementations. The signalling channel you choose, however, is your responsibility to secure, typically with WSS or HTTPS. Separately, getUserMedia requires a secure context (HTTPS or localhost) on every modern browser, so the page hosting WebRTC code must be served securely or the API will refuse to grant camera/microphone access.

Can WebRTC connect to a regular phone number?

Not directly. Regular phone numbers live on the PSTN, which is reached through SIP trunks and traditional telephony signalling rather than anything a browser speaks natively. To connect a WebRTC browser endpoint to a phone number you need a gateway, usually a B2BUA that terminates WebRTC on one side and SIP on the other, then connects to a SIP trunk for PSTN access. This is exactly how "click to call" buttons on websites work: the browser opens WebRTC to a server, the server bridges to SIP, the SIP trunk reaches the phone. Team-Connect's voice AI uses this pattern routinely.

How do I integrate WebRTC into my application?

At a minimum: (1) call getUserMedia to capture the local camera/mic stream; (2) create an RTCPeerConnection and add the local stream's tracks to it with addTrack; (3) call createOffer, setLocalDescription, and send the resulting SDP to the remote peer via your signalling channel; (4) on the remote side, call setRemoteDescription with the offer, add the local tracks, then createAnswer, setLocalDescription, and send the answer back; (5) trickle ICE candidates between peers as they are gathered; (6) attach the remote stream to a video element when it arrives via the ontrack event. Add HTTPS, a STUN server, and ideally a TURN server for production, and you have a working WebRTC integration.
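
The offer/answer part of these steps (2 to 4) can be sketched as follows. The interfaces are trimmed-down stand-ins for RTCPeerConnection and MediaStream so that the call order can be exercised outside a browser; in a page you would use the built-in types instead. startCall, answerCall and the send callback are hypothetical names, and capture, ICE trickling and the ontrack handler are left out.

```typescript
// Trimmed-down stand-ins for the browser types used in the negotiation.
interface Description {
  type: string;
  sdp?: string;
}

interface MediaLike<T> {
  getTracks(): T[];
}

interface Peer<T> {
  addTrack(track: T, stream: MediaLike<T>): unknown;
  createOffer(): Promise<Description>;
  createAnswer(): Promise<Description>;
  setLocalDescription(desc: Description): Promise<void>;
  setRemoteDescription(desc: Description): Promise<void>;
}

// Hypothetical callback that delivers a description over the signalling channel.
type Send = (desc: Description) => void;

// Caller side: add local tracks, create the offer, apply it locally, send it.
async function startCall<T>(pc: Peer<T>, local: MediaLike<T>, send: Send): Promise<void> {
  local.getTracks().forEach((track) => pc.addTrack(track, local));
  const offer = await pc.createOffer();
  await pc.setLocalDescription(offer);
  send(offer);
}

// Callee side: apply the offer, add local tracks before answering so the
// answer covers both directions, then create, apply and send the answer.
async function answerCall<T>(
  pc: Peer<T>,
  local: MediaLike<T>,
  offer: Description,
  send: Send
): Promise<void> {
  await pc.setRemoteDescription(offer);
  local.getTracks().forEach((track) => pc.addTrack(track, local));
  const answer = await pc.createAnswer();
  await pc.setLocalDescription(answer);
  send(answer);
}
```

When the answer arrives back at the caller, it is applied with setRemoteDescription. Adding the callee's tracks before createAnswer is deliberate: adding them afterwards is a common cause of one-way media.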
