WebRTC: Powering Your Video Calls

Ever wondered what goes on behind apps like Google Meet or Zoom? How do they make it so easy for us to communicate through video, even when we're sitting on opposite sides of the planet?

I ran into this question while building an app for myself. That’s when I learned more about WebRTC: how it works, what goes on behind the scenes, what its architecture looks like, its drawbacks, and how to overcome them. In this article, I’ll give you a brief intro to WebRTC and its working principles, so read on for a look into the world of video calling.

WebRTC—What Is This?

WebRTC (Web Real-Time Communication) is an open standard, made up of a set of browser APIs and network protocols, that connects users directly so they can communicate, share files, stream video, and more.

But aren’t there other protocols for that? What’s so special about this one?

Right, there are many protocols. But WebRTC shines because of its ease of use, and the fact that you don’t need a dedicated server to monitor or relay the entire communication. Its media traffic usually travels over UDP.

Yes, you heard it right: in the common case, WebRTC doesn’t need a third-party or cloud-based server for the actual media transfer. It establishes a direct peer-to-peer connection between users, which also keeps latency low.

So Why Don’t We Use This Everywhere?

You might have the same question in your head. The reason is packet loss.

WebRTC sends its media over UDP, which is fast but not reliable: it doesn’t guarantee that all data packets will reach the destination. That’s fine for things like Zoom meetings or online games, where a small data loss (a freeze or glitch) doesn’t affect much.

But for messaging apps, we can’t afford to lose even a single packet—missing words or characters can change the entire meaning of a message. So for these cases, we use TCP, which ensures reliable and ordered delivery of data.

In short:

Use UDP/WebRTC for real-time media (video/audio).
Use TCP when all data must be received correctly (text, files).
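
To make the trade-off concrete, here is a small Python simulation. It is a toy, not real networking code: `LossyChannel`, `send_unreliable`, and `send_reliable` are names I made up for this sketch, and the channel simply drops the first attempt at whichever packets you tell it to.

```python
from typing import List, Set


class LossyChannel:
    """Toy network link that drops the first attempt at selected packets."""

    def __init__(self, drop_first_attempt: Set[int]) -> None:
        self._to_drop = set(drop_first_attempt)

    def transmit(self, seq: int) -> bool:
        """Return True if the packet with this sequence number arrives."""
        if seq in self._to_drop:
            self._to_drop.discard(seq)  # only the first attempt is lost
            return False
        return True


def send_unreliable(packets: List[str], channel: LossyChannel) -> List[str]:
    """UDP-like: fire and forget, so lost packets stay lost."""
    return [p for seq, p in enumerate(packets) if channel.transmit(seq)]


def send_reliable(packets: List[str], channel: LossyChannel) -> List[str]:
    """TCP-like: resend each packet until it gets through, in order."""
    delivered = []
    for seq, packet in enumerate(packets):
        while not channel.transmit(seq):
            pass  # a real stack would wait for a timeout before resending
        delivered.append(packet)
    return delivered


words = ["do", "not", "send", "money"]
print(send_unreliable(words, LossyChannel({1})))  # the word "not" goes missing
print(send_reliable(words, LossyChannel({1})))    # all four words arrive
```

Losing one word flips the meaning of the text message, which is exactly why chat needs the reliable path. A dropped video frame, on the other hand, is gone in a blink.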

How It Works – Architecture

Before diving in, let’s imagine a situation: You want to call your friend, family member, boss (to ask for leave 😁), or your wife (to plan a day out). To do that, you need their phone number. Only then can you call them.

Similarly, in WebRTC, you first need to know the IP address of the person you want to connect with. But before that, let’s understand public and private IP addresses.

Public and Private IP Addresses

Just like each SIM card has a unique number, every device needs a unique address on the internet to send and receive data.

With the rapid growth in internet users, assigning a unique public IP to every device isn't possible (due to IPv4 exhaustion). So instead, most devices are assigned private IP addresses by routers.

Public IP Address

This is what allows your device to access the internet.
It’s unique globally and exposed to the outer world.
Typically, only the router is assigned a public IP, and it communicates with the internet on behalf of all devices connected to it.

No two devices on the internet can have the same public IP.

Private IP Address

Assigned to devices inside a local network (by the router).
Can’t be reached directly from the internet, because private addresses aren’t routable there.
Mapped to the router’s public IP through a process called NAT (Network Address Translation).
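
If you want to check which kind of address you are looking at, Python's standard `ipaddress` module can tell you. This is just a quick sketch; `is_private` is a helper name I picked, and the sample addresses are arbitrary. Note that the module also reports a few other reserved ranges (loopback, for example) as private.

```python
import ipaddress


def is_private(address: str) -> bool:
    """Return True if the address is in a private or otherwise reserved range."""
    return ipaddress.ip_address(address).is_private


for addr in ["192.168.1.10", "10.0.0.5", "172.16.4.20", "8.8.8.8"]:
    print(addr, "private" if is_private(addr) else "public")
```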

So how do you surf the internet with a private IP?

Here’s how:

When your device makes a request (e.g., opens a website), it goes to the router.
The router forwards this request using its public IP.
When the response comes back, the router knows which device made the request and forwards the data to it.
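
The three steps above can be sketched as a tiny translation table. This is a simplified model under my own assumptions, not how any particular router is implemented: `NatRouter` is a hypothetical class, the port numbers are arbitrary, and `203.0.113.7` is a documentation-only address.

```python
from typing import Dict, Tuple

Endpoint = Tuple[str, int]  # (ip, port)


class NatRouter:
    """Toy NAT: maps each private endpoint to a port on one public IP."""

    def __init__(self, public_ip: str, first_port: int = 40000) -> None:
        self.public_ip = public_ip
        self._next_port = first_port
        self._out: Dict[Endpoint, int] = {}   # private endpoint -> public port
        self._back: Dict[int, Endpoint] = {}  # public port -> private endpoint

    def outbound(self, source: Endpoint) -> Endpoint:
        """Rewrite a private source endpoint to the router's public one."""
        if source not in self._out:
            self._out[source] = self._next_port
            self._back[self._next_port] = source
            self._next_port += 1
        return (self.public_ip, self._out[source])

    def inbound(self, public_port: int) -> Endpoint:
        """Find which private device a response on this port belongs to."""
        return self._back[public_port]


router = NatRouter("203.0.113.7")
laptop = ("192.168.1.10", 51000)
phone = ("192.168.1.11", 51000)
print(router.outbound(laptop))  # laptop's request leaves from the public IP
print(router.outbound(phone))   # phone gets a different public port
print(router.inbound(40001))    # the response on port 40001 goes to the phone
```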

STUN, TURN, and ICE

Now that you know about private and public IPs, let’s talk about STUN and TURN servers and the ICE framework.

After a device gets its private IP, it still doesn’t know its public IP, which is needed to establish a connection. That’s where these come in:

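A STUN server is a simple service that sees the public IP and port of an incoming request and sends them back to the client.
This way, the client becomes aware of its public address.
A TURN server goes further: when a direct connection can’t be established (for example, behind a strict NAT or firewall), it relays the media between the two peers.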
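
To show what "sends it back" looks like on the wire, here is a sketch of the STUN message layout as I understand it from the STUN spec: a 20-byte header, plus an XOR-MAPPED-ADDRESS attribute in the reply. No network is involved. `build_fake_response` is a stand-in I wrote for a real server, and the address and port in the usage line are made up. It handles IPv4 only.

```python
import os
import socket
import struct

MAGIC_COOKIE = 0x2112A442
BINDING_REQUEST = 0x0001
BINDING_SUCCESS = 0x0101
XOR_MAPPED_ADDRESS = 0x0020


def build_binding_request(transaction_id: bytes) -> bytes:
    """20-byte STUN header with no attributes."""
    assert len(transaction_id) == 12
    return struct.pack("!HHI", BINDING_REQUEST, 0, MAGIC_COOKIE) + transaction_id


def build_fake_response(transaction_id: bytes, ip: str, port: int) -> bytes:
    """Stand-in for the reply a STUN server would send (IPv4 only)."""
    x_port = port ^ (MAGIC_COOKIE >> 16)
    x_addr = struct.unpack("!I", socket.inet_aton(ip))[0] ^ MAGIC_COOKIE
    value = struct.pack("!BBHI", 0, 0x01, x_port, x_addr)  # 0x01 = IPv4
    attr = struct.pack("!HH", XOR_MAPPED_ADDRESS, len(value)) + value
    header = struct.pack("!HHI", BINDING_SUCCESS, len(attr), MAGIC_COOKIE)
    return header + transaction_id + attr


def parse_mapped_address(response: bytes):
    """Extract (ip, port) from XOR-MAPPED-ADDRESS, or None if absent."""
    msg_type, length, cookie = struct.unpack("!HHI", response[:8])
    if msg_type != BINDING_SUCCESS or cookie != MAGIC_COOKIE:
        return None
    offset, end = 20, 20 + length
    while offset + 4 <= end:
        attr_type, attr_len = struct.unpack("!HH", response[offset:offset + 4])
        value = response[offset + 4:offset + 4 + attr_len]
        if attr_type == XOR_MAPPED_ADDRESS and value[1] == 0x01:
            x_port, x_addr = struct.unpack("!HI", value[2:8])
            ip = socket.inet_ntoa(struct.pack("!I", x_addr ^ MAGIC_COOKIE))
            return ip, x_port ^ (MAGIC_COOKIE >> 16)
        offset += 4 + attr_len + (-attr_len % 4)  # attributes are 4-byte aligned
    return None


tid = os.urandom(12)
request = build_binding_request(tid)
reply = build_fake_response(tid, "203.0.113.7", 54321)
print(len(request), parse_mapped_address(reply))
```

In a real client you would send `request` over UDP to a STUN server and parse whatever comes back. The XOR step is there so that middleboxes that rewrite raw IP addresses in packets don't mangle the answer.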

ICE (Interactive Connectivity Establishment) is a framework that uses STUN and TURN to find the best route between two peers.
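
One way to picture how ICE prefers one route over another: every candidate address gets a priority number. The sketch below uses the formula and the type preferences recommended in the ICE spec, as far as I understand it; real implementations are free to tune these values, and the candidate addresses here are made up. ICE also runs connectivity checks on candidate pairs before settling on one, which this sketch leaves out.

```python
# Recommended type preferences: direct (host) beats STUN-discovered
# (server reflexive, "srflx"), which beats a TURN relay.
TYPE_PREFERENCE = {"host": 126, "srflx": 100, "relay": 0}


def candidate_priority(cand_type: str,
                       local_preference: int = 65535,
                       component_id: int = 1) -> int:
    """priority = 2^24 * type pref + 2^8 * local pref + (256 - component id)"""
    return ((TYPE_PREFERENCE[cand_type] << 24)
            + (local_preference << 8)
            + (256 - component_id))


candidates = [
    ("relay", "198.51.100.20:3478"),  # via a TURN server
    ("host", "192.168.1.10:51000"),   # the device's own private address
    ("srflx", "203.0.113.7:40000"),   # public address learned through STUN
]
ranked = sorted(candidates, key=lambda c: candidate_priority(c[0]), reverse=True)
for cand_type, address in ranked:
    print(candidate_priority(cand_type), cand_type, address)
```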

Signaling Server

At this point, each user has the necessary information to establish a WebRTC connection, like codecs, IPs, ports, etc. This data is packaged in a session description, written in a text format called SDP (Session Description Protocol).
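
To give a feel for what an SDP looks like, here is a heavily trimmed, hand-written sample and a few lines of Python that pull the codecs out of it. A real session description from a browser is much longer; the sample below and the `codecs_by_media` helper are illustrations of mine, not output from any actual call.

```python
from typing import Dict, List, Optional

# Hand-written sample: "m=" lines start a media section,
# "a=rtpmap" lines name the codec behind each payload type.
SAMPLE_SDP = """\
v=0
o=- 4611731400430051336 2 IN IP4 127.0.0.1
s=-
t=0 0
m=audio 9 UDP/TLS/RTP/SAVPF 111
a=rtpmap:111 opus/48000/2
m=video 9 UDP/TLS/RTP/SAVPF 96
a=rtpmap:96 VP8/90000
"""


def codecs_by_media(sdp: str) -> Dict[str, List[str]]:
    """Map each media kind (audio, video, ...) to the codec names it offers."""
    result: Dict[str, List[str]] = {}
    current: Optional[str] = None
    for line in sdp.splitlines():
        if line.startswith("m="):
            current = line[2:].split()[0]
            result[current] = []
        elif line.startswith("a=rtpmap:") and current is not None:
            # Format: a=rtpmap:<payload type> <codec>/<clock rate>[/<channels>]
            codec = line.split(" ", 1)[1].split("/")[0]
            result[current].append(codec)
    return result


print(codecs_by_media(SAMPLE_SDP))
```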

But here's the catch:

You don’t have the other user’s SDP.
They don’t have yours.

So, we need a signaling server to exchange this information. Don’t worry—it’s a simple server that:

Takes the SDP of User A and sends it to User B.
Takes the SDP of User B and sends it to User A.

Now both users have each other’s details and can establish a direct WebRTC connection. After that, the media flows between the two users and doesn’t pass through the signaling server.
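
Here is that exchange as a tiny in-memory simulation. In a real app the relay would typically sit behind something like WebSockets, and the SDP would come from the browser; the `SignalingServer` and `Peer` classes, the message shape, and the placeholder SDP strings are all assumptions of this sketch.

```python
from typing import Callable, Dict, Optional

Handler = Callable[[str, dict], None]  # called with (sender id, message)


class SignalingServer:
    """Toy relay: it forwards messages and never touches the media."""

    def __init__(self) -> None:
        self._handlers: Dict[str, Handler] = {}

    def register(self, user_id: str, handler: Handler) -> None:
        self._handlers[user_id] = handler

    def send(self, sender: str, recipient: str, message: dict) -> None:
        self._handlers[recipient](sender, message)


class Peer:
    def __init__(self, user_id: str, server: SignalingServer) -> None:
        self.user_id = user_id
        self.server = server
        self.local_sdp = f"sdp-of-{user_id}"  # placeholder, not real SDP
        self.remote_sdp: Optional[str] = None
        server.register(user_id, self.on_message)

    def call(self, other_id: str) -> None:
        """Start a call by sending our description as an offer."""
        self.server.send(self.user_id, other_id,
                         {"type": "offer", "sdp": self.local_sdp})

    def on_message(self, sender: str, message: dict) -> None:
        self.remote_sdp = message["sdp"]
        if message["type"] == "offer":
            # Reply with an answer so the caller learns our description too.
            self.server.send(self.user_id, sender,
                             {"type": "answer", "sdp": self.local_sdp})


server = SignalingServer()
alice, bob = Peer("alice", server), Peer("bob", server)
alice.call("bob")
print(alice.remote_sdp, bob.remote_sdp)
```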

Drawbacks

Since WebRTC is peer-to-peer, each connection links exactly two users. You can’t host large video conferences using just this architecture.

This is perfect for apps like Omegle, where only two users need to connect. But for more users, we need other methods.

Mesh Topology (Not Ideal)

In mesh, each user connects directly to every other user.

Example: In a 4-user call, each user maintains 3 separate connections.

This quickly becomes:

Bandwidth-heavy
Hard to scale
Difficult to keep all streams in sync

That’s why mesh is rarely used in production apps.
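
A bit of arithmetic shows how fast mesh grows. With n users there are n(n-1)/2 connections in total, and every user uploads their stream n-1 times. The function names below are mine, picked for this sketch.

```python
def mesh_total_connections(users: int) -> int:
    """Every pair of users needs its own connection: n * (n - 1) / 2."""
    return users * (users - 1) // 2


def mesh_uploads_per_user(users: int) -> int:
    """Each user sends a separate copy of their stream to everyone else."""
    return users - 1


for n in (2, 4, 10, 50):
    print(n, "users:", mesh_total_connections(n), "connections,",
          mesh_uploads_per_user(n), "uploads per user")
```

At 4 users that's 6 connections, which is manageable. At 50 users it's 1,225 connections, with each device uploading its video 49 times.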

Solution – SFU (Selective Forwarding Unit) and MCU (Multipoint Control Unit)

To solve the issue of scalability in WebRTC for group calls, we use SFU (Selective Forwarding Unit) or MCU (Multipoint Control Unit) as per the requirement. In both cases, we introduce a server into the architecture.

In these setups, a server acts like a virtual client, and each user opens a single peer connection to this server instead of to every other user. This server handles the media streams from all users. In the case of an SFU, it simply forwards the streams to the other users. In an MCU, it combines all the streams into one and sends it back. This makes the data easier to manage, helps keep it in sync, and lets the system scale to support many users at once. Server-based designs along these lines are how big platforms like Google Meet, Zoom, Discord, and Microsoft Teams manage large group calls.

SFU (Selective Forwarding Unit)

In SFU, each user sends their video/audio stream to the server, and the server’s job is just to forward those streams to the other users. It doesn’t mix or edit the media—it just acts like a smart postman.

This way, each user only needs to upload their stream once, and the server takes care of distributing it to everyone else. Since the server doesn’t decode or re-encode the media, this setup is lightweight and scalable.

But there’s a catch. Each user still receives multiple streams—one from each participant—so the client device has to handle that load. Still, it’s much better than mesh topology, where everyone connects with everyone directly.
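
The "smart postman" idea fits in a few lines. This is a toy model with made-up names (`Sfu`, `join`, `publish`), where packets are plain strings; a real SFU forwards encrypted RTP packets and makes smarter choices about which streams and qualities to send to whom.

```python
from typing import Dict, List, Tuple


class Sfu:
    """Toy forwarding server: one upload in, copies out to everyone else."""

    def __init__(self) -> None:
        self._inboxes: Dict[str, List[Tuple[str, str]]] = {}

    def join(self, user_id: str) -> None:
        self._inboxes[user_id] = []

    def publish(self, sender: str, packet: str) -> None:
        """Forward the packet untouched to every participant but the sender."""
        for user_id, inbox in self._inboxes.items():
            if user_id != sender:
                inbox.append((sender, packet))

    def inbox(self, user_id: str) -> List[Tuple[str, str]]:
        return self._inboxes[user_id]


sfu = Sfu()
for name in ("alice", "bob", "carol"):
    sfu.join(name)
for name in ("alice", "bob", "carol"):
    sfu.publish(name, f"frame-from-{name}")  # each user uploads only once

print(sfu.inbox("alice"))  # alice receives one stream per other participant
```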

That’s why SFU is the go-to choice for group video calls in real-world apps. It's efficient, requires less server power, and performs well even with 10, 20, or more users.

MCU (Multipoint Control Unit)

In this setup, things work a bit differently. All users send their streams to the server, just like in SFU. But here, the server does a lot more—it mixes all the streams into one combined video (usually a grid of faces) and sends that single stream back to every user.

This keeps everyone’s audio and video in sync, since it’s all mixed in one place, and makes life easier for the client. Each device only needs to handle one incoming stream. But this also means the server is doing all the heavy lifting (mixing, encoding, and streaming), which requires a lot more processing power and resources.

MCU setups are great for situations like webinars or classrooms, where it’s more important for everything to be in sync, and where users don’t interact much.
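
To put the three approaches side by side, here is a small sketch that counts how many streams each user's device has to upload and download. It only counts streams, one per participant, and ignores things like resolution and bitrate; `streams_per_user` is a name I chose for this example.

```python
from typing import Dict


def streams_per_user(topology: str, users: int) -> Dict[str, int]:
    """Count the streams one client uploads and downloads in a call."""
    others = users - 1
    if topology == "mesh":
        return {"upload": others, "download": others}
    if topology == "sfu":
        return {"upload": 1, "download": others}
    if topology == "mcu":
        return {"upload": 1, "download": 1}
    raise ValueError(f"unknown topology: {topology}")


for topology in ("mesh", "sfu", "mcu"):
    print(topology, streams_per_user(topology, users=10))
```

For a 10-person call, mesh has every client pushing 9 uploads, an SFU cuts that to 1 upload with 9 downloads, and an MCU brings it down to 1 and 1, at the cost of a much busier server.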

Conclusion

With that, this article comes to an end. I hope you learned something new—because this was new to me too, and I thought, why not share my understanding with others?

Thanks for reading, and I hope you enjoyed the article.

Till we meet again—seeya, have a great day, keep learning, peace out.

Signing off.
