
Building Nexus RTC: Why I'm Building a Self-Hosted Agora Alternative from Scratch

Feb 2025 · 18 min read · Asif Ahsan

Agora RTC is excellent — until you look at the pricing at scale. $15–30K/month for a growing startup isn't viable. So I started building Nexus RTC: a production-grade, self-hosted real-time audio/video SDK platform with cross-platform SDKs, a C++ media engine, cloud recording, and QoE analytics.

In late 2024, while scoping a real-time feature for a client's platform, I priced out Agora RTC for 50,000 daily active users averaging 20 minutes of call time. The number came back at around $18,000/month. For a Series A startup, that's a significant infrastructure line item — one that only grows as the product succeeds.

I started researching self-hosted alternatives. LiveKit is excellent as an SFU. Mediasoup gives you raw building blocks. Jitsi works for meetings. But none of them give you what Agora does: a complete platform with polished cross-platform SDKs (iOS, Android, Web, Flutter, React Native, Unity), cloud recording, live transcription, QoE analytics, and a CDN-backed delivery network. To build that on top of LiveKit would take a team months.

That's the gap Nexus RTC is filling. This post is about the architecture — what I'm building, the decisions I've made, and the hard problems I'm still solving.

What Makes WebRTC Hard in Production

WebRTC has been around since 2011. The browser API is reasonably well-documented. Most developers can get a two-peer video call working in a weekend. Production is a completely different animal.

NAT Traversal

The majority of devices on the internet are behind NAT — home routers, corporate firewalls, mobile carrier NAT. Two devices behind NAT can't connect to each other directly. WebRTC uses ICE (Interactive Connectivity Establishment) to punch through NAT, which involves STUN servers for address discovery and TURN servers as relay fallbacks. In my testing, about 15% of connections require TURN relay. TURN traffic is expensive to run at scale.
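As a concrete sketch, here is the shape of an ICE configuration a client might be handed, with STUN listed first and TURN as the relay fallback. The server URLs and credential plumbing are placeholder assumptions, not Nexus's actual endpoints.

```typescript
// Illustrative ICE configuration: STUN for address discovery, TURN as
// relay fallback. URLs and credentials here are placeholders.
interface IceServer {
    urls: string | string[];
    username?: string;
    credential?: string;
}

function buildIceConfig(turnUser: string, turnPass: string): { iceServers: IceServer[] } {
    return {
        iceServers: [
            // STUN: lets the client discover its public address (cheap)
            { urls: "stun:stun.example.com:3478" },
            // TURN: relays media when direct connectivity fails; listed
            // last because relayed traffic is the expensive path
            {
                urls: [
                    "turn:turn.example.com:3478?transport=udp",
                    "turn:turn.example.com:443?transport=tcp",
                ],
                username: turnUser,
                credential: turnPass,
            },
        ],
    };
}

// In the browser this config would be passed straight to:
// new RTCPeerConnection(buildIceConfig(user, pass))
```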

Scalability: P2P Doesn't Scale

P2P WebRTC works for 1:1 calls. For group calls, it breaks down fast. If you have 8 people in a call and every participant sends to every other participant, each person is uploading 7 streams simultaneously. At 1080p that's catastrophic. The solution is an SFU (Selective Forwarding Unit) — a server that receives streams from each participant and selectively forwards them to others. The SFU doesn't decode or re-encode; it just routes packets. This is why LiveKit, Mediasoup, and Janus exist.
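To make the mesh blowup concrete, here is the back-of-envelope upload math, a sketch with an assumed per-stream bitrate:

```typescript
// Upload bandwidth per participant, mesh (P2P) vs SFU. Assumes every
// participant sends one stream at `kbpsPerStream`.
function uploadKbps(
    participants: number,
    kbpsPerStream: number,
    topology: "p2p" | "sfu",
): number {
    if (topology === "p2p") {
        // Mesh: you upload a separate copy of your stream to every peer
        return (participants - 1) * kbpsPerStream;
    }
    // SFU: you upload exactly one stream; the server fans it out
    return kbpsPerStream;
}

// 8-person call at an assumed ~2500 kbps (roughly 720p):
// p2p → 7 × 2500 = 17500 kbps of upload per person; sfu → 2500 kbps
```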

Codec Negotiation

Not every device supports the same codecs. VP8, VP9, H.264, H.265, AV1 — each has different hardware acceleration profiles, patent status, and browser support. WebRTC's SDP negotiation handles this, but you need to be careful about which codecs you prioritize for different device classes. On flagship Android phones, AV1 encoding is hardware-accelerated and gives the best quality per bit. On iOS, H.264 is the battle-tested choice. Getting this wrong wastes bandwidth or CPU.
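A sketch of what platform-aware codec prioritization can look like. The device classes and orderings below are illustrative assumptions based on the heuristics above, not the SDK's actual tables:

```typescript
// Pick a codec preference order by device class. Orderings here are
// illustrative, following the heuristics described in the text.
type Codec = "av1" | "vp9" | "h264" | "vp8";

function preferredCodecs(
    platform: "android-flagship" | "ios" | "web-default",
): Codec[] {
    switch (platform) {
        case "android-flagship":
            // Hardware AV1 encode available: best quality per bit
            return ["av1", "vp9", "h264", "vp8"];
        case "ios":
            // H.264 has mature hardware support across Apple devices
            return ["h264", "vp8"];
        default:
            // Conservative baseline for unknown browsers
            return ["vp8", "h264"];
    }
}

// In the browser, an ordering like this would be applied with
// RTCRtpTransceiver.setCodecPreferences() before SDP negotiation.
```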

The Architecture

Nexus RTC is built in layers:

text
┌─────────────────────────────────────────────────────┐
│                    SDK Layer                        │
│ Web · iOS · Android · Flutter · React Native · Unity│
└────────────────────────┬────────────────────────────┘
                         │ WebSocket (Signaling)
                         │ DTLS-SRTP  (Media)
┌────────────────────────▼────────────────────────────┐
│              Signaling Server (Node.js)             │
│   Session management · Room state · ICE exchange   │
└────────────────────────┬────────────────────────────┘
                         │
┌────────────────────────▼────────────────────────────┐
│              SFU Layer (LiveKit)                    │
│   Audio/video routing · Simulcast · Bandwidth est. │
└───────────┬────────────────────────────┬────────────┘
            │                            │
┌───────────▼──────────┐   ┌────────────▼────────────┐
│  Recording Pipeline  │   │   QoE Analytics         │
│  (FFmpeg + S3)       │   │   (ClickHouse)          │
└──────────────────────┘   └─────────────────────────┘

The Signaling Server

The signaling server handles session lifecycle — creating rooms, authenticating participants, exchanging ICE candidates and SDP offers/answers between peers. I built it in Node.js using WebSockets (ws library) with Redis for session state so it's horizontally scalable.

typescript
// Simplified signaling flow
class SignalingServer {
    async handleOffer(ws: WebSocket, payload: OfferPayload) {
        const { roomId, participantId, sdp } = payload;

        // Auth (JWT validation elided here); confirm the room exists
        const room = await this.roomStore.get(roomId);
        if (!room) throw new SignalingError("ROOM_NOT_FOUND");

        // Forward SDP offer to LiveKit SFU via gRPC
        const answer = await this.livekitClient.publishTrack({
            roomName: roomId,
            participantIdentity: participantId,
            offer: sdp,
        });

        // Send answer back to SDK
        ws.send(JSON.stringify({
            type: "answer",
            sdp: answer.sdp,
            iceServers: this.getIceServers(room.region),
        }));

        // Emit participant joined event for QoE tracking
        await this.analytics.emit("participant_joined", {
            roomId, participantId,
            timestamp: Date.now(),
            region: room.region,
        });
    }
}
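handleOffer is one arm of a larger message protocol. Here is a sketch of how the full message set can be modeled as a discriminated union with a single dispatcher; the message shapes are assumptions, not the actual Nexus wire format:

```typescript
// Assumed signaling message shapes, modeled as a discriminated union so
// the compiler checks each handler against its payload.
type SignalMessage =
    | { type: "join"; roomId: string; token: string }
    | { type: "offer"; roomId: string; participantId: string; sdp: string }
    | { type: "ice-candidate"; roomId: string; candidate: string }
    | { type: "leave"; roomId: string };

// Parse once, route by discriminant; keeps per-message handlers small
// and independently testable.
function routeMessage(
    raw: string,
    handlers: Record<SignalMessage["type"], (m: SignalMessage) => void>,
): SignalMessage["type"] {
    const msg = JSON.parse(raw) as SignalMessage;
    const handler = handlers[msg.type];
    if (!handler) throw new Error(`UNKNOWN_MESSAGE_TYPE: ${msg.type}`);
    handler(msg);
    return msg.type;
}
```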

The Web SDK

The Web SDK wraps the browser's WebRTC APIs and the signaling protocol behind a clean developer interface. The goal: an Agora-like DX where you join a room and publish a track in under 10 lines of code.

typescript
import { NexusClient } from "@nexus-rtc/web";

const client = new NexusClient({
    appId: "YOUR_APP_ID",
    token: await fetchToken(userId, roomId), // JWT from your server
});

const room = await client.join(roomId);

// Publish local camera
const cameraTrack = await room.createCameraVideoTrack({
    resolution: "720p",
    codec: "h264", // auto-detected if omitted
});
await room.publish(cameraTrack);

// Subscribe to remote participants
room.on("participant-joined", (participant) => {
    participant.on("track-published", (track) => {
        const element = document.createElement("video");
        track.attach(element);
        document.body.appendChild(element);
    });
});

Cross-Platform SDK Strategy

Writing six SDKs (Web, iOS, Android, Flutter, React Native, Unity) from scratch would take years. My strategy is layered: write core WebRTC logic once in C++ using libwebrtc, then wrap it with thin platform-native bindings. The C++ core handles codec selection, bitrate adaptation, packet loss concealment, and jitter buffer management — the hard stuff that needs to be consistent across platforms.

  • C++ Core: libwebrtc, codec management, network adaptation, audio processing
  • iOS: Swift wrapper using Objective-C bridge — exposes a Swift-idiomatic API
  • Android: Kotlin wrapper using JNI — follows Android Jetpack patterns
  • Flutter: Dart plugin using platform channels to the native iOS/Android wrappers
  • React Native: Native module for iOS + Android, with a clean JS/TS API on top
  • Web: TypeScript SDK using browser WebRTC APIs directly (no C++ needed)

QoE Analytics with ClickHouse

Call quality measurement is a first-class feature in Nexus RTC, not an afterthought. Every second, each SDK emits a telemetry packet: RTT, packet loss, jitter, audio level, video resolution, frame rate, codec, and estimated bandwidth. These land in ClickHouse — a columnar database designed for high-volume time-series analytical workloads.

Why ClickHouse instead of something like TimescaleDB or InfluxDB? ClickHouse can ingest 1 million rows per second on modest hardware and query billions of rows in seconds. For a platform where 10,000 concurrent sessions each emit 1 packet/second, that's 10,000 rows/second at baseline. ClickHouse handles it without breaking a sweat.
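A sketch of the per-second telemetry packet and the ingest arithmetic; the field names are illustrative assumptions, not the production schema:

```typescript
// One telemetry packet per participant per second (assumed field names).
interface QoePacket {
    roomId: string;
    participantId: string;
    timestamp: number;       // ms since epoch
    rttMs: number;
    packetLossPct: number;
    jitterMs: number;
    audioLevel: number;      // 0.0 to 1.0
    videoHeight: number;     // e.g. 720
    fps: number;
    codec: string;
    estBandwidthKbps: number;
}

// Baseline ingest volume at a given concurrency, one packet per second:
function ingestRate(concurrentSessions: number) {
    const rowsPerSecond = concurrentSessions;   // 1 row per session per second
    const rowsPerDay = rowsPerSecond * 86_400;  // seconds in a day
    return { rowsPerSecond, rowsPerDay };
}

// 10,000 sessions → 10,000 rows/s, 864,000,000 rows/day
```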

sql
-- Example: find participants with degraded quality in the last hour.
-- Aggregate over all of each participant's packets first, then filter
-- on the aggregates (HAVING) so the averages aren't biased toward
-- only the already-degraded samples.
SELECT
    room_id,
    participant_id,
    avg(packet_loss_pct)    AS avg_loss,
    avg(rtt_ms)             AS avg_rtt,
    avg(jitter_ms)          AS avg_jitter,
    min(video_resolution)   AS min_resolution
FROM qoe_events
WHERE timestamp >= now() - INTERVAL 1 HOUR
GROUP BY room_id, participant_id
HAVING avg_loss > 5 OR avg_rtt > 300
ORDER BY avg_loss DESC
LIMIT 100

Kubernetes Infrastructure

The infrastructure is defined in Terraform and deployed on DigitalOcean managed Kubernetes. The SFU is the most resource-intensive component — it processes and forwards all media packets. I use Kubernetes HPA (Horizontal Pod Autoscaler) to scale SFU instances based on CPU and a custom metric: active WebRTC tracks per pod.

One non-obvious challenge: SFU pods are stateful. When an SFU pod scales down, you can't just kill it, because active rooms are running on it. The scale-down process drains a pod: it stops accepting new rooms, waits for existing rooms to end, then terminates. This graceful drain is implemented as a Kubernetes preStop lifecycle hook with a 5-minute timeout.
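The drain loop itself is simple to sketch. The Sfu interface below is a stand-in for illustration, not LiveKit's actual API:

```typescript
// Graceful-drain loop a preStop hook would trigger: stop accepting new
// rooms, then poll until active rooms reach zero or the timeout expires.
interface Sfu {
    stopAcceptingRooms(): void;
    activeRoomCount(): number;
}

async function drain(sfu: Sfu, timeoutMs: number, pollMs = 1000): Promise<boolean> {
    sfu.stopAcceptingRooms();
    const deadline = Date.now() + timeoutMs;
    while (Date.now() < deadline) {
        if (sfu.activeRoomCount() === 0) return true; // clean drain
        await new Promise((resolve) => setTimeout(resolve, pollMs));
    }
    return false; // timed out; Kubernetes will terminate the pod anyway
}
```

Returning false rather than throwing keeps the hook well-behaved: the pod is being killed either way, so the only useful signal is whether the drain completed cleanly.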

What's Hard, What's Left

The hardest remaining problem is end-to-end encryption. Agora, Zoom, and Google Meet all offer E2EE modes now. Implementing E2EE over WebRTC means encrypting each encoded frame before it leaves the client, using Insertable Streams (encoded transforms) in Chromium-based browsers or RTCRtpScriptTransform in Firefox and Safari. The SFU then forwards encrypted packets it can't read, which breaks server-side recording and transcription. Finding the right tradeoff between privacy and features is an open design question.
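The per-frame hook looks roughly like this. XOR with a keystream stands in for the real cipher purely to mark where encryption plugs in; an actual deployment would use SFrame-style AES-GCM with key rotation:

```typescript
// Illustrative per-frame transform for E2EE. XOR is NOT real encryption;
// it marks where the cipher sits. In Chromium this function would run
// inside an encoded-streams TransformStream; in Firefox/Safari, inside
// an RTCRtpScriptTransform worker.
function transformFrame(payload: Uint8Array, keystream: Uint8Array): Uint8Array {
    const out = new Uint8Array(payload.length);
    for (let i = 0; i < payload.length; i++) {
        out[i] = payload[i] ^ keystream[i % keystream.length];
    }
    return out; // the SFU forwards this without being able to read it
}

// XOR is its own inverse, so the receiver applies the same transform:
// transformFrame(transformFrame(frame, key), key) recovers the frame.
```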

Nexus RTC is still in active development. The Web SDK and signaling server are working. The Android SDK is in alpha. If you're building a product that needs WebRTC at scale and you'd rather own your infrastructure, I'd love to talk.

Asif Ahsan
Senior Software Engineer · Dhaka, Bangladesh

Full-stack engineer with 8+ years building scalable products across web, mobile, and cloud. Currently building Gunti, Nexus RTC, and Nagorik.