Managing the World's Largest Server: Discord's Approach


This post summarizes how Discord managed the challenges posed by the Midjourney server, which had over two million active users.
The Problem: The influx of users due to Midjourney caused performance issues as the existing architecture struggled to handle the massive amount of real-time data. The traditional chat application architecture of sending all updates to all users became a bottleneck.
Initial Architecture: Discord used Elixir, a programming language known for fault tolerance and distributed applications. A single Elixir process sat between clients and the Discord API, fanning out updates. However, this system scaled quadratically: every additional user both generated more updates and had to receive everyone else's, which led to performance issues.
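To make the quadratic growth concrete, here is a rough back-of-the-envelope model in Python. It is an illustration only, not Discord's code: it assumes every user emits a fixed number of events (the hypothetical `events_per_user` parameter) and that each event is delivered to every connected session.

```python
def fanout_deliveries(num_users: int, events_per_user: int = 1) -> int:
    """Count deliveries when every event is sent to every connected session.

    Assumption (illustrative only): each user emits `events_per_user` events,
    and each event is fanned out to all `num_users` sessions.
    """
    total_events = num_users * events_per_user
    return total_events * num_users


if __name__ == "__main__":
    for n in (1_000, 10_000, 100_000):
        print(f"{n:>7} users -> {fanout_deliveries(n):,} deliveries")
```

Under these assumptions, multiplying the user count by 10 multiplies the number of deliveries by 100.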
Fixes: Passive User Identification: Discord engineers realized that over 90% of users were passive (online but not actively participating in the chat). They stopped sending real-time updates to these users, significantly reducing the load on the system.
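The idea of skipping passive users can be sketched as a filter in the fan-out loop. This is a minimal Python sketch, not Discord's implementation: the `Session` class, its `active` flag, and the one-in-ten ratio in `make_sessions` are assumptions chosen to mirror the "over 90% passive" figure.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Session:
    user_id: int
    active: bool  # hypothetical flag: True if the user is engaging with the server
    inbox: list = field(default_factory=list)


def fan_out(sessions: list[Session], event: str) -> int:
    """Deliver `event` only to active sessions; return the number of deliveries."""
    delivered = 0
    for session in sessions:
        if session.active:
            session.inbox.append(event)
            delivered += 1
    return delivered


def make_sessions(total: int, active_every: int = 10) -> list[Session]:
    """Build `total` sessions where one in `active_every` is active."""
    return [Session(user_id=i, active=(i % active_every == 0)) for i in range(total)]


if __name__ == "__main__":
    sessions = make_sessions(1_000)
    print(fan_out(sessions, "new message"), "deliveries for", len(sessions), "sessions")
```

With one active session in ten, each event costs a tenth of the deliveries it would cost if it were sent to everyone.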
They also introduced "relays," smaller Elixir instances placed between user sessions and the main Elixir process. Each relay managed a smaller pool of sessions, distributing the load. The fan-out logic was refactored into a library that ran on both the main Elixir instance and the relays.
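The relay layout amounts to a two-level fan-out, which the following Python sketch illustrates. The class names (`Relay`, `MainProcess`) and the modulo rule used to assign sessions to relays are hypothetical; the source does not describe how Discord assigns sessions.

```python
from __future__ import annotations


class Relay:
    """Owns a small pool of sessions and performs the last hop of the fan-out."""

    def __init__(self) -> None:
        self.inboxes: dict[int, list[str]] = {}

    def add_session(self, session_id: int) -> None:
        self.inboxes[session_id] = []

    def publish(self, event: str) -> None:
        # The relay, not the main process, touches every session in its pool.
        for inbox in self.inboxes.values():
            inbox.append(event)


class MainProcess:
    """Sends each event once per relay instead of once per session."""

    def __init__(self, num_relays: int) -> None:
        self.relays = [Relay() for _ in range(num_relays)]
        self.sends = 0  # messages sent by the main process itself

    def connect(self, session_id: int) -> None:
        # Hypothetical assignment rule: spread sessions across relays by id.
        self.relays[session_id % len(self.relays)].add_session(session_id)

    def publish(self, event: str) -> None:
        for relay in self.relays:
            relay.publish(event)
            self.sends += 1


if __name__ == "__main__":
    main = MainProcess(num_relays=4)
    for session_id in range(1_000):
        main.connect(session_id)
    main.publish("hello")
    print("main process sends:", main.sends)
```

In this sketch the main process does work proportional to the number of relays, while each relay handles only its own pool of sessions.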
Finally, they implemented a hybrid approach to storing members: the list of all members was stored in ETS (Erlang Term Storage), and the most recent changes were kept in the process heap for faster access. Together, these changes led to massive performance improvements in Discord's Elixir stack, especially in presence updates.
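The hybrid member storage described above can be pictured as a large table with a small overlay of recent changes in front of it. The Python sketch below is an analogy only: plain dictionaries stand in for the ETS table and the process heap, and the `recent_limit` flush policy is an assumption, since the source does not say when recent changes are moved into ETS.

```python
from __future__ import annotations


class HybridMemberStore:
    """Large shared table plus a small overlay of recent changes."""

    def __init__(self, recent_limit: int = 3) -> None:
        self.shared_table: dict[int, dict] = {}  # stands in for the ETS table
        self.recent: dict[int, dict] = {}        # stands in for process-heap data
        self.recent_limit = recent_limit         # hypothetical flush threshold

    def update(self, member_id: int, data: dict) -> None:
        # New changes land in the small overlay first.
        self.recent[member_id] = data
        if len(self.recent) > self.recent_limit:
            self.flush()

    def flush(self) -> None:
        # Move accumulated changes into the large table in one batch.
        self.shared_table.update(self.recent)
        self.recent.clear()

    def get(self, member_id: int) -> dict | None:
        # Recent changes win over the older copy in the large table.
        if member_id in self.recent:
            return self.recent[member_id]
        return self.shared_table.get(member_id)


if __name__ == "__main__":
    store = HybridMemberStore(recent_limit=2)
    store.update(1, {"status": "online"})
    store.update(2, {"status": "idle"})
    print(store.get(1), len(store.shared_table))
```

Reads check the overlay before the large table, so the freshest data is always returned while the bulk of the members stays out of the hot path.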
Written by

Kunal Khare
22, engineer, obsessed with iterating