Understanding Slack's January 2021 Outage: A Restaurant Analogy
On January 4th, 2021, Slack experienced a significant outage that affected millions of users worldwide. While the root cause was traced to degraded AWS Transit Gateways (network connectivity issues), an interesting cascade of events occurred when Slack tried to scale its systems to handle this failure. Let's break down one of these critical scaling issues using an analogy that everyone can understand.
The Technical Summary
Slack's web infrastructure scales on two key metrics:
CPU utilization
Apache worker thread utilization
During the outage, network issues created a cascade of events, sketched in code after this list:
Network problems caused threads to wait longer
CPU utilization dropped initially, triggering downscaling
Thread utilization spiked, prompting rapid upscaling
Attempts to add 1,200 servers failed due to provision-service limitations
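To make the two-metric behavior concrete, here is a minimal, hypothetical sketch of an autoscaling decision driven by CPU utilization and worker thread utilization. The thresholds, scaling factors, and function name are illustrative assumptions, not Slack's actual policy.

```python
# Hypothetical two-metric autoscaling decision. Thresholds, scaling factors,
# and names are illustrative assumptions, not Slack's actual policy.

def desired_instances(current: int, cpu_util: float, thread_util: float) -> int:
    """Return a target fleet size from two utilization signals (0.0 to 1.0)."""
    if thread_util > 0.80:
        # Worker threads are almost all occupied (even if only waiting on
        # the network), so add capacity aggressively.
        return int(current * 1.5)
    if cpu_util < 0.25 and thread_util < 0.50:
        # Both signals look idle, so shed some instances.
        return max(1, int(current * 0.75))
    return current


# During a network stall the two signals diverge: CPU stays low while threads
# pile up waiting on I/O, so the policy first shrinks the fleet and then
# tries to grow it rapidly.
print(desired_instances(current=100, cpu_util=0.15, thread_util=0.30))  # 75: downscale
print(desired_instances(current=75, cpu_util=0.15, thread_util=0.95))   # 112: rapid upscale
```

The point of the sketch is that neither signal is wrong on its own; the trouble starts when network latency makes them tell opposite stories.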
The Restaurant Analogy: Understanding What Went Wrong
Imagine you're running a popular restaurant on New Year's evening. Your restaurant has a sophisticated system to manage staff levels, similar to how Slack manages its servers.
The Setup
Your restaurant decides how many waiters to have based on two things:
How tired the waiters are (CPU utilization)
How many waiters are currently busy with customers (Apache worker threads)
The Problem Unfolds
On this particularly busy New Year's evening, all the main highways in the city are experiencing severe traffic jams (degraded AWS Transit Gateways). This means everything - from food deliveries to staff commutes - is taking much longer than usual. Here's what happens:
The Initial Confusion
Your waiters are standing around waiting for food from the kitchen
They're not tired (low CPU), so you think you need fewer waiters
Some waiters are sent home (downscaling)
The Real Crisis
Suddenly, all remaining waiters get stuck waiting for food
They're not physically tired, but they're all occupied (the sketch after this list shows the same pattern in code)
No one is available to take new customer orders
You realize you need many more waiters, fast!
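Here is a minimal sketch of why this happens in software, assuming a fixed pool of worker threads and a slow downstream call (the pool size and sleep time are made up): every thread is "busy", yet the CPU is nearly idle because the threads are only waiting.

```python
# Simplified illustration (not Slack's code): a fixed thread pool whose
# workers block on a slow network call. Thread utilization hits 100%
# while CPU usage stays near zero, because waiting burns no CPU.

import time
from concurrent.futures import ThreadPoolExecutor

POOL_SIZE = 8  # stand-in for a small pool of Apache worker threads

def handle_request(i: int) -> str:
    time.sleep(2)  # simulates a backend call slowed by degraded networking
    return f"request {i} done"

with ThreadPoolExecutor(max_workers=POOL_SIZE) as pool:
    start = time.time()
    results = list(pool.map(handle_request, range(POOL_SIZE * 2)))

print(f"served {len(results)} requests in {time.time() - start:.1f}s "
      f"with all {POOL_SIZE} workers occupied but doing almost no CPU work")
```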
The Failed Solution
Your hiring manager (provision-service) tries to hire 1,200 new waiters in just 15 minutes, but (see the sketch after this list for the limits in play):
The hiring system is affected by the same slowdown
They run out of paperwork forms (Linux open files limit)
They hit maximum hiring quotas (AWS quota limit)
The entire hiring process grinds to a halt
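For the curious, here is a hedged sketch of the two ceilings the hiring manager runs into, translated back into software terms. The quota value and the idea that each in-flight provisioning request holds a file descriptor are illustrative assumptions; this is not the provision-service's actual code.

```python
# Illustrative only: check a burst of provisioning requests against the
# Linux open-files limit on the provisioning host and an assumed
# cloud-account instance quota. Values are made up for the example.

import resource

INSTANCES_REQUESTED = 1200
ASSUMED_CLOUD_QUOTA = 1000  # hypothetical account-level instance quota

# Each in-flight provisioning call typically holds at least one socket,
# and every socket is a file descriptor on the host making the calls.
soft_nofile, _hard_nofile = resource.getrlimit(resource.RLIMIT_NOFILE)

if INSTANCES_REQUESTED > soft_nofile:
    print(f"would exhaust file descriptors: ~{INSTANCES_REQUESTED} needed, "
          f"soft limit is {soft_nofile}")

if INSTANCES_REQUESTED > ASSUMED_CLOUD_QUOTA:
    print(f"would exceed the instance quota: {INSTANCES_REQUESTED} requested, "
          f"quota is {ASSUMED_CLOUD_QUOTA}")
```

Either ceiling on its own is enough to stall a burst of 1,200 launches, which is why the analogy's hiring spree grinds to a halt.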
The Key Insight
The most interesting part of this failure is how the solution became impossible due to the original problem. It's like being stuck in a traffic jam and calling for more buses to help – but those buses are stuck in the same traffic trying to reach you!
Lessons Learned
This incident highlights the importance of:
Having backup systems for critical operations
Understanding how different parts of a system affect each other
Planning for scenarios where the solution might be affected by the original problem
The next time you're waiting for a Slack message to send, remember: even the most sophisticated systems can face challenges similar to those of a busy restaurant during a holiday crisis. It's all about managing resources, responding to problems, and having robust backup plans.