Understanding Slack's January 2021 Outage: A Restaurant Analogy
On January 4th, 2021, Slack experienced a significant outage that affected millions of users worldwide. While the root cause was traced to degraded AWS Transit Gateways (network connectivity issues), an interesting cascade of events occurred when Slack tried to scale its systems to handle this failure. Let's break down one of these critical scaling issues using an analogy that everyone can understand.
The Technical Summary
Slack's web infrastructure scales on two key metrics:
CPU utilization
Apache worker thread utilization
During the outage, network issues created a cascade of events, sketched in code after this list:
Network problems caused threads to wait longer
CPU utilization dropped initially, triggering downscaling
Thread utilization spiked, prompting rapid upscaling
Attempts to add 1,200 servers failed due to provision-service limitations
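To make the two-metric behavior concrete, here is a minimal, hypothetical sketch of an autoscaling decision driven by CPU utilization and worker thread utilization. The thresholds, scaling factors, and function name are illustrative assumptions, not Slack's actual policy.

```python
# Hypothetical two-metric autoscaling decision. Thresholds, scaling factors,
# and names are illustrative assumptions, not Slack's actual policy.

def desired_instances(current: int, cpu_util: float, thread_util: float) -> int:
    """Return a target fleet size from two utilization signals (0.0 to 1.0)."""
    if thread_util > 0.80:
        # Worker threads are almost all occupied (even if only waiting on
        # the network), so add capacity aggressively.
        return int(current * 1.5)
    if cpu_util < 0.25 and thread_util < 0.50:
        # Both signals look idle, so shed some instances.
        return max(1, int(current * 0.75))
    return current


# During a network stall the two signals diverge: CPU stays low while threads
# pile up waiting on I/O, so the policy first shrinks the fleet and then
# tries to grow it rapidly.
print(desired_instances(current=100, cpu_util=0.15, thread_util=0.30))  # 75: downscale
print(desired_instances(current=75, cpu_util=0.15, thread_util=0.95))   # 112: rapid upscale
```

The point of the sketch is that neither signal is wrong on its own; the trouble starts when network latency makes them tell opposite stories.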
The Restaurant Analogy: Understanding What Went Wrong
Imagine you're running a popular restaurant on New Year's evening. Your restaurant has a sophisticated system to manage staff levels, similar to how Slack manages its servers.
The Setup
Your restaurant decides how many waiters to have based on two things:
How tired the waiters are (CPU utilization)
How many waiters are currently busy with customers (Apache worker threads)
The Problem Unfolds
On this particularly busy New Year's evening, all the main highways in the city are experiencing severe traffic jams (degraded AWS Transit Gateways). This means everything - from food deliveries to staff commutes - is taking much longer than usual. Here's what happens:
The Initial Confusion
Your waiters are standing around waiting for food from the kitchen
They're not tired (low CPU), so you think you need fewer waiters
Some waiters are sent home (downscaling)
The Real Crisis
Suddenly, all remaining waiters get stuck waiting for food
They're not physically tired, but they're all occupied (the sketch after this list shows the same pattern in code)
No one is available to take new customer orders
You realize you need many more waiters, fast!
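Here is a minimal sketch of why this happens in software, assuming a fixed pool of worker threads and a slow downstream call (the pool size and sleep time are made up): every thread is "busy", yet the CPU is nearly idle because the threads are only waiting.

```python
# Simplified illustration (not Slack's code): a fixed thread pool whose
# workers block on a slow network call. Thread utilization hits 100%
# while CPU usage stays near zero, because waiting burns no CPU.

import time
from concurrent.futures import ThreadPoolExecutor

POOL_SIZE = 8  # stand-in for a small pool of Apache worker threads

def handle_request(i: int) -> str:
    time.sleep(2)  # simulates a backend call slowed by degraded networking
    return f"request {i} done"

with ThreadPoolExecutor(max_workers=POOL_SIZE) as pool:
    start = time.time()
    results = list(pool.map(handle_request, range(POOL_SIZE * 2)))

print(f"served {len(results)} requests in {time.time() - start:.1f}s "
      f"with all {POOL_SIZE} workers occupied but doing almost no CPU work")
```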
The Failed Solution
Your hiring manager (provision-service) tries to hire 1,200 new waiters in just 15 minutes, but (see the sketch after this list for the limits in play):
The hiring system is affected by the same slowdown
They run out of paperwork forms (Linux open files limit)
They hit maximum hiring quotas (AWS quota limit)
The entire hiring process grinds to a halt
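For the curious, here is a hedged sketch of the two ceilings the hiring manager runs into, translated back into software terms. The quota value and the idea that each in-flight provisioning request holds a file descriptor are illustrative assumptions; this is not the provision-service's actual code.

```python
# Illustrative only: check a burst of provisioning requests against the
# Linux open-files limit on the provisioning host and an assumed
# cloud-account instance quota. Values are made up for the example.

import resource

INSTANCES_REQUESTED = 1200
ASSUMED_CLOUD_QUOTA = 1000  # hypothetical account-level instance quota

# Each in-flight provisioning call typically holds at least one socket,
# and every socket is a file descriptor on the host making the calls.
soft_nofile, _hard_nofile = resource.getrlimit(resource.RLIMIT_NOFILE)

if INSTANCES_REQUESTED > soft_nofile:
    print(f"would exhaust file descriptors: ~{INSTANCES_REQUESTED} needed, "
          f"soft limit is {soft_nofile}")

if INSTANCES_REQUESTED > ASSUMED_CLOUD_QUOTA:
    print(f"would exceed the instance quota: {INSTANCES_REQUESTED} requested, "
          f"quota is {ASSUMED_CLOUD_QUOTA}")
```

Either ceiling on its own is enough to stall a burst of 1,200 launches, which is why the analogy's hiring spree grinds to a halt.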
The Key Insight
The most interesting part of this failure is how the solution became impossible due to the original problem. It's like being stuck in a traffic jam and calling for more buses to help – but those buses are stuck in the same traffic trying to reach you!
Lessons Learned
This incident highlights the importance of:
Having backup systems for critical operations
Understanding how different parts of a system affect each other
Planning for scenarios where the solution might be affected by the original problem
The next time you're waiting for a Slack message to send, remember: even the most sophisticated systems can face challenges similar to those of a busy restaurant during a holiday crisis. It's all about managing resources, responding to problems, and having robust backup plans.