CrowdStrike Outage

The events of Friday morning, 19 July 2024, are a stark reminder of why best practices and quality standards matter. CrowdStrike, a leading cybersecurity company valued at around $80B, caused a massive outage with a failed global software update. The impact has been catastrophic, affecting airports, hospitals, flight operators, train services, TV broadcasters, and more. The incident highlights the need for robust deployment strategies, particularly for software that critical infrastructure depends on.

What Happened?

CrowdStrike's software, which provides antivirus, firewall, intrusion detection, encryption, and application control for Windows machines, was updated globally. The update crashed Windows systems worldwide, halting essential services and leaving machines stuck on the dreaded Blue Screen of Death.

Usually, when buggy code reaches production, the fix is to revert to the previous version or deploy a patch. In this case, however, the affected machines cannot boot, so neither option can be delivered remotely and recovery is manual and time-consuming: each machine must be booted into safe mode, have a file deleted, and then be rebooted. A workaround was suggested, and it was brutal:

1. Boot Windows into Safe Mode or the Windows Recovery Environment (WinRE).
2. Go to C:\Windows\System32\drivers\CrowdStrike
3. Locate and delete file matching "C-00000291*.sys"
4. Boot normally.
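
For a machine that has already been brought up in Safe Mode, steps 2 and 3 could in principle be scripted. The following is a minimal Python sketch under stated assumptions, not an official CrowdStrike tool: the function name `remove_matching_files` and the `dry_run` flag are hypothetical, and it assumes a Python interpreter is available on the affected machine, which will often not be the case. The directory and file pattern are the ones given in the steps above.

```python
from pathlib import Path

# Directory and pattern taken from the workaround steps above.
DRIVER_DIR = Path(r"C:\Windows\System32\drivers\CrowdStrike")
PATTERN = "C-00000291*.sys"


def remove_matching_files(directory: Path, pattern: str = PATTERN,
                          dry_run: bool = True) -> list[Path]:
    """Return the files in `directory` matching `pattern`.

    The files are deleted only when `dry_run` is False (hypothetical
    safety switch, so the default call just reports what it found).
    """
    matches = sorted(p for p in directory.glob(pattern) if p.is_file())
    if not dry_run:
        for path in matches:
            path.unlink()
    return matches


if __name__ == "__main__":
    # Dry run by default: list what would be deleted, delete nothing.
    for path in remove_matching_files(DRIVER_DIR):
        print(f"would delete: {path}")
```

Defaulting to a dry run is a deliberate choice here: deleting files from a driver directory is not something to do by accident, so the destructive path has to be requested explicitly with `dry_run=False`.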

The Failure of Deployment Practices

The main problem here is the lack of a staged rollout, often implemented as a canary deployment. This method deploys a change to a small group of machines first, so that problems can be found and fixed before everyone else receives it. That this step was missing from CrowdStrike's deployment process is hard to understand, especially for a company whose security software runs on critical systems.

Staged rollouts are essential because they limit the blast radius of a bad change. A canary deployment would very likely have surfaced the issue in a controlled environment, avoiding the widespread impact seen today. Skipping it is not just a technical oversight but a profound failure of deployment strategy and quality assurance.
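
To make the idea concrete, here is a minimal Python sketch of a staged rollout that halts as soon as one stage looks unhealthy. It is an illustration only and says nothing about how CrowdStrike's pipeline works: the stage percentages, the 2% failure threshold, and the names `bucket`, `staged_rollout`, and `deploy` are all assumptions made for the example. `deploy` stands in for whatever pushes the update to one host and reports whether the host is still healthy afterwards.

```python
import hashlib
from typing import Callable, Iterable

# Hypothetical stages: cumulative percentage of the fleet covered after each one.
STAGES_PERCENT = (1, 10, 50, 100)


def bucket(host_id: str) -> int:
    """Map a host ID to a stable pseudo-random integer, so the same hosts
    always land in the same stage from one rollout to the next."""
    digest = hashlib.sha256(host_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def staged_rollout(hosts: Iterable[str],
                   deploy: Callable[[str], bool],
                   max_failure_rate: float = 0.02) -> tuple[bool, list[str]]:
    """Deploy stage by stage; stop when a stage exceeds the failure threshold.

    Returns (completed, hosts_that_received_the_update).
    """
    ordered = sorted(hosts, key=bucket)
    updated: list[str] = []
    start = 0
    for percent in STAGES_PERCENT:
        end = -(-len(ordered) * percent // 100)  # integer ceiling division
        cohort = ordered[start:end]
        if not cohort:
            continue
        failures = sum(1 for host in cohort if not deploy(host))
        updated.extend(cohort)
        if failures / len(cohort) > max_failure_rate:
            return False, updated  # halt: later stages never get the update
        start = end
    return True, updated
```

With a fleet of 1,000 hosts and an update that breaks every machine it touches, this sketch stops after the first stage, having affected 10 hosts rather than all 1,000.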

The Industry's Flawed Mindset

This incident highlights a common problem in the software industry: not valuing safe practices enough. Even though we often talk about the benefits of canary releases, dark launches, safe deployments, rollbacks, and staged rollouts, these methods are frequently ignored as too cautious or strict. The focus on quick releases and immediate profits often outweighs the importance of quality and safety.

The reality is that many software companies operate under the philosophy that getting close to the edge of failure is acceptable, as long as the minimum necessary measures are in place to prevent disaster. Engineers advocating for robust measures to keep failures at bay are frequently told they lack business acumen.

Embracing High-Quality Standards

To prevent future incidents like the CrowdStrike outage, companies must adopt a mindset prioritizing high quality and best practices. Here are some key areas to focus on:

CI/CD Pipelines: Implementing continuous integration and continuous deployment (CI/CD) pipelines is crucial. A well-built pipeline tests every code change automatically before it is deployed, reducing the likelihood of errors reaching production. That automated testing acts as a safety net that catches issues early.
Staged Rollouts and Canary Releases: Always deploy changes to a small subset of users first. This controlled approach allows for the identification and resolution of issues before a full-scale rollout.
Observability: Invest in robust observability tools that provide real-time insights into system performance. This includes logging, monitoring, and alerting systems that can detect anomalies and trigger immediate responses.
Continuous Improvement: Foster a culture of continuous improvement where feedback loops are established, and lessons learned from incidents are integrated into the deployment process. Regularly review and update deployment practices to align with the latest industry standards and technologies.
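
As one illustration of the observability point, the sketch below keeps a sliding window of recent host health reports and signals when the crash rate crosses a threshold, which is the kind of signal that could pause a rollout automatically. It is a simplified Python example, not a description of any particular monitoring product: the class name `CrashRateMonitor`, the window size, the threshold, and the minimum sample count are all hypothetical values chosen for the example.

```python
from collections import deque


class CrashRateMonitor:
    """Track the most recent health reports and flag an abnormal crash rate."""

    def __init__(self, window_size: int = 100, threshold: float = 0.05,
                 min_samples: int = 20) -> None:
        # deque with maxlen drops the oldest report once the window is full.
        self.reports: deque[bool] = deque(maxlen=window_size)
        self.threshold = threshold
        self.min_samples = min_samples

    def record(self, crashed: bool) -> bool:
        """Record one report; return True if an alert should fire."""
        self.reports.append(crashed)
        if len(self.reports) < self.min_samples:
            return False  # too little data to judge yet
        crash_rate = sum(self.reports) / len(self.reports)
        return crash_rate > self.threshold


if __name__ == "__main__":
    monitor = CrashRateMonitor(window_size=50, threshold=0.10, min_samples=10)
    for _ in range(20):
        monitor.record(crashed=False)  # healthy baseline
    for i in range(1, 4):
        if monitor.record(crashed=True):
            print(f"alert after crash report {i}: pause the rollout")
```

The `min_samples` guard is there so that a single early crash report does not trigger an alert before there is enough data to compute a meaningful rate.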

Conclusion

The CrowdStrike outage is a wake-up call for the software industry. It highlights the dire consequences of neglecting best practices and prioritizing speed over quality. By embracing a culture of high quality, adhering to best practices, and implementing robust CI/CD pipelines and observability tools, companies can prevent similar disasters and ensure the reliability and security of their systems.
