How Chaos Engineering Helps You Prepare for Traffic Spikes, Crashes, and Cloud Outages


In the past, most applications were built as monoliths. The biggest concern then was something like a power outage taking down the whole system. Modern applications have evolved: today, most are built from microservices running in complex environments such as Kubernetes in the cloud.
Let’s say you’re the DevOps lead at a FinTech startup that provides a payment API used by hundreds of e-commerce platforms. Your system is built using microservices on Kubernetes, hosted on AWS. It includes critical services such as:
API Gateway
Authentication
Payment Processing
MongoDB
Redis
You currently serve about 100,000 active users, but what if that number spikes to 1 million during a seasonal event? What happens if there's a partial AWS zone failure? Or your Redis instance crashes under heavy write load?
This is where Chaos Engineering becomes essential. Instead of waiting for failures to happen in production, Chaos Engineering lets you simulate real-world outages such as service crashes, network latency, or infrastructure failures in a controlled way.
The goal isn’t to break things randomly, but to uncover hidden issues, test your system’s resilience, and ensure that your architecture can gracefully handle unexpected failures.
In today’s distributed world, resilience is no longer optional. If you want to build systems that thrive under pressure, Chaos Engineering isn’t just useful, it’s critical.
What is Chaos Engineering?
Chaos Engineering is the practice of deliberately injecting faults into a system in a controlled environment to observe how it responds. The goal is to identify weaknesses and improve the architecture, making the system more resilient to real-world failures. In today’s world, a proactive approach to failure is far more valuable than simply reacting to disasters.
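To make "injecting faults" concrete, here is a minimal sketch of a wrapper that adds latency and random failures around a function call. Every name in it (`with_chaos`, `InjectedFault`, `charge`, `run_calls`) is made up for illustration, and `charge` is only a stand-in for a real payment call. Real chaos tools inject faults at the infrastructure level; this only shows the idea in-process.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class InjectedFault(Exception):
    """Raised when the wrapper deliberately fails a call."""


def with_chaos(
    func: Callable[..., T],
    failure_rate: float,
    max_delay_s: float,
    rng: random.Random,
    sleep: Callable[[float], None] = time.sleep,
) -> Callable[..., T]:
    """Wrap func so that calls are delayed and some fail on purpose."""

    def wrapper(*args, **kwargs):
        # Add artificial latency to mimic a slow network or dependency.
        sleep(rng.uniform(0.0, max_delay_s))
        # Fail a configurable fraction of calls to mimic a crashed service.
        if rng.random() < failure_rate:
            raise InjectedFault(f"injected failure in {func.__name__}")
        return func(*args, **kwargs)

    return wrapper


def charge(amount_cents: int) -> str:
    """Hypothetical stand-in for a call to the payment service."""
    return f"charged {amount_cents}"


def run_calls(total: int, failure_rate: float, seed: int = 42) -> dict:
    """Call the chaotic charge() `total` times and count the outcomes."""
    rng = random.Random(seed)  # seeded so a run can be repeated
    chaotic_charge = with_chaos(charge, failure_rate, 0.0, rng)
    results = {"ok": 0, "failed": 0}
    for _ in range(total):
        try:
            chaotic_charge(1000)
            results["ok"] += 1
        except InjectedFault:
            results["failed"] += 1
    return results


if __name__ == "__main__":
    print(run_calls(total=1000, failure_rate=0.2))
```

Keeping the failure rate and delay as explicit parameters is what makes the fault "controlled": you decide how much chaos to apply and can turn it off by setting both to zero.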
On October 4, 2021, Facebook and its family of apps went completely offline for nearly 6 hours. It wasn’t just one app acting weird: everything was down, from messaging and login to the internal tools used by Facebook employees themselves. And the cause? Not a cyberattack, and not a hardware failure, but a misconfiguration. Facebook engineers were doing routine maintenance on the backbone network, the infrastructure that connects all their data centers globally. Engineers normally run commands to update network configurations, but that day a faulty command accidentally disconnected Facebook’s data centers from the internet. Imagine pulling the plug from the wall: from the outside world, everything appeared powered off. This resulted in:
3.5+ billion users lost access
Facebook’s stock fell by about 5%, and Mark Zuckerberg’s net worth dropped by roughly $6 billion.
Businesses relying on WhatsApp lost sales and communication.
People turned to Twitter, Telegram, and Signal.
What we can learn
Even the biggest companies can go dark due to simple misconfigurations
Having out-of-band access, fallback tools, and manual overrides is critical
Chaos Engineering would help simulate these types of failures in a safe environment.
A Chaos Engineering experiment follows 5 major steps:
- Collect Metrics (Define Steady State)
Start by identifying how your system behaves under normal, healthy conditions. This is your steady state, the baseline that you will compare everything against.
- Form a Hypothesis
Make an assumption about how your system should behave during a failure. The hypothesis sets expectations and helps define success or failure.
- Design the Experiment
Plan a controlled failure scenario that tests the hypothesis. The experiment should be small, measurable, and focused.
- Measure the Impact
Monitor your system to see how it responds during and after the chaos. Use metrics, logs, alerts, and dashboards to assess the system’s behavior.
- Understand System Behavior & Learn
Analyze the results to identify weak spots, misconfigurations, or unexpected behaviors. Use this to improve your system’s resilience.
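The five steps above can be sketched as a small experiment runner. This is only an illustration under stated assumptions: `FakePaymentSystem`, its latency numbers (20 ms with the cache up, 90 ms when falling back to the database), and the `max_slowdown` threshold are all invented for the example, and a real experiment would read metrics from your monitoring stack instead.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Callable


@dataclass
class ExperimentResult:
    baseline_ms: float
    during_fault_ms: float
    hypothesis_held: bool


def run_experiment(
    measure_latency_ms: Callable[[], float],
    inject_fault: Callable[[], None],
    remove_fault: Callable[[], None],
    max_slowdown: float,
    samples: int = 5,
) -> ExperimentResult:
    # Step 1: collect metrics to define the steady state.
    baseline = mean(measure_latency_ms() for _ in range(samples))
    # Step 2: hypothesis - latency stays within max_slowdown x the baseline.
    threshold = baseline * max_slowdown
    # Step 3: run the controlled failure, and always roll it back.
    inject_fault()
    try:
        # Step 4: measure the impact while the fault is active.
        during = mean(measure_latency_ms() for _ in range(samples))
    finally:
        remove_fault()
    # Step 5: compare against the hypothesis so you can learn from it.
    return ExperimentResult(baseline, during, during <= threshold)


class FakePaymentSystem:
    """Hypothetical system: fast with the cache up, slower without it."""

    def __init__(self) -> None:
        self.cache_up = True

    def latency_ms(self) -> float:
        # Invented numbers: cache hit vs. fallback to the database.
        return 20.0 if self.cache_up else 90.0

    def kill_cache(self) -> None:
        self.cache_up = False

    def restore_cache(self) -> None:
        self.cache_up = True


if __name__ == "__main__":
    system = FakePaymentSystem()
    result = run_experiment(
        system.latency_ms, system.kill_cache, system.restore_cache,
        max_slowdown=2.0,
    )
    print(result)
```

The `try`/`finally` block is the important design choice here: whatever happens during measurement, the fault is removed, which keeps the experiment small and reversible.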
Popular solutions to implement chaos engineering
Netflix's Chaos Monkey
Gremlin
Azure Chaos Studio
AWS Fault Injection Service (AWS FIS)
AWS Fault Injection Service (FIS), formerly known as AWS Fault Injection Simulator, enables you to run real-world failure scenarios on your AWS workloads in a controlled environment. By intentionally injecting faults such as network latency, server termination, or API throttling, FIS helps uncover hidden weaknesses so you can build more resilient and fault-tolerant systems.
FIS integrates seamlessly with other AWS services such as CloudWatch Alarms, Amazon RDS, IAM, and EC2, allowing you to safely automate experiments while maintaining full visibility, control, and security.
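As a rough illustration of what an FIS experiment looks like, the sketch below builds an experiment template as a plain Python dictionary: one action that stops a single tagged EC2 instance, plus a CloudWatch alarm as a stop condition. The field names follow the FIS experiment template format as I understand it, so check them against the current AWS documentation before relying on them. The helper name, the `stop-payment-node` and `payment-nodes` labels, the tag, and the ARNs are all placeholders.

```python
import json


def build_stop_instance_template(
    role_arn: str, alarm_arn: str, tag_key: str, tag_value: str
) -> dict:
    """Build an FIS-style template that stops one tagged EC2 instance."""
    return {
        "description": "Stop one payment node and watch the error rate",
        # IAM role that FIS assumes to run the experiment (placeholder).
        "roleArn": role_arn,
        "targets": {
            "payment-nodes": {
                "resourceType": "aws:ec2:instance",
                "resourceTags": {tag_key: tag_value},
                # Limit the blast radius to a single instance.
                "selectionMode": "COUNT(1)",
            }
        },
        "actions": {
            "stop-payment-node": {
                "actionId": "aws:ec2:stop-instances",
                # Restart the instance after 5 minutes (ISO 8601 duration).
                "parameters": {"startInstancesAfterDuration": "PT5M"},
                "targets": {"Instances": "payment-nodes"},
            }
        },
        # Halt the experiment automatically if this alarm fires.
        "stopConditions": [
            {"source": "aws:cloudwatch:alarm", "value": alarm_arn}
        ],
    }


if __name__ == "__main__":
    template = build_stop_instance_template(
        role_arn="arn:aws:iam::123456789012:role/fis-role",
        alarm_arn=(
            "arn:aws:cloudwatch:us-east-1:123456789012:alarm:high-error-rate"
        ),
        tag_key="service",
        tag_value="payment",
    )
    print(json.dumps(template, indent=2))
    # With boto3 installed and AWS credentials configured, a template like
    # this could be registered roughly as follows (not executed here):
    #   import boto3, uuid
    #   boto3.client("fis").create_experiment_template(
    #       clientToken=str(uuid.uuid4()), **template)
```

The stop condition is what ties the experiment to CloudWatch Alarms: if the alarm goes off while the fault is active, FIS stops the experiment instead of letting the damage spread.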
Architecture
Written by

Oshaba Samson
I am a software developer with 5+ years of experience. I have worked on web apps for e-commerce, e-learning, HRM, and many other domains.