Fail Fast, Recover Faster: LSEG’s Chaos Engineering Experiments on AWS

In today’s economy, resiliency is not just an option but a necessity. This is especially true in industries like finance and healthcare, where every millisecond holds significant value and outages can cost millions. In these scenarios, systems must not only perform but also withstand unexpected failure.

That’s where “Chaos Engineering” comes in.

Chaos engineering is the discipline of intentionally triggering failures in a system to uncover weaknesses before they become catastrophic. Again, it’s not about breaking things for fun (though no one is stopping you); it’s about learning how things break so you can design them to resist or recover from failure.

Popularised by Netflix, the idea has become a key ingredient for building resilient, distributed systems. It goes well beyond traditional testing by embracing the complexity of real-world infrastructure: fluctuating network latencies, misbehaving services, hardware faults and human errors.

Rather than waiting for the next outage, engineers can now proactively simulate those failure modes and measure how well the system holds up.

Why It Matters More Than Ever

In a world where your customers expect 24/7 availability and zero downtime, even the most robust cloud-native architectures are only as strong as their weakest dependency. As systems grow more distributed and dependencies multiply, unpredictability becomes the norm. Chaos Engineering is no longer a nice-to-have; for mission-critical platforms like the London Stock Exchange Group (LSEG), it is a strategic necessity.

Breaking to Build Better

When you’re building and operating critical infrastructure for global financial markets, there is no room for guesswork, especially when things can go very wrong.

That’s why, in this blog, we will dig into one of AWS’s most insightful case studies, this one focused on the London Stock Exchange Group (LSEG), to understand how chaos engineering is being used in practice to strengthen resilience. Not theory, but real-world testing on live systems, done smartly and safely before anything hits production.

Here is what I learned, and why it mattered.

As discussed above, modern cloud architectures are beautifully powerful but also incredibly complex. And as we all know, with complexity comes uncertainty. Failures don’t announce themselves politely; they creep in from places you didn’t think to check. Maybe a missed alert, a misconfigured service or an overloaded queue downstream. And, POW. Millions of dollars gone.

For LSEG, this wasn’t some experiment or a side project but a critical exercise in improving resiliency and meeting regulatory compliance in a high-stakes, zero-tolerance industry.

Handshake: LSEG + AWS + Chaos

LSEG’s Post Trade Technology team partnered with AWS for a focused Experience-Based Acceleration (EBA) event, specifically designed to run chaos engineering experiments on important workloads.

Led by the architecture function and supported by multiple cross-functional technical teams, the event was not just about testing but about learning. They used AWS Fault Injection Service (FIS) to simulate real-world scenarios like instance failures, service degradation and latency spikes, all within a safe, controlled environment.

AWS’s proven framework methodologies were followed, including the one detailed in their “Verify the Resilience of Your Workload Using Chaos Engineering” blog post.

What’s powerful here is not just the tooling but the intentional mindset. LSEG treated resilience as a living capability rather than a checkbox. They looked at:

  • How their workloads were architected for failures.

  • How SOPs responded in real-time.

  • Where recovery automation needed to be spun up.

  • Where observability had blind spots.

Tools like AWS Resilience Hub helped them map out, monitor and manage resilience continuously. But it was chaos engineering that brought it to life, ironing out those edge-case vulnerabilities that don’t show up in traditional test cycles.
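The “continuous” part can even be automated: Resilience Hub exposes an API you can poll from a scheduled job. Here is a minimal sketch with boto3 (my own illustration, not from the case study), assuming the applications have already been onboarded into Resilience Hub:

```python
import boto3

# Sketch: list onboarded applications and their latest resilience assessments,
# so regressions in RTO/RPO compliance surface between chaos-testing cycles.
rh = boto3.client("resiliencehub")

for app in rh.list_apps()["appSummaries"]:
    assessments = rh.list_app_assessments(
        appArn=app["appArn"],
        reverseOrder=True,   # newest assessment first
        maxResults=1,
    )["assessmentSummaries"]
    if assessments:
        latest = assessments[0]
        print(app["name"], latest["complianceStatus"], latest["assessmentStatus"])
```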

Let’s Talk Architecture: How LSEG Handled Controlled Chaos

To test resiliency, you need a realistic architecture that closely reflects production, and that’s exactly what LSEG used.

[Figure: LSEG’s 3-tier hybrid architecture]


A Classic Yet Modern 3-Tier Hybrid Architecture

LSEG’s target system was a three-tier web application running in a hybrid cloud setup, spanning both AWS and on-premises infrastructure. At a glance, this is how it was architected:

Web Tier: Hosted in a public subnet on an Amazon EC2 Auto Scaling group, allowing horizontal and vertical scalability and self-healing behaviour.

App & Internal Services: Some services ran in containers within a separate VPC; think of it as a microservice layer wrapped in security.

Database Tier: Powered by Amazon RDS, running in a private subnet and tightly coupled with certain on-prem components.

Using AWS Fault Injection Service (FIS), the team could simulate different failure scenarios across the stack. Let’s look at each scenario in detail, and why it matters.

EC2 Instance Failures

This test mimics what happens when an EC2 instance suddenly stops or disappears, taking with it the application or container pods running on it. It’s a great way to verify whether Auto Scaling replaces the failed node fast enough, whether the load balancer reroutes traffic seamlessly, and whether state and session management are handled properly (especially for stateless apps). For container workloads running in ECS or EKS, similar chaos tests can target tasks and pods directly.

Simulated via aws:ec2:stop-instances or aws:ec2:terminate-instances
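To make this concrete, here is a minimal boto3 sketch of what such an experiment could look like. This is my own illustration, not LSEG’s actual template; the IAM role, CloudWatch alarm and tag values are placeholders.

```python
import boto3

fis = boto3.client("fis")

# Sketch: stop one tagged EC2 instance and restart it after 10 minutes,
# so Auto Scaling and load-balancer behaviour can be observed in the meantime.
template = fis.create_experiment_template(
    clientToken="ec2-stop-demo-1",  # idempotency token, any unique string
    description="Stop a tagged EC2 instance and watch self-healing",
    roleArn="arn:aws:iam::123456789012:role/fis-experiment-role",  # placeholder role
    stopConditions=[{
        "source": "aws:cloudwatch:alarm",  # guardrail: abort if this alarm fires
        "value": "arn:aws:cloudwatch:eu-west-2:123456789012:alarm:web-tier-errors",
    }],
    targets={
        "WebInstances": {
            "resourceType": "aws:ec2:instance",
            "resourceTags": {"chaos-ready": "true"},  # only opt-in instances
            "selectionMode": "COUNT(1)",              # pick a single instance
        }
    },
    actions={
        "stop-one-instance": {
            "actionId": "aws:ec2:stop-instances",
            "parameters": {"startInstancesAfterDuration": "PT10M"},
            "targets": {"Instances": "WebInstances"},
        }
    },
)

# Kick off the experiment; progress can be followed in the FIS console or via the API.
fis.start_experiment(
    clientToken="ec2-stop-demo-run-1",
    experimentTemplateId=template["experimentTemplate"]["id"],
)
```

The CloudWatch-alarm stop condition is the safety net here: if the experiment starts hurting real users, FIS halts it automatically.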

RDS Failover & Reboot

Databases don’t crash often, but when they do, recovery time is everything. This test forces Amazon RDS into a failover or reboot event, helping LSEG measure Recovery Time Objective (RTO), spot issues with database syncs and replication, and validate alerts, metrics and failover automation. This is especially relevant for regulated industries, where data integrity and downtime targets are strictly enforced.

Tested using FIS-induced failovers and reboots
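The template structure stays the same as in the EC2 sketch above; only the targets and actions change. A hedged fragment for a Multi-AZ reboot-with-failover test (tag values are again placeholders):

```python
# Sketch only: targets/actions fragment for an RDS reboot-with-failover test.
# For Aurora clusters the equivalent action is aws:rds:failover-db-cluster
# with a target of resourceType aws:rds:cluster.
targets = {
    "PrimaryDatabase": {
        "resourceType": "aws:rds:db",
        "resourceTags": {"chaos-ready": "true"},
        "selectionMode": "ALL",
    }
}
actions = {
    "reboot-with-failover": {
        "actionId": "aws:rds:reboot-db-instances",
        "parameters": {"forceFailover": "true"},  # force a Multi-AZ failover, not just a reboot
        "targets": {"DBInstances": "PrimaryDatabase"},
    }
}
```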

Network Latency Injection

This one is crucial because it is about delay, not disruption. By injecting artificial latency, you can uncover how microservices behave under slow network links, whether retry logic is implemented properly, and whether user-facing SLAs degrade smoothly or catastrophically.

Simulated using AWSFIS-Run-Network-Latency with Linux tc (traffic control)
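Unlike the previous faults, latency injection is not a native FIS action: FIS runs the AWSFIS-Run-Network-Latency SSM document on the target instances, and the document shapes traffic with tc under the hood. A hedged fragment, with region, delay and duration values chosen purely for illustration:

```python
# Sketch only: inject ~200 ms of latency on eth0 for 5 minutes via the
# AWSFIS-Run-Network-Latency SSM document (which wraps Linux tc).
targets = {
    "AppInstances": {
        "resourceType": "aws:ec2:instance",
        "resourceTags": {"chaos-ready": "true"},
        "selectionMode": "ALL",
    }
}
actions = {
    "inject-latency": {
        "actionId": "aws:ssm:send-command",
        "parameters": {
            "documentArn": "arn:aws:ssm:eu-west-2::document/AWSFIS-Run-Network-Latency",
            "documentParameters": (
                '{"DurationSeconds": "300", "DelayMilliseconds": "200", '
                '"Interface": "eth0", "InstallDependencies": "True"}'
            ),
            "duration": "PT6M",  # keep the FIS action alive slightly longer than the document run
        },
        "targets": {"Instances": "AppInstances"},
    }
}
```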

Network Connectivity Disruption

A complete or partial network failure is a real possibility, from route table misconfiguration to region-level issues. These tests help you uncover whether services fail gracefully when cut off from dependencies, whether DNS fallbacks and failover routes are configured correctly, and whether metrics pick up on outages fast enough for action.

Tested using aws:network:disrupt-connectivity
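This action targets subnets rather than instances. A sketch of the relevant fragment, assuming the affected subnets carry the same opt-in tag used above:

```python
# Sketch only: block traffic in and out of the tagged subnets for 5 minutes.
# The scope can also be narrowed (e.g. availability-zone, s3, dynamodb) instead of "all".
targets = {
    "AppSubnets": {
        "resourceType": "aws:ec2:subnet",
        "resourceTags": {"chaos-ready": "true"},
        "selectionMode": "ALL",
    }
}
actions = {
    "isolate-subnets": {
        "actionId": "aws:network:disrupt-connectivity",
        "parameters": {"scope": "all", "duration": "PT5M"},
        "targets": {"Subnets": "AppSubnets"},
    }
}
```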

EBS I/O Pauses: Disk-Level Failures

This type of failure is very sneaky. By pausing I/O operations on Amazon EBS volumes, the team tested how well applications respond to stalled or failed storage, whether retry logic prevents data loss, and whether IOPS limits or saturation create systemic bottlenecks. This was very useful for pressure-testing the architecture during peak activity scenarios, where every millisecond counts.

Simulated using aws:ebs:pause-volume-io
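And a final sketch for the storage test, with the same placeholder tagging convention (as I understand the FIS prerequisites, the volumes need to be attached to Nitro-based EC2 instances for this action to apply):

```python
# Sketch only: pause all I/O on the tagged EBS volumes for two minutes and
# watch how the application and its retry logic cope with stalled storage.
targets = {
    "DataVolumes": {
        "resourceType": "aws:ec2:ebs-volume",
        "resourceTags": {"chaos-ready": "true"},
        "selectionMode": "ALL",
    }
}
actions = {
    "pause-volume-io": {
        "actionId": "aws:ebs:pause-volume-io",
        "parameters": {"duration": "PT2M"},
        "targets": {"Volumes": "DataVolumes"},
    }
}
```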

What They Learned (And Why It Matters)

At the end of the experiment, LSEG walked away with two major wins:

  • Actionable Improvements
    They discovered clear areas to enhance recovery time, improve observability, and optimize failover automation, all things that directly feed into their regulatory compliance and user trust goals.

  • A Repeatable Chaos Engineering Toolkit
    More than just fixes, they now have a methodology. The entire cross-functional team can rerun these chaos drills regularly to stay ahead of unknowns.

Conclusion

London Stock Exchange Group’s chaos engineering event is not just a technical case study; it is a blueprint for how high-stakes enterprises can move fast without breaking things, or better yet, by breaking things in a controlled, smart way.

Reference

Our blog explores and rewrites this case from a practical, engineering-first perspective — helping teams not just understand the what, but also the why and how of building resilience using Chaos Engineering on AWS.

This deep-dive was inspired by a real-world case study published by AWS:
"London Stock Exchange Group uses chaos engineering on AWS to improve resilience"
