Automating Disaster Simulations in the Pipeline with Game Day as Code

Patrick Kearns

Modern systems rarely fail in isolation. They fail at the worst possible moment, in the least expected way, when multiple dependencies intersect. In the past, resilience was something you tested only occasionally, if at all. A quarterly ‘Game Day’ exercise might involve taking down a service in staging and seeing how people responded. The trouble is that today’s systems no longer fit that model: they are dynamic, distributed, and updated continuously. Resilience cannot be something we check once in a while; it has to be woven into the delivery process. That is where Game Day as Code comes in.

Instead of treating resilience testing as an occasional ritual, we treat it as part of our pipelines: automated, repeatable, and codified. Just as infrastructure is now managed as code and deployments are governed by GitOps, Game Days too can be expressed as declarative artefacts that the pipeline runs. The result is a world where every release isn’t just tested for functionality but also for its ability to withstand failure.

From Fire Drills to Game Days

The concept of a Game Day originated with large internet companies who realised that staging environments never quite matched production. They would deliberately break things in production, at low scale, to test both technical systems and human response. Netflix popularised this approach with Chaos Monkey, which randomly terminated instances to ensure engineers built for resilience. Over time, these Game Days evolved into structured exercises, with teams simulating outages ranging from database loss to entire data centre failures.

Yet, despite the obvious value, Game Days often remained sporadic. They required human coordination, manual triggers, and a willingness to accept potential disruption. This meant they were done occasionally, perhaps during quarterly resilience tests or annual incident simulations. For smaller companies, they often never happened at all. Game Day as Code seeks to change that. By expressing failure modes and simulations as code, we can automate them. Instead of running chaos once a year, we run it on every build, every deployment, every environment.

Why Code Matters

When you codify a practice, you make it repeatable. Infrastructure as Code succeeded because it turned manual runbooks into declarative definitions that could be versioned, peer reviewed, and deployed at the push of a button. Game Days benefit from the same discipline. A YAML file or JSON spec might define what failure to introduce, when, how long to run it, and what metrics to observe.

By writing failures as code:

  • They can be version controlled alongside application and infrastructure code.

  • They can evolve with the system instead of relying on institutional memory.

  • They can be parameterised and reused in different environments.

  • They can be validated automatically as part of CI/CD.

In practice, this means that when a developer submits a pull request, the changes are tested not just for functionality and performance but also for resilience against defined failure modes, as in the sketch below.
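
As a minimal illustration, a pull-request check could lint and validate the scenario files themselves. This sketch assumes a GitHub Actions pipeline, a scenarios/ directory, and a hypothetical validate_scenarios.py helper; none of these are prescribed by any particular tool:

# Hypothetical GitHub Actions workflow: validate Game Day scenario files
# on every pull request that touches them.
name: validate-gameday-scenarios
on:
  pull_request:
    paths:
      - "scenarios/**"

jobs:
  validate:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Lint scenario YAML
        run: |
          pip install yamllint
          yamllint scenarios/
      - name: Check scenarios against the gameday/v1 schema
        # Assumed helper script; a JSON Schema check or a dry run against
        # your chaos tooling would serve the same purpose.
        run: python scripts/validate_scenarios.py scenarios/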

Embedding Game Days in the Pipeline

So how would you automate disaster simulations in a pipeline?

  1. Define failure scenarios as code. A scenario might be “simulate loss of database connectivity for 60 seconds” or “introduce 300ms latency between service A and service B”. These are stored as artefacts in a repo, just like Terraform or Helm charts.

  2. Trigger scenarios in CI/CD. After deploying to a staging or ephemeral environment, the pipeline runs selected chaos experiments. For critical services, a subset can even be run in production with safeguards.

  3. Measure system response. The pipeline doesn’t just introduce failure; it checks whether the system recovers. Observability is key here. If error budgets, SLOs, or recovery times are violated, the pipeline fails.

  4. Automated rollback or block. Just as a failing test blocks a release, a failing resilience check can trigger rollback or prevent promotion to higher environments.

  5. Report and learn. Results are published as part of pipeline reports, showing not only test outcomes but also resilience scores over time.

The entire cycle is hands-off: engineers define failures once, and pipelines re-run them continuously, as in the sketch below.
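
Here is a rough sketch of such a resilience-check stage. It assumes a GitHub Actions workflow that already has a deploy-staging job, a hypothetical gameday CLI that executes the checked-in scenarios, and a Prometheus recording rule http_request_success_rate that reports success as a ratio between 0 and 1; all three are illustrative assumptions:

# Hypothetical CI job appended to the workflow that deploys to staging.
# It runs the chaos scenarios and gates promotion on the observed metrics.
jobs:
  resilience-check:
    needs: deploy-staging
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Run chaos scenarios against staging
        run: gameday run scenarios/ --env staging
      - name: Verify success criteria
        run: |
          # Query Prometheus for the request success rate observed during
          # the experiment window; fail the job if it dropped below 95%.
          rate=$(curl -sG "http://prometheus.staging.example.com/api/v1/query" \
            --data-urlencode 'query=http_request_success_rate' \
            | jq -r '.data.result[0].value[1]')
          echo "Observed success rate: ${rate}"
          awk -v r="$rate" 'BEGIN { exit (r >= 0.95 ? 0 : 1) }'

A failing job here blocks promotion exactly as a failing unit test would, which is the rollback-or-block behaviour described in step 4.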

A YAML-Based Scenario

For this scenario, imagine we want to test that a service can handle a temporary cache outage. Our Game Day as Code spec might look like this:

apiVersion: gameday/v1
kind: Scenario
metadata:
  name: cache-outage
spec:
  target: redis-cluster
  action: stop
  duration: 60s
  successCriteria:
    - type: metric
      name: http_request_success_rate
      threshold: 95%
    - type: recovery
      maxTime: 120s

This declares that for 60 seconds, the Redis cluster should be stopped. The pipeline will verify that the application maintains at least 95% successful requests and that recovery occurs within two minutes. If either condition fails, the build fails.

Tools like LitmusChaos, Chaos Mesh, or Gremlin already provide the primitives to run such experiments. The key difference with Game Day as Code is that these definitions are checked in, versioned, and executed automatically inside CI/CD.
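
For instance, the cache-outage scenario could translate roughly into a Chaos Mesh PodChaos resource along these lines. The namespace and the app: redis label are assumptions about how the Redis cluster is deployed, and the field names should be checked against the Chaos Mesh version you actually run:

# Rough Chaos Mesh equivalent of the cache-outage scenario: make every
# pod labelled app=redis unavailable for 60 seconds.
apiVersion: chaos-mesh.org/v1alpha1
kind: PodChaos
metadata:
  name: cache-outage
  namespace: chaos-testing
spec:
  action: pod-failure
  mode: all
  duration: "60s"
  selector:
    namespaces:
      - default
    labelSelectors:
      app: redis

The chaos tool only injects the failure; the success criteria and recovery checks still belong to the pipeline, as in the earlier sketch.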

Testing People and Processes

Classic Game Days were not just about systems but also about people. How did teams respond to alerts? How quickly did they identify the problem? Did documentation help? With automation, it is tempting to focus purely on systems, but Game Day as Code can and should extend to people. For example, pipelines can trigger alerts during chaos experiments: PagerDuty or Teams notifications can simulate real incidents, and teams can be measured on response times. Runbooks can be tested automatically; if an alert points to outdated instructions, that becomes visible immediately. In this way, human processes are also validated regularly.
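
As a small illustration, a pipeline step could announce the experiment to the on-call channel through a Teams incoming webhook. The secret name and message are placeholders, and the step assumes the same GitHub Actions context as the earlier sketches:

# Hypothetical pipeline step: notify the on-call channel when a chaos
# experiment starts, so the human response can be exercised too.
- name: Announce chaos experiment
  env:
    TEAMS_WEBHOOK_URL: ${{ secrets.TEAMS_WEBHOOK_URL }}
  run: |
    curl -sS -X POST -H "Content-Type: application/json" \
      -d '{"text": "Game Day: cache-outage scenario starting in staging. Treat as a real incident."}' \
      "$TEAMS_WEBHOOK_URL"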

Guardrails

Running chaos on every build sounds terrifying. But safeguards exist:

  • Blast radius control: start small, with experiments confined to staging or non-critical services.

  • Kill switches: always allow immediate stop of a scenario if it causes unintended impact.

  • Progressive rollout: run chaos first in CI, then staging, then production at small scale.

  • Error budgets as gatekeepers: only run chaos in production if error budgets are healthy.

With these guardrails, you can safely embed chaos in pipelines without jeopardising reliability. The guardrails themselves can live in the scenario spec, as sketched below.
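
Here is a hypothetical extension of the cache-outage scenario; these fields are not part of any existing tool’s schema, they simply show how guardrails could be expressed declaratively:

# Illustrative guardrail fields layered onto the cache-outage scenario.
spec:
  target: redis-cluster
  action: stop
  duration: 60s
  guardrails:
    environments: [ci, staging]       # blast radius: never run in production
    abortOn:
      - type: metric
        name: http_request_success_rate
        below: 90%                    # kill switch: stop early if impact spreads
    requires:
      - type: errorBudget
        slo: checkout-availability
        minRemaining: 50%             # only run while the error budget is healthy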

Perhaps the greatest impact of Game Day as Code is cultural. When resilience checks are automated, they become normal. Engineers stop fearing outages because they encounter them daily in safe environments. Recovery becomes muscle memory, not panic. You shift from reactive firefighting to proactive design. Stakeholders gain confidence: instead of hoping the system will survive a disaster, they know it has survived dozens of simulated ones during the last release cycle.

The natural evolution of Game Day as Code is towards self-healing pipelines. A system that fails a chaos test could automatically generate a ticket, trigger a remediation script, or even propose code changes using AI-driven analysis. In time, pipelines might not just simulate outages but also adapt architectures in response, ensuring resilience evolves continuously.
