Chaos Engineering | Game Day

Amit Himani

What:

A chaos engineering game day is a practice in which a team deliberately introduces failures and disruptions into a system to test its resilience and identify potential weaknesses. It is typically carried out by a cross-functional team of developers, operations personnel, and other stakeholders, who plan and execute the failure scenarios together.

During a chaos engineering game day, the team may use techniques such as fault injection, traffic throttling, or network partitioning to simulate failure scenarios. The team then observes how the system responds to these disruptions and notes any unexpected behavior or failures. This shows where the system is strong, where it is weak, and which areas need improvement.
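To make fault injection concrete, here is a minimal Python sketch of the idea. It is an illustration only, not how any particular chaos tool works: `FaultInjector`, `fetch_price`, and `run_experiment` are hypothetical names invented for this example, and a real game day would normally inject faults at the infrastructure level rather than inside application code.

```python
import random
import time


class FaultInjector:
    """Wraps calls to a dependency and injects latency or errors (illustrative only)."""

    def __init__(self, error_rate=0.0, latency_s=0.0, rng=None):
        if not 0.0 <= error_rate <= 1.0:
            raise ValueError("error_rate must be between 0 and 1")
        self.error_rate = error_rate      # probability that a call fails
        self.latency_s = latency_s        # extra delay added to every call
        self.rng = rng or random.Random()
        self.enabled = True               # kill switch to stop the experiment

    def call(self, func, *args, **kwargs):
        if self.enabled:
            if self.latency_s > 0:
                time.sleep(self.latency_s)
            if self.rng.random() < self.error_rate:
                raise ConnectionError("injected fault")
        return func(*args, **kwargs)


def fetch_price(item):
    # Hypothetical stand-in for a real downstream service call.
    return {"item": item, "price": 10}


def run_experiment(injector, attempts=100):
    """Call the dependency repeatedly and count how many calls failed."""
    failures = 0
    for _ in range(attempts):
        try:
            injector.call(fetch_price, "widget")
        except ConnectionError:
            failures += 1
    return failures


if __name__ == "__main__":
    injector = FaultInjector(error_rate=0.2, rng=random.Random(42))
    print("failed calls:", run_experiment(injector))
```

The `enabled` flag stands in for the ability to halt an experiment immediately, which matters as soon as a disruption has more impact than planned.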

Goals

The goal of a chaos game day is to proactively test and improve the resilience, stability, and reliability of a system by intentionally introducing failures and disruptions in a controlled environment. The weaknesses this exposes can be addressed before they cause a real incident, which leads to a more robust system that is better equipped to handle real-world scenarios. The ultimate aim is to improve the end-user experience by reducing downtime and increasing system availability and performance.

Team

In a typical GameDay, each engineer on the team responsible for developing and supporting the application assumes one of four roles: Owner, Coordinator, Reporter, or Observer.

The Owner is responsible for the overall GameDay and has the authority to decide on the plan, the schedule, and whether to stop the experiment.

Coordinators are in charge of preparing the GameDay, coordinating the other participating roles, and initiating the attacks (Gremlin's term for fault-injection experiments) in the Gremlin application.

Reporters take notes and record key observations and results from the GameDay, which they then enter into the Gremlin application.

Observers collect data from Gremlin and any monitoring tools during the GameDay, inform other participants of key observations, and verify the results. Several people can act as Observers, but a GameDay should have only one Owner, one Coordinator, and one Reporter.
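The role rules above are simple enough to check mechanically before a GameDay starts. The following Python sketch is one way to do that; `validate_roster` and the participant names are hypothetical and are not part of Gremlin.

```python
from collections import Counter

# Roles that should be held by exactly one person, per the rules above.
SINGLE_ROLES = ("Owner", "Coordinator", "Reporter")
ALL_ROLES = SINGLE_ROLES + ("Observer",)


def validate_roster(roster):
    """Check a {participant: role} mapping and return a list of problems.

    An empty list means the roster follows the role rules.
    """
    problems = []
    counts = Counter(roster.values())
    for role in sorted(counts):
        if role not in ALL_ROLES:
            problems.append(f"unknown role: {role}")
    for role in SINGLE_ROLES:
        if counts[role] != 1:
            problems.append(f"expected exactly one {role}, found {counts[role]}")
    return problems


if __name__ == "__main__":
    # Hypothetical participants; any number of Observers is allowed.
    roster = {
        "alice": "Owner",
        "bob": "Coordinator",
        "carol": "Reporter",
        "dave": "Observer",
        "erin": "Observer",
    }
    print(validate_roster(roster) or "roster is valid")
```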

In addition to technical personnel, non-technical stakeholders such as team leads and product managers can also participate in GameDays.

Agenda Of The Day

The agenda of a Chaos GameDay can vary depending on the goals and scope of the testing. However, a typical GameDay agenda might include the following steps:

  1. Preparation: The team prepares for the GameDay by defining the scope, objectives, and success criteria of the testing. They may also identify potential failure scenarios and create a plan for introducing them into the system.

  2. Simulation: The team introduces the failure scenarios into the system, using techniques such as fault injection, traffic throttling, or network partitioning. They may also use monitoring tools to collect data on how the system responds to the failures.

  3. Observation: The team observes how the system responds to the simulated failures, taking note of any unexpected behaviors or failures. They may also collect data on the impact of the failures on the end user experience.

  4. Analysis: The team analyzes the data collected during the GameDay to identify any weaknesses or areas for improvement. They may also discuss and prioritize potential solutions.

  5. Action: Once the experiments are complete, the team reviews all the executed tests, beginning with those that produced unexpected outcomes. The review resembles an incident analysis, but it takes place in a calm, organized setting rather than in the middle of an outage. Any issues identified during the review should be filed in the bug tracker so that they get resolved and can be retested in future experiments.
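The review order described in the last step can be sketched in a few lines of Python. This is an illustration under assumptions: the `Experiment` record, its fields, and the `[GameDay]` ticket prefix are hypothetical, and the sample experiments are made up for the example rather than taken from a real GameDay.

```python
from dataclasses import dataclass


@dataclass
class Experiment:
    name: str
    hypothesis: str    # what the team expected to happen
    observation: str   # what the Reporter recorded
    as_expected: bool  # did the observation match the hypothesis?


def review_order(experiments):
    """Put experiments with unexpected outcomes first.

    sorted() is stable, so the original order is kept within each group.
    """
    return sorted(experiments, key=lambda e: e.as_expected)


def issues_to_file(experiments):
    """Build one bug-tracker title per experiment with an unexpected outcome."""
    return [
        f"[GameDay] {e.name}: expected '{e.hypothesis}', observed '{e.observation}'"
        for e in experiments
        if not e.as_expected
    ]


if __name__ == "__main__":
    results = [
        Experiment("kill one replica", "no errors", "no errors", True),
        Experiment("add 300ms latency", "p99 under 1s", "timeouts", False),
        Experiment("drop cache", "slower reads", "slower reads", True),
    ]
    for e in review_order(results):
        print("review:", e.name)
    for title in issues_to_file(results):
        print("file:", title)
```

Keeping the hypothesis next to the observation makes the retest in a later GameDay straightforward: the same experiment is rerun and the issue is closed once the outcome matches the hypothesis.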

Summary

Game days can help organizations validate their solutions, especially when their products are targeted at a small number of significant customers. By conducting game days during the software development or validation phases, we can create more robust systems that better serve our clients.

Game days are designed to test and break systems, which makes them distinct from regular QA, whose primary focus is ensuring that new features meet the requirements of our customers. At Arundo, we strive to build software that is not just "good enough" but is also resilient, robust, and capable of withstanding real-world scenarios.

