Learning from your own mistakes: incident post-mortems explained
In technology and business, one thing remains constant: incidents and failures are an inevitable part of the journey. But it’s not the failures that define us; it’s what we learn from them and how we grow as a result. In technology, this process of learning from your mistakes is encapsulated in what we call “incident post-mortems.” These post-mortems are like mirrors reflecting the past, helping us see the imperfections and guiding us towards a better, more robust future.
In this article, I will dive deep into the world of incident post-mortems and explore why they are an essential practice for any organisation that values resilience, efficiency and continuous improvement. I’ll discuss the key steps involved in the post-mortem process, from defining what an incident is to sharing the reviewed findings with your entire organisation.
What is an incident?
Let's begin by defining what I mean by an incident in the context of technology. An incident is any event that leads to a failure or a decreased level of service and requires an immediate response. These incidents can vary in scope and impact, from minor disruptions to critical failures that can cost a business dearly.
Understanding the nature of incidents is crucial for organisations. It's not just about identifying what went wrong; it's about comprehending the implications of these incidents. The impact of an incident can be far-reaching, affecting customer satisfaction, financial stability, and the reputation of your brand. Hence, the first step in the incident post-mortem process is to define the incident, describe what happened, and analyse its impact.
The Post-Mortem Process
Incident post-mortems, or retrospectives, are a structured approach to analysing and learning from these incidents: assessing what went wrong, why it happened, and how we can prevent similar incidents in the future. While this may seem straightforward, let's break down each step for a more in-depth understanding.
Describe the Overview
The incident post-mortem starts with a clear and concise overview of what happened and what the impact was. To make this information easily digestible, consider using graphs and other visual representations to show the impact. Graphs provide a visual narrative of the incident, making it easier for stakeholders to understand the scale of the problem.
Visual aids are not only useful for internal discussions but also for conveying information to non-technical stakeholders. They help in breaking down complex technical issues into understandable terms, making it easier for management and decision-makers to grasp the situation.
Timeline of Significant Events
In any incident, understanding the sequence of events is essential. The timeline of significant events, complete with timestamps and descriptions of what happened and who did what, provides a chronological record of the incident. It's best to track this information during the actual incident as much as possible, as this ensures accuracy and minimises the risk of important details being forgotten.
A comprehensive timeline helps in pinpointing the moment when the incident occurred and the actions taken in response. This is invaluable for the subsequent stages of the post-mortem process, particularly in the root cause analysis.
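To make this concrete, here is a minimal sketch of how such a timeline could be recorded and rendered. The `TimelineEvent` type, the `render_timeline` helper, and the sample incident are all hypothetical, made up for illustration; your incident tooling will likely have its own format.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class TimelineEvent:
    timestamp: datetime  # when it happened (UTC)
    actor: str           # who did it, or which system
    description: str     # what happened


def render_timeline(events):
    """Return the events as text lines in chronological order."""
    ordered = sorted(events, key=lambda e: e.timestamp)
    return [
        f"{e.timestamp:%Y-%m-%d %H:%M} UTC | {e.actor} | {e.description}"
        for e in ordered
    ]


# Hypothetical incident notes, entered out of order as often happens.
events = [
    TimelineEvent(datetime(2024, 1, 15, 10, 12, tzinfo=timezone.utc),
                  "on-call engineer", "Rolled back the latest deploy"),
    TimelineEvent(datetime(2024, 1, 15, 10, 2, tzinfo=timezone.utc),
                  "monitoring", "Alert fired: error rate above threshold"),
    TimelineEvent(datetime(2024, 1, 15, 10, 20, tzinfo=timezone.utc),
                  "monitoring", "Error rate back to normal"),
]

for line in render_timeline(events):
    print(line)
```

Sorting on output rather than on input means people can add notes in any order during the incident and still get a clean chronological record afterwards.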
Root Cause Analysis
The heart of any incident post-mortem is the root cause analysis. Here, we use a method like the "5 Whys" to dig deep into the incident, asking a series of "why" questions to get to the root of the problem. By systematically asking why something happened, we can uncover the underlying issues that led to the incident.
The goal of the root cause analysis is not to assign blame but to identify the systemic issues that allowed the incident to occur. This understanding is crucial for developing effective solutions that prevent similar incidents in the future.
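As an illustration, a "5 Whys" chain can be written down as a simple ordered list of answers, where each answer becomes the subject of the next "why". The sketch below assumes a hypothetical helper, `format_five_whys`, and an invented incident; the point is the shape of the chain, which should end at a systemic issue rather than at a person.

```python
def format_five_whys(problem, answers):
    """Build a numbered why-chain; the last answer is the root cause candidate."""
    if not answers:
        raise ValueError("at least one answer is needed")
    lines = [f"Problem: {problem}"]
    for i, answer in enumerate(answers, start=1):
        lines.append(f"Why #{i}? {answer}")
    lines.append(f"Root cause candidate: {answers[-1]}")
    return lines


# Invented example: every statement here is made up for illustration.
chain = format_five_whys(
    "Checkout requests failed for 18 minutes",
    [
        "The database ran out of connections",
        "The new release opened a connection per request",
        "Connection pooling was switched off in the release config",
        "The config change was merged without review",
        "Config changes are not covered by the code review process",
    ],
)
print("\n".join(chain))
```

Note that the last answer describes a gap in the process, not an individual's mistake, which is what makes it useful as input for lessons learned and action items.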
Lessons Learned
While the root cause analysis focuses on the "whys," the lessons learned phase is about establishing principles and reasoning based on the incident. These are not action items but are more like guiding principles that can inform the creation of actionable solutions. For example, a lesson learned could be, "We should not use setting X in our case, because of Y and Z," which may lead to an action item like "Disable setting X everywhere."
Lessons learned help in shaping the future decision-making process. They provide a foundation for making informed choices and avoiding the same pitfalls.
Define Action Items
Action items are the tangible steps that need to be taken to address the issues identified in the incident post-mortem. It's important that these action items are realistic and implementable in a short amount of time. While it might be tempting to include action items like "rewrite everything from scratch," such broad strokes are not always practical.
Action items should be specific, measurable, achievable, relevant, and time-bound (SMART). This ensures that they can be realistically executed and tracked for progress.
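Part of the SMART check can be automated. The sketch below is one possible approach, assuming a hypothetical `ActionItem` record with an owner, a success criterion, and a due date. It only flags the gaps that can be checked mechanically; whether an item is achievable and relevant still needs human judgement during the review.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass
class ActionItem:
    title: str
    owner: Optional[str] = None              # who is responsible
    success_criterion: Optional[str] = None  # how we know it is done
    due: Optional[date] = None               # when it should be done


def smart_gaps(item):
    """List the mechanically checkable gaps of an action item."""
    gaps = []
    if not item.owner:
        gaps.append("no owner")
    if not item.success_criterion:
        gaps.append("not measurable: no success criterion")
    if item.due is None:
        gaps.append("not time-bound: no due date")
    return gaps


vague = ActionItem("Rewrite everything from scratch")
concrete = ActionItem(
    "Disable setting X everywhere",
    owner="platform team",
    success_criterion="setting X is off in all production configs",
    due=date(2024, 2, 14),
)

print(smart_gaps(vague))     # three gaps reported
print(smart_gaps(concrete))  # empty list
```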
Review with Knowledgeable Engineers
A critical practice in the incident post-mortem process is reviewing the findings with knowledgeable engineers within your organisation. Ideally, involve experts from outside your immediate team or department. This external perspective is valuable for challenging assumptions and ensuring a thorough understanding of the bigger picture.
When reviewing, focus on the key "whys" and avoid getting bogged down in minor details. The objective is to understand the systemic issues and develop solutions that have a broad impact.
Set Deadlines
The Countdown to Redemption: Introducing deadlines for post-mortem review ensures that the lessons learned are promptly integrated into your system, strengthening your tech arsenal for future challenges.
Think of deadlines as the ticking clocks in a suspenseful thriller, reminding your team of the urgency of addressing the issues.
Share Across the Organisation
Sharing your post-mortems is akin to storytelling. Imagine your organisation as a receptive audience, eager to learn from your narrative. Think of sharing post-mortems as passing on the wisdom of the incident to future generations, much like the oral tradition of storytelling.
Implement Action Items
Once the action items have been defined and reviewed, it's crucial to implement them. A best practice is to introduce deadlines for action items, based on their priority. For example, high-priority action items may have a 30-day deadline, while medium- and low-priority items could have 60- and 90-day deadlines, respectively.
Prioritising and tracking the closure of action items is essential to ensure that the organisation benefits from the improvements identified in the post-mortem.
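The priority-based deadlines above are easy to compute automatically. Here is a minimal sketch using the 30/60/90-day example from this section; the `DEADLINE_DAYS` table and the function names are my own illustration, and the numbers should be adjusted to whatever your organisation agrees on.

```python
from datetime import date, timedelta

# Example values from the text; treat them as a starting point, not a rule.
DEADLINE_DAYS = {"high": 30, "medium": 60, "low": 90}


def due_date(created, priority):
    """Deadline for an action item created on `created`."""
    return created + timedelta(days=DEADLINE_DAYS[priority])


def is_overdue(created, priority, today):
    """True once `today` is past the deadline."""
    return today > due_date(created, priority)


created = date(2024, 1, 15)
for priority in ("high", "medium", "low"):
    print(priority, due_date(created, priority).isoformat())

print(is_overdue(created, "high", today=date(2024, 3, 1)))
```

An unknown priority raises a `KeyError` here, which is deliberate in this sketch: an action item without a valid priority should be fixed rather than silently given a default deadline.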
Track and Analyse
To gauge the effectiveness of your incident post-mortem process, it's important to track and analyse the bigger picture. Picture your journey as an epic adventure, with each incident as a new chapter. Continuously monitor and analyse your progress: this includes the number of incidents over time and the percentage of completed action items across incidents, teams, services, and so on.
Tracking these metrics provides insights into the overall health of your organisation's technology and how effectively it learns from its mistakes.
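Both metrics can be computed from very little data. The following sketch assumes you can export incident dates and a list of action items tagged with a group (a team, a service, or an incident) and a done flag; the function names and the sample data are hypothetical.

```python
from collections import Counter
from datetime import date


def incidents_per_month(incident_dates):
    """Count incidents per calendar month, ordered by month."""
    counts = Counter(d.strftime("%Y-%m") for d in incident_dates)
    return dict(sorted(counts.items()))


def completion_rate(action_items):
    """Percentage of completed action items per group.

    `action_items` is an iterable of (group, is_done) pairs.
    """
    totals, done = Counter(), Counter()
    for group, is_done in action_items:
        totals[group] += 1
        if is_done:
            done[group] += 1
    return {g: round(100 * done[g] / totals[g], 1) for g in totals}


# Made-up sample data.
incident_dates = [date(2024, 2, 3), date(2024, 1, 5), date(2024, 1, 20)]
items = [("payments", True), ("payments", False), ("search", True)]

print(incidents_per_month(incident_dates))
print(completion_rate(items))
```

Grouping by an arbitrary key keeps the same function usable for teams, services, or individual incidents, depending on what you pass in as the group.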
Wider Organisational Review
As a final step, consider picking the most impactful incidents for wider, possibly organisation-wide, reviews. These reviews can provide additional findings and allow for the sharing of valuable experiences. To maintain transparency and clarity, define criteria for selecting incidents that will undergo this in-depth review, such as "lost revenue greater than X" or "users impacted by more than Y%."
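Writing the selection criteria down as code is one way to keep them transparent. In the sketch below, the `IncidentImpact` type and the threshold values are placeholders standing in for the "X" and "Y" above, not recommendations.

```python
from dataclasses import dataclass


@dataclass
class IncidentImpact:
    name: str
    lost_revenue: float        # in your reporting currency
    users_impacted_pct: float  # share of users affected, 0 to 100


def needs_wider_review(impact, revenue_threshold, users_pct_threshold):
    """True if the incident exceeds either threshold."""
    return (
        impact.lost_revenue > revenue_threshold
        or impact.users_impacted_pct > users_pct_threshold
    )


# Placeholder thresholds standing in for "X" and "Y".
REVENUE_X = 10_000.0
USERS_Y_PCT = 5.0

incidents = [
    IncidentImpact("checkout outage", lost_revenue=25_000.0, users_impacted_pct=2.0),
    IncidentImpact("slow search", lost_revenue=0.0, users_impacted_pct=1.5),
]

selected = [i.name for i in incidents
            if needs_wider_review(i, REVENUE_X, USERS_Y_PCT)]
print(selected)
```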
A Path to Continuous Improvement
Summing up, I'd like to highlight that incident post-mortems are a crucial tool for organisations looking to learn from their mistakes and continuously improve. By systematically analysing incidents, identifying root causes, and implementing actionable solutions, businesses can strengthen their systems, enhance customer satisfaction, and protect their reputation. The journey from incident to post-mortem is not without its challenges, but the rewards in terms of resilience, efficiency, and innovation are well worth the effort.
However, for this process to be effective, it must be embraced at all levels of the organisation. From initial resistance to management buy-in and streamlined processes, every facet plays a role in the success of incident post-mortems. As the tech world continues to evolve, it's those organisations that learn from their mistakes and adapt to change that will thrive in the long run.
So, embrace incident post-mortems, and turn your setbacks into stepping stones on the path to success!
Written by Andrey Stolbovsky