Breezy Guide to Effective RCA Writing

Gaurav Massand

What is RCA and Why is it important?

System outages are an inevitable part of managing software systems. RCA, or Root Cause Analysis, is a systematic approach to identifying the reasons behind an outage and implementing solutions to prevent recurrence. It also involves analyzing the lessons learned during the outage and identifying process improvements.

Public-facing RCAs and Internal RCAs

Public-facing RCAs are intended for customers, relevant stakeholders, regulatory bodies, and similar audiences. They contain a high-level summary of the issue, its timeline, its impact, and the corrective measures taken. CrowdStrike, for example, published a public-facing RCA for a recent outage.

Internal RCAs are intended for team members within the organization. They are more detailed and technical, focusing on actionable steps for internal teams.

This blog will focus on internal RCAs.

Blame the Process, Not the People

It's very important to write blameless RCAs. This fosters a culture of collaboration and continuous process improvement. For instance, suppose an outage occurred because a wrong query updated records incorrectly. Instead of blaming the person who ran it, look for the gaps in the process that allowed the mistake. You could introduce a process where every query is reviewed by peers and tested on staging systems before it is executed in production.
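To make the process fix concrete, here is a minimal sketch of such a pre-execution check. Everything in it is hypothetical: the QueryChange record, its fields, and the blocking_gaps helper are illustrative names, not part of any real tool. It assumes Python 3.9 or newer. Note that the check reports gaps in the process rather than pointing at a person.

```python
from dataclasses import dataclass, field


@dataclass
class QueryChange:
    """A production data change awaiting execution (hypothetical record)."""
    author: str
    sql: str
    approvals: set[str] = field(default_factory=set)
    tested_on_staging: bool = False


def blocking_gaps(change: QueryChange) -> list[str]:
    """Return the process gaps that should block production execution."""
    gaps = []
    # An approval from the author does not count as peer review.
    if not (change.approvals - {change.author}):
        gaps.append("no peer approval")
    if not change.tested_on_staging:
        gaps.append("not tested on staging")
    return gaps


change = QueryChange(
    author="dev_a",
    sql="UPDATE orders SET status = 'shipped' WHERE id = 42",
)
print(blocking_gaps(change))  # ['no peer approval', 'not tested on staging']

change.approvals.add("dev_b")
change.tested_on_staging = True
print(blocking_gaps(change))  # []
```

In a real setup this kind of check would more likely live in a review tool or deployment pipeline; the point is only that the safeguard belongs to the process, so the RCA action item can be "add this gate" rather than "be more careful".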

Crafting a Comprehensive Outage Timeline

A detailed timeline of the outage provides a clear sequence of events, from the detection of the issue through resolution and post-incident review. It also helps identify where a better process would shorten resolution time. For example, if an API is in a degraded state but detecting the problem takes a long time because observability tools are lacking, the timeline makes that delay visible and points to improving the observability of the systems.
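One way to spot such delays is to compute how long each phase of the outage took. The sketch below is illustrative only: the timestamps are made up, and the "started" entry assumes you can reconstruct when the degradation actually began (from logs or metrics, for example), which is not always possible.

```python
from datetime import datetime, timedelta


def phase_durations(timeline: dict[str, datetime]) -> dict[str, timedelta]:
    """Compute how long each phase of the outage lasted."""
    return {
        "time_to_detect": timeline["detected"] - timeline["started"],
        "time_to_resolve": timeline["resolved"] - timeline["detected"],
        "total_outage": timeline["resolved"] - timeline["started"],
    }


# Hypothetical timeline for an API that was degraded before anyone noticed.
timeline = {
    "started": datetime(2024, 1, 1, 10, 0),
    "detected": datetime(2024, 1, 1, 10, 45),
    "resolved": datetime(2024, 1, 1, 11, 5),
}

for phase, duration in phase_durations(timeline).items():
    print(f"{phase}: {duration}")
# time_to_detect: 0:45:00
# time_to_resolve: 0:20:00
# total_outage: 1:05:00
```

Here detection took more than twice as long as the fix itself, which is the kind of signal that suggests investing in alerting and observability rather than in faster remediation.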

What can be done better?

This section turns insights into improvements. It outlines the steps needed to shorten the turnaround time for resolution and to prevent similar outages in the future, which promotes a proactive mindset.

Conclusion

An RCA should be about more than compliance; it should be a tool for improving systems and preventing future issues. Every detailed RCA carves out a path to one less outage in the future.
