One Framework to rule them all. Root Cause Analysis.

Harsh Rathi

Root cause analysis can be a daunting undertaking unless it is performed within a structure. With a battery of approaches available, I see people turn their own arsenal against themselves. Here is a framework that I have seen work time and again for root cause analysis problems of many different flavors.

Defining boundaries is the most important step in making a wide problem statement manageable. With software being conceptually as wide as it is, and with so many possible causes for anything that happens, it is essential for us to be systematic in our diagnosis. We cannot start exploring 10 different areas and 20 probable causes all at once and hope to distill all that information within our mental buffers and spit out the most probable cause.

Software and the ecosystem surrounding it is easily one of the most elegant interconnected systems of entities that humans have come up with. It may have been only a few decades since the human species created this ecosystem and began developing it, but despite it being a relatively novel phenomenon, I still think of it as an extension of everything we are without it. I encourage you to approach software as you would a thing of beauty. I assure you your exercise of cause analysis will become an experience of joy. For software is beautiful and on that account, our diagnosis shall be a journey of inward and outward exploration.

Note that the framework is composed of different buckets of assorted questions. You can cherry-pick as applicable and, with time, add your own questions to the buckets. Still, worry not: I have made an effort to cover most scenarios that you will generally face. If not to the word, at least a version of these questions will apply to your situation, and you should be able to handle most analysis situations with this resource. Root cause analysis questions by their nature also thrive on re-confronting responses with WHYs. So it’s always a good idea to keep asking Why questions until there are no more reasonable WHYs left (Read more: 5 Whys).
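The repeated-WHY habit can be sketched as a small loop. This is only an illustration: the `ask_whys` helper, the `known_causes` mapping, and the example chain are all hypothetical, and in practice each answer comes from investigation rather than a lookup table.

```python
def ask_whys(problem, known_causes, max_depth=5):
    """Follow the chain of causes from `problem`, one "why" at a time."""
    chain = [problem]
    current = problem
    for _ in range(max_depth):
        cause = known_causes.get(current)
        if cause is None:  # no more reasonable WHYs left
            break
        chain.append(cause)
        current = cause
    return chain


# Hypothetical findings: each key is an observation, each value answers "why?"
known_causes = {
    "Sign-ups dropped": "Fewer visitors reached the sign-up page",
    "Fewer visitors reached the sign-up page": "Search traffic fell",
    "Search traffic fell": "Our pages lost their search ranking",
}

chain = ask_whys("Sign-ups dropped", known_causes)
for depth, statement in enumerate(chain):
    print("  " * depth + statement)
```

The last entry in `chain` is the deepest cause found so far. It is a candidate root cause, not a proven one.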

Before we start our journey of exploration, we must evaluate the ground upon which we stand. Do this by confronting the problem with an assortment of questions such as:

  • What was used to calculate this metric historically? Has it changed?

  • Was the metric calculation formula revised in any way?

  • What is the accuracy, reliability, and uniformity of the measurement system?
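One way to test whether the metric moved or only the yardstick moved is to recompute it with both the historical and the current formula on the same raw data. The sketch below assumes a conversion-rate metric whose definition changed from per-session to per-visitor. Both formulas and all the numbers are hypothetical.

```python
def conversion_old(purchases, sessions, visitors):
    # Hypothetical historical definition: purchases per session
    return purchases / sessions


def conversion_new(purchases, sessions, visitors):
    # Hypothetical current definition: purchases per unique visitor
    return purchases / visitors


# The same raw data fed to both definitions
raw = {"purchases": 120, "sessions": 4000, "visitors": 2400}

old_value = conversion_old(**raw)
new_value = conversion_new(**raw)
formula_gap = new_value - old_value

print(f"old formula: {old_value:.3f}, new formula: {new_value:.3f}, gap: {formula_gap:.3f}")
```

If the gap between the two definitions is comparable to the change you are investigating, the measurement system itself becomes a serious candidate.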

Outward Exploration

In May 2003, an election candidate in Belgium received 4,096 (2 raised to the power of 12) more votes than they should have. This also happened to be a mathematical impossibility. The elections had used computers for polling. The cause was eventually attributed to ionizing cosmic radiation. Even though we won’t find this to be the cause in our everyday problems (we also now have protections in place to shield our systems from such phenomena), it still illustrates how out-of-this-world the causes of our problems can be. And we don’t know till we know.
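The number 4,096 is itself a clue: an excess that is exactly a power of two is what a single flipped bit in a stored counter would produce. Here is a minimal sketch of that arithmetic. The tally of 514 is a made-up value, not the real vote count.

```python
BIT_POSITION = 12  # zero-indexed; this bit is worth 2**12


def flip_bit(value, position):
    """Return `value` with the bit at `position` inverted."""
    return value ^ (1 << position)


true_count = 514  # hypothetical tally, small enough that bit 12 is still 0
corrupted_count = flip_bit(true_count, BIT_POSITION)
excess = corrupted_count - true_count

print(f"{true_count:>5} = {true_count:016b}")
print(f"{corrupted_count:>5} = {corrupted_count:016b}")
print(f"excess = {excess}")
```

Because bit 12 of the original tally is 0, flipping it adds exactly 2**12 to the count.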

Now, confront your problem with questions within buckets of limited and no control.

Limited Control:

  • Competitors acted in a way that might have had an effect on our metric.

    • Feature updates by competitors.

    • Pricing recalibration by competitors.

    • Marketing influences by competitors.

    • New competitors in the market.

    • Competitor penetration through novel channels and/or integrations.

  • External Discovery channel malfunction.

    • Search engines not displaying my results accurately.

    • Marketplace down/malfunction.

    • Social media ranking algorithm changes if my metric depends on it.

  • Market trend changes

    • Covid abruptly shifted job seekers’ preference towards remote jobs. Did my job portal support that?

    • Tip-over of gradual market shifts in a direction.

    • News breakouts.

    • Standardization of industry practice(s) or the opposite.

  • Seasonal Changes

    • Holiday effect

    • Purchasing habits

  • Was a particular demographic segment affected?

    • by age

    • by income

    • by gender

    • by education

    • by ethnicity

    • by religion

    • by family dynamics

    • by job title and/or status

    • by industry

    • by lifestyle, values, interests, beliefs, personality traits, motivations, etc.

    • by geography: zip code, city, area, country, climate, proximity to somewhere/something, etc.

  • Parent operating system changes.

    • Android changed its notification service policies.

    • Virtualization system not behaving in an expected way.
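To answer a question such as "was a particular demographic segment affected?", it helps to break the metric down by segment and compare the periods before and after the change. The sketch below assumes records with a `period` field and a numeric `value`. The field names, the `age_group` key, and the data are all hypothetical.

```python
from collections import defaultdict


def metric_by_segment(records, segment_key):
    """Sum the metric per segment for the 'before' and 'after' periods."""
    totals = defaultdict(lambda: {"before": 0, "after": 0})
    for record in records:
        totals[record[segment_key]][record["period"]] += record["value"]
    return dict(totals)


def relative_change(totals):
    """Relative change per segment; segments with no 'before' value are skipped."""
    return {
        segment: (t["after"] - t["before"]) / t["before"]
        for segment, t in totals.items()
        if t["before"]
    }


# Hypothetical records: one row per segment per period
records = [
    {"age_group": "18-24", "period": "before", "value": 200},
    {"age_group": "18-24", "period": "after", "value": 100},
    {"age_group": "25-34", "period": "before", "value": 300},
    {"age_group": "25-34", "period": "after", "value": 297},
]

changes = relative_change(metric_by_segment(records, "age_group"))
for segment, change in sorted(changes.items(), key=lambda item: item[1]):
    print(f"{segment}: {change:+.1%}")
```

A drop concentrated in one segment narrows the search considerably, while a uniform drop across all segments points elsewhere. The same breakdown works for any other segment key, such as geography or industry.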

Note that sometimes the change in a metric may be within a tolerable range, and nothing needs to be done. Sometimes, in the short term, you may be okay with a change in a metric and might not want to dedicate resources to countering it.
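A simple way to decide whether a change is within a tolerable range is to compare the current value against the metric's own history. The sketch below treats anything within `k` sample standard deviations of the historical mean as tolerable. The choice of `k = 2` and the weekly numbers are assumptions for illustration, and the right threshold is a product decision.

```python
from statistics import mean, stdev


def within_tolerance(history, current, k=2.0):
    """True if `current` lies within k sample standard deviations of the historical mean."""
    center = mean(history)
    spread = stdev(history)
    return abs(current - center) <= k * spread


weekly_signups = [980, 1010, 1005, 995, 1020, 990]  # hypothetical history

print(within_tolerance(weekly_signups, 985))  # a small dip
print(within_tolerance(weekly_signups, 900))  # a large drop
```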

Out of Control:

  • Political Changes

    • India banned several Chinese apps.

    • The Ukraine-Russia conflict saw Ukraine losing internet connectivity and users’ priorities shifting.

    • Several companies pulled out of Russia because of the Ukraine-Russia conflict and other dependent businesses were affected.

    • Turkey blocked social media for several hours after an airstrike in Syria.

  • Disasters taking down data centers.

Inward Exploration

As a product person, an inward journey through software shows you how past experience has led your team to a technology stack, the platform upon which modules are built. It shows how those modules combine to form features, and how the features address needs and form the entire product. You see not just how two things work together at a time; you see how synergy is maintained among all of them. And so you see the unique identity of a product and why it is not the same as any other collection of the same features available in the market.

For the purpose of our exercise we shall confront the problem with questions that cover areas of the product’s internal attributes:

  • Newly launched product/features.

    • User cannibalization

    • Improper user education/onboarding.

  • Specific devices/browsers getting affected.

    • Android-specific, iOS-specific, Windows-specific, etc.

    • Chrome-specific, Safari-specific, Firefox-specific, etc.

Note that the apparent overlap between this internal bucket and "Parent operating system changes" under Limited Control is not an actual overlap. Here the cause of the metric variation is our own handling of particular devices/browsers, not changes in the devices/browsers themselves.

  • Patterns in feedback.

    • Bug reports’ analysis

    • Feedback analysis

Throughout this exercise, our goal has been to have enough question buckets that, theoretically, every possible root cause falls into one of them. By doing this we ensure that nothing falls outside the scope of our framework. We also make sure that our buckets do not overlap with each other. By ensuring this exclusivity, we make sure that when we narrow something down, we in fact narrow it down. If there is overlap in our buckets, we at best establish correlations: the narrowed-down cause may be true, but something else might be true as well. This is why all buckets must be explored within the context of the problem.

The importance of this understanding demands from me some extra commentary, and so I shall allocate some more word count to this.

We don’t know what we don’t know. Easy to say, hard to practice. One has to be cautious about mutual exclusivity among things. When we learn that our problem did not exist before a particular update and only popped up after it, we rush towards the intermediate conclusion that “it must have to do with the update” and start looking for an answer only within those bounds. While the problem may have emerged after an update, the update cannot be conclusively taken as the reason. The correlation at best increases the probability of the root cause being within that bucket, but the problem may just as well have nothing to do with the update at all. A political struggle may have been the root cause all along, and you will have spent all your bandwidth exploring dependencies within the updated version. So I reiterate: even though confirmation may make you feel that the root cause lies in one of the buckets you have selected, time must be allocated to specifically eliminate the other buckets as places where your root cause could be.
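The discipline of eliminating every other bucket before concluding can be made explicit with a small checklist. This is a sketch rather than a prescribed tool: the `Status` values and the helper names are hypothetical, and the bucket names simply mirror the sections of this article.

```python
from enum import Enum


class Status(Enum):
    UNEXPLORED = "unexplored"
    ELIMINATED = "eliminated"
    CANDIDATE = "candidate"


def ready_to_conclude(buckets):
    """True only when exactly one bucket is a candidate and no bucket is unexplored."""
    statuses = list(buckets.values())
    return (
        statuses.count(Status.CANDIDATE) == 1
        and Status.UNEXPLORED not in statuses
    )


def still_open(buckets):
    """Buckets that have not yet been explicitly ruled in or out."""
    return [name for name, status in buckets.items() if status is Status.UNEXPLORED]


buckets = {
    "Measurement system": Status.ELIMINATED,
    "Limited control": Status.UNEXPLORED,
    "Out of control": Status.UNEXPLORED,
    "Internal": Status.CANDIDATE,  # e.g. the problem appeared right after an update
}

print(ready_to_conclude(buckets))  # the update looks guilty, but two buckets are unexplored
print(still_open(buckets))
```

The check stays false until every remaining bucket has been looked at and consciously ruled out, however convincing the current candidate feels.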

This is somewhat similar to the MECE principle, but as I mentioned before, I like to think of it as a ruminative journey not about but through the software. With all that, I once again urge you to approach software as you would something exquisite. For this is a renaissance and your inherent desire to understand it will compel you to flirt questions with it to the point of assimilation.
