Basics of Machine Learning and Root Cause Analysis (Week 01)

Abhiraj Bhatta
6 min read

Basics of ML

I began my machine learning journey by understanding what the machine actually learns. The data we obtain in today's world can be fitted with various functions, which help us understand trends in the data and allow us to make predictions. Consider a simple mathematical function, the equation of a line:

$$Y=ax+b$$

This consists of a dependent variable (Y), an independent variable (x), and parameters (a, b). The parameters decide the behavior of the function: in the case of a line, "a" is the slope, which gives the rate of change of Y with respect to x, and "b" is the intercept, which gives the value of Y when x = 0. These parameters are, in fact, what the machine is trying to learn.

After understanding this, we move on to the main types of machine learning: supervised, unsupervised, and reinforcement learning.

Supervised Learning: Given a dataset consisting of input features X and outputs Y (labeled data), our goal is to learn a mapping from X to Y.

There are various types of supervised learning algorithms. One of the simplest examples is linear regression, where we are given one input X and a label Y, and we try to fit a straight line through the dataset to best approximate Y. A slightly more general version is multiple linear regression, where the value of Y depends on multiple input features \(X_1, X_2, X_3, \ldots\) and so on.
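Written out, the multiple linear regression model with n input features takes the form:

$$Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \cdots + \beta_n X_n$$

where \(\beta_0\) is the intercept and \(\beta_1, \ldots, \beta_n\) are the slopes, one learned per feature.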

Broadly, we divide supervised learning problems into two types: regression problems, where the output we predict is continuous, and classification problems, where the output is discrete or categorical. Note that a single dataset may include both categorical and continuous features, which is a very common scenario in the real world. For example, a dataset of student marks might have hours of study and time on social media as continuous features and gender as a categorical feature.

Unsupervised Learning: Given an unlabeled dataset, we try to figure out a pattern or trend that can be observed in the data. An example is the K-means clustering algorithm, sketched just below.
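Here is a minimal K-means sketch using scikit-learn's KMeans; the 2-D points and the choice of two clusters are made up for illustration:

# Minimal K-means example on made-up 2-D points
import numpy as np
from sklearn.cluster import KMeans

# Two loose groups of points, one near x = 1 and one near x = 10
points = np.array([[1, 2], [1, 4], [1, 0],
                   [10, 2], [10, 4], [10, 0]])

kmeans = KMeans(n_clusters=2, n_init=10, random_state=42)
labels = kmeans.fit_predict(points)

print("Cluster labels:", labels)              # e.g. [1 1 1 0 0 0]
print("Centroids:\n", kmeans.cluster_centers_)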

Google News uses unsupervised learning to group articles on similar topics together.

An interesting example of an unsupervised learning problem is the "cocktail party problem": we are given a noisy room full of people speaking and an array of microphones distributed around the room, and the problem asks us to separate the distinct voices of the people from the overall mixture. A well-known algorithm for solving this is ICA, or Independent Component Analysis.
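As a rough illustration, here is a toy version of the problem using scikit-learn's FastICA; the two synthetic "voices" and the mixing matrix are assumptions made up for the demo:

# Toy cocktail-party demo: mix two synthetic signals, then unmix with FastICA
import numpy as np
from sklearn.decomposition import FastICA

t = np.linspace(0, 8, 2000)
s1 = np.sin(2 * t)                       # "voice" 1: smooth sine wave
s2 = np.sign(np.sin(3 * t))              # "voice" 2: square wave
S = np.c_[s1, s2]                        # true sources, shape (2000, 2)

A = np.array([[1.0, 0.5],
              [0.5, 1.0]])               # made-up mixing matrix (the "room")
X = S @ A.T                              # what the microphones record

ica = FastICA(n_components=2, whiten="unit-variance", random_state=0)
S_est = ica.fit_transform(X)             # recovered sources (up to scale/order)
print("Recovered shape:", S_est.shape)   # (2000, 2)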

Reinforcement Learning: Here, instead of trying to minimize errors or find a particular trend, we use a system of "reward signals" to shape the model's behavior until it performs how we want it to.

Think about how we train a dog: if we give it a command and it follows, we give the dog a "reward" in the form of a treat, but if it behaves poorly we withhold the reward. Similarly, in reinforcement learning, when the model gives a desirable output we send it a reward signal, and this process repeats until the model is sufficiently trained.
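To make the reward-signal idea concrete, here is a toy "two-armed bandit" loop in plain Python; the hidden win rates and the exploration rate are made-up values for the demo:

# Toy reward-signal loop: epsilon-greedy two-armed bandit
import random

random.seed(42)
true_win_rates = [0.3, 0.7]   # hidden reward probability of each action
estimates = [0.0, 0.0]        # learned value estimate of each action
counts = [0, 0]
epsilon = 0.1                 # how often we explore a random action

for step in range(1000):
    # Explore occasionally, otherwise exploit the best-looking action
    if random.random() < epsilon:
        action = random.randrange(2)
    else:
        action = max(range(2), key=lambda a: estimates[a])
    reward = 1 if random.random() < true_win_rates[action] else 0
    counts[action] += 1
    # Incremental average: nudge the estimate toward the observed reward
    estimates[action] += (reward - estimates[action]) / counts[action]

print("Estimated action values:", estimates)  # should approach [0.3, 0.7]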

After familiarizing myself with these concepts, I implemented a few examples of simple linear regression and multiple linear regression in Python using libraries such as NumPy, pandas, and scikit-learn. I have attached the code below:

# Simple Linear Regression (One input feature)
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression

# 1. Generate simple 1D input data
X = np.array([[1], [2], [3], [4], [5]])  # Shape (5, 1)
# Above we make a 2-D array, since scikit-learn expects 2-D feature arrays
y = np.array([3, 4, 2, 5, 6])           # Shape (5,)

# 2. Create a LinearRegression model
model = LinearRegression()

# 3. Fit the model (this learns β₀ and β₁)
model.fit(X, y)

# 4. Print learned parameters
print("Intercept (β₀):", model.intercept_)
print("Slope (β₁):", model.coef_[0])

# 5. Predict outputs for the same X values
y_pred = model.predict(X)

# 6. Plot the actual points and regression line
plt.scatter(X, y, color='blue', label='Actual')
plt.plot(X, y_pred, color='red', label='Predicted Line')
plt.legend()
plt.title("Linear Regression with scikit-learn")
plt.xlabel("X")
plt.ylabel("y")
plt.show()

# Multiple Linear Regression (two input features)
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
# For this example I downloaded a student marks dataset from Kaggle
# Our goal is to predict the output (marks) from the given features (study hours and number of courses)
df = pd.read_csv("Student_Marks.csv")
print(df.head())

X = df[['number_courses', 'time_study']]  #Features
y = df['Marks'] #Labels

#splitting dataset into training and testing sets (20% data for testing)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

model = LinearRegression()
model.fit(X_train, y_train) 

print("Intercept (b0): ", model.intercept_)
print("Slopes (b1, b2): ", model.coef_)

y_pred = model.predict(X_test) # predict test data

print("R² score:", r2_score(y_test, y_pred)) #Tells us how much better our model is than avg line
print("Mean Squared Error (MSE):", mean_squared_error(y_test, y_pred))
#Lower MSE = Better accuracy
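
For reference, the two metrics printed above are defined as:

$$\mathrm{MSE} = \frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2, \qquad R^2 = 1 - \frac{\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2}{\sum_{i=1}^{n}\left(y_i - \bar{y}\right)^2}$$

where \(\hat{y}_i\) are the model's predictions and \(\bar{y}\) is the mean of the true labels, so \(R^2 = 1\) is a perfect fit and \(R^2 = 0\) means the model does no better than always predicting the mean.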

Root Cause Analysis

In a company setting, we often have to deal with millions upon millions of data entries stored in company databases. A client may, for example, want to understand the factors that caused a sales drop in a particular season, or some similar scenario. We analyze and find the root cause of such problems using Root Cause Analysis (RCA).

RCA is a collective term for the techniques and tools that help us analyze the root cause of a problem or issue.

Techniques include:

  1. Fishbone/Ishikawa diagram

    Also known as the cause-and-effect diagram. Here we write the main problem at the head of the "fish" and add category bones branching from the spine based on factors like:

    • Man (People: sales team issues, training gaps)

    • Machine (Tech: CRM breakdowns, delays in data sync)

    • Method (Process: incorrect pricing logic, discounting errors)

    • Material (Product: inventory issues, defects)

    • Measurement (KPIs: misleading forecasts)

    • Environment (External: seasonal demand, competitor activity)

    We then brainstorm further causes under each of these categories.

  2. 5 Whys Analysis

    A simple iterative technique where you keep asking “Why?” until you reach the root cause.

    For example:

    • Let's say sales dropped in the North region.

    • Why? Orders were low.

    • Why? Distributors didn’t place stock.

    • Why? They had unsold inventory.

    • Why? The product packaging was changed.

    • Why? Marketing wasn’t aligned with the product team.

Root Cause: Lack of internal communication between teams.

  3. Change Analysis

    Here, we compare the baseline performance with the performance when the problem occurs, then analyze which factors changed and how each change might have contributed to the problem.

    Questions like:

    • What changed?

    • Who changed it?

    • When and why was it changed?

are common.

  4. Barrier Analysis

    Focuses on what barriers (controls/safeguards) existed and why they failed or were absent. We first list the barriers that were in place to safeguard against the issue and should have prevented it, then examine whether each worked or why it failed.

    We then redesign or add barriers based on the gaps found.

  5. Pareto Analysis

    Pareto Analysis is a decision-making and prioritization technique based on the Pareto Principle—also known as the 80/20 Rule. It assumes that 80% of problems stem from 20% of causes.

    We make a Pareto chart, which contains bars for how often each cause occurs plus a cumulative percentage line; the steepest part of this curve points to the roughly 20% of causes responsible for most of the problem. A sketch of such a chart is given below.
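Here is a minimal sketch of building such a Pareto chart with pandas and matplotlib; the complaint categories and counts are made up for the demo:

# Pareto chart sketch: hypothetical complaint categories with made-up counts
import pandas as pd
import matplotlib.pyplot as plt

counts = pd.Series({"Pricing errors": 48, "Late delivery": 31,
                    "Packaging": 12, "Support delays": 6,
                    "Other": 3}).sort_values(ascending=False)

cum_pct = counts.cumsum() / counts.sum() * 100   # cumulative percentage line

fig, ax1 = plt.subplots()
ax1.bar(counts.index, counts.values, color="steelblue")
ax1.set_ylabel("Frequency")

ax2 = ax1.twinx()                                # second y-axis for the line
ax2.plot(range(len(cum_pct)), cum_pct.values, color="red", marker="o")
ax2.axhline(80, color="gray", linestyle="--")    # 80% reference line
ax2.set_ylabel("Cumulative %")

plt.title("Pareto Chart of Problem Causes")
plt.tight_layout()
plt.show()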
