Part 1: 11 Basic Machine Learning Techniques with Math Notation, Explained in a Friendly Way
1. Linear Regression
Linear Regression is one of the simplest techniques for predicting a continuous outcome by modeling the relationship between an independent variable and a dependent variable with a straight line.
Key Concept:
The goal of linear regression is to find the line that best fits the data, minimizing the difference between observed and predicted values. This line, or regression line, is represented by the equation: \( y = \beta_0 + \beta_1 x + \epsilon \)
How to Read: "y equals beta-zero plus beta-one times x plus epsilon."
Explanation of Notation:
\( y \) : The predicted or dependent variable (e.g., test score).
\( x \) : The independent variable (e.g., hours studied).
\( \beta_0 \) : The intercept, indicating the value of \( y \) when \( x = 0 \) (e.g., the score you’d get with zero study hours).
\( \beta_1 \) : The slope, showing how much \( y \) changes for each one-unit increase in \( x \) (e.g., how much each hour of study boosts the score).
\( \epsilon \) : The error term, accounting for the difference between the observed and predicted values.
How Linear Regression Works:
Fit the Line: The algorithm calculates \( \beta_0 \) and \( \beta_1 \) to minimize the squared differences between observed and predicted values.
Make Predictions: Using this best-fit line, predictions can be made for \( y \) based on new \( x \) values.
Real-Life Example and Interpretation:
Suppose you’re trying to model the relationship between hours studied and test scores, and the resulting regression equation is: \( y = 50 + 10x \)
Assume:
\( x \) (hours studied) ranges from 0 to 5.
\( y \) (test score) is on a scale of 0 to 100.
\( \beta_0 = 50 \) is your baseline score without any study time.
\( \beta_1 = 10 \) means each extra hour of study boosts the score by 10 points.
Output Interpretation:
In this example, a student who studies for 3 hours would have a predicted score of: \( y = 50 + 10 \cdot 3 = 80 \)
Friendly Explanation:
Imagine \( \beta_0 \) is like a “bonus” score for simply showing up to the test, and \( \beta_1 \) is your “study power-up,” adding 10 points for each additional hour of study. So, with 3 hours of studying, it’s like saying, “You’re predicted to score an 80—not bad for putting in some effort!”
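The fit-then-predict steps above can be sketched in a few lines of Python. This is a minimal illustration rather than a production implementation: the six data points are made up so that they lie exactly on the line \( y = 50 + 10x \), and the function name `fit_line` is hypothetical.

```python
def fit_line(xs, ys):
    """Ordinary least squares for one predictor: returns (beta_0, beta_1)."""
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    # Slope: co-variation of x and y divided by the variation of x.
    s_xy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    s_xx = sum((x - x_mean) ** 2 for x in xs)
    beta_1 = s_xy / s_xx
    # Intercept: the fitted line passes through the point of means.
    beta_0 = y_mean - beta_1 * x_mean
    return beta_0, beta_1

# Made-up data: hours studied and test scores lying exactly on y = 50 + 10x.
hours = [0, 1, 2, 3, 4, 5]
scores = [50, 60, 70, 80, 90, 100]

beta_0, beta_1 = fit_line(hours, scores)
predicted = beta_0 + beta_1 * 3  # prediction for 3 hours of study
print(beta_0, beta_1, predicted)
```

With real, noisy data the points would not sit exactly on a line, and the same formulas would return the line with the smallest sum of squared differences.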
2. Ridge Regression
Ridge Regression is a type of linear regression that adds regularization to the model to prevent overfitting. It’s particularly helpful when there are many predictors or when predictors are highly correlated, as it reduces the impact of any single predictor, making the model more stable.
Key Concept:
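Ridge regression keeps the linear prediction equation \( y = \beta_0 + \sum_{j=1}^p \beta_j x_j \) but changes how the coefficients are fitted: it adds a penalty term (regularization) to the loss that discourages large coefficients. The quantity being minimized is: \( \sum_{i=1}^n (y_i - \hat{y}_i)^2 + \lambda \sum_{j=1}^p \beta_j^2 \) , where \( \hat{y}_i \) is the model’s prediction for the \( i \)-th of \( n \) training examples.
How to Read: "The sum over all n training examples of the squared difference between y-sub-i and y-hat-sub-i, plus lambda times the sum from j equals one to p of beta-sub-j squared."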
Explanation of Notation:
\( y \) : The predicted outcome, such as a student’s final grade.
\( x \) : The independent variables, like study hours or number of practice tests.
\( \beta_0 \) : The intercept term, representing the starting value of \( y \) when all \( x \) values are zero (e.g., the baseline grade).
\( \beta_j \) : The coefficients for each predictor \( j \) (e.g., how each additional study hour impacts the final grade).
\( \lambda \) : The regularization parameter that controls the penalty on large coefficients (e.g., \( \lambda = 0.5 \) ).
\( p \) : The total number of predictors in the model.
How Ridge Regression Works:
Fit the Line with Regularization: The algorithm calculates \( \beta_0 \) and \( \beta_j \) values by minimizing the sum of squared residuals, plus the penalty term \( \lambda \sum_{j=1}^p \beta_j^2 \) .
Control Overfitting: By adjusting \( \lambda \) , the model shrinks large coefficients, reducing the risk of overfitting and making the model more robust when predictors are highly correlated.
Real-Life Example and Interpretation:
Suppose you’re predicting a student’s final grade based on three study habits:
\( x_1 \) = hours studied,
\( x_2 \) = number of practice tests,
\( x_3 \) = hours slept.
After fitting with the ridge penalty, the prediction equation might look like this: \( y = 30 + 5x_1 + 3x_2 + 2x_3 \)
Assume:
\( \lambda = 0.5 \) (a moderate penalty to shrink large coefficients).
Example values: \( x_1 = 4 \) hours of study, \( x_2 = 2 \) practice tests, \( x_3 = 7 \) hours of sleep.
Calculation Train of Thought:
Apply the Fitted Equation: The penalty is used only while fitting the coefficients, so a prediction uses the linear part alone: \( y = 30 + 5(4) + 3(2) + 2(7) \)
Calculate Each Term:
Prediction: \( 30 + 5(4) + 3(2) + 2(7) = 30 + 20 + 6 + 14 = 70 \) .
Penalty term (part of the training loss, not of the prediction): \( 0.5 \left(5^2 + 3^2 + 2^2\right) = 0.5 \times (25 + 9 + 4) = 0.5 \times 38 = 19 \) .
Combine Terms: The predicted grade is \( y = 70 \) . The penalty of 19 is added to the sum of squared residuals during training, which is what pushes the fitted coefficients toward smaller values.
Output Interpretation:
With these inputs, the predicted final grade is 70. The penalty term does not change this number directly. It influences predictions indirectly: a larger \( \lambda \) would have produced smaller fitted coefficients, and therefore a different equation.
Friendly Explanation:
Think of \( \lambda \) as a "balancing factor" reminding the model, "Don't let any one study habit (like hours studied) completely take over!" By applying this penalty, Ridge Regression gives a balanced view, considering all habits while preventing any one from dominating the prediction.
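How the penalty shrinks a coefficient can be sketched for the simplest case of a single predictor, where the ridge solution has a closed form. This is an illustrative sketch under stated assumptions: the data are made up to lie on \( y = 50 + 10x \), the intercept is left unpenalized, and `fit_ridge_1d` is a hypothetical helper name.

```python
def fit_ridge_1d(xs, ys, lam):
    """Ridge fit for one predictor with an unpenalized intercept.

    Minimizes sum((y - b0 - b1 * x)^2) + lam * b1^2.
    """
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    s_xy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    s_xx = sum((x - x_mean) ** 2 for x in xs)
    # The penalty adds lam to the denominator, pulling the slope toward zero.
    beta_1 = s_xy / (s_xx + lam)
    beta_0 = y_mean - beta_1 * x_mean
    return beta_0, beta_1

# Made-up data lying exactly on y = 50 + 10x.
hours = [0, 1, 2, 3, 4, 5]
scores = [50, 60, 70, 80, 90, 100]

for lam in (0.0, 0.5, 17.5):
    b0, b1 = fit_ridge_1d(hours, scores, lam)
    print(f"lambda={lam}: intercept={b0:.2f}, slope={b1:.2f}")
```

With \( \lambda = 0 \) the slope is the ordinary least squares value of 10. As \( \lambda \) grows the slope shrinks but, unlike Lasso, it never reaches exactly zero.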
3. Lasso Regression
Lasso Regression (Least Absolute Shrinkage and Selection Operator) is a type of linear regression that adds regularization to prevent overfitting. Unlike Ridge Regression, which penalizes large coefficients by shrinking them, Lasso Regression can shrink some coefficients all the way to zero, effectively selecting the most important predictors and discarding the rest.
Key Concept:
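Lasso regression keeps the linear prediction equation \( y = \beta_0 + \sum_{j=1}^p \beta_j x_j \) but adds an absolute value penalty term to the loss that is minimized during fitting: \( \sum_{i=1}^n (y_i - \hat{y}_i)^2 + \lambda \sum_{j=1}^p |\beta_j| \) , where \( \hat{y}_i \) is the model’s prediction for the \( i \)-th of \( n \) training examples.
How to Read: "The sum over all n training examples of the squared difference between y-sub-i and y-hat-sub-i, plus lambda times the sum from j equals one to p of the absolute value of beta-sub-j."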
Explanation of Notation:
\( y \) : The predicted outcome, such as a student’s test score.
\( x \) : The independent variables, like hours studied or number of practice tests.
\( \beta_0 \) : The intercept term, representing the starting value of \( y \) when all \( x \) values are zero.
\( \beta_j \) : The coefficients for each predictor \( j \) (e.g., the effect each study habit has on the test score).
\( \lambda \) : The regularization parameter (penalty strength) that controls the degree of shrinkage. Higher \( \lambda \) values can shrink some coefficients to zero.
\( p \) : The number of predictors in the model.
How Lasso Regression Works:
Fit the Line with Regularization: The algorithm calculates \( \beta_0 \) and \( \beta_j \) values while minimizing the sum of squared residuals, plus the penalty term \( \lambda \sum_{j=1}^p |\beta_j| \) .
Feature Selection: With large enough \( \lambda \) , some coefficients \( \beta_j \) shrink to zero, removing less important predictors from the model and simplifying it.
Real-Life Example and Interpretation:
Suppose you’re predicting a student’s final grade based on study habits:
\( x_1 \) = hours studied,
\( x_2 \) = number of practice tests,
\( x_3 \) = hours of sleep,
\( x_4 \) = participation in study groups.
After fitting with the lasso penalty, the prediction equation might look like this: \( y = 40 + 8x_1 + 3x_2 + 0x_3 + 5x_4 \)
Assume:
\( \lambda = 1 \) , a regularization parameter that can shrink less important coefficients to zero.
Example values: \( x_1 = 3 \) hours of study, \( x_2 = 2 \) practice tests, \( x_4 = 1 \) study group session ( \( x_3 = 0 \) ).
Calculation Train of Thought:
Apply the Fitted Equation: The penalty is used only while fitting the coefficients, so a prediction uses the linear part alone: \( y = 40 + 8(3) + 3(2) + 0(0) + 5(1) \)
Calculate Each Term:
Prediction: \( 40 + 8(3) + 3(2) + 0 + 5(1) = 40 + 24 + 6 + 5 = 75 \) .
Penalty term (part of the training loss, not of the prediction): \( 1 \times (|8| + |3| + |0| + |5|) = 1 \times (8 + 3 + 0 + 5) = 16 \) .
Combine Terms: The predicted grade is \( y = 75 \) . The penalty of 16 is added to the sum of squared residuals during training, which is what pushes weak coefficients toward zero.
Output Interpretation:
With these inputs, the predicted final grade is 75. During fitting, the Lasso penalty pushed the coefficient for “hours of sleep” to zero, meaning that this predictor is not considered important in predicting the final grade. This is Lasso Regression’s unique ability to simplify models by excluding irrelevant features.
Friendly Explanation:
Think of \( \lambda \) as a “simplicity filter,” encouraging the model to focus only on study habits that strongly affect grades. It’s like saying, “If sleep doesn’t make much difference here, let’s just ignore it!” This way, Lasso keeps the model simpler and more interpretable by cutting out less impactful factors.
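The way Lasso can set a coefficient to exactly zero can be sketched for a single predictor, where the solution is a "soft threshold" of the least squares quantities. This is a sketch under stated assumptions: the data are made up to lie on \( y = 50 + 10x \), the intercept is unpenalized, and the names `soft_threshold` and `fit_lasso_1d` are hypothetical.

```python
def soft_threshold(value, threshold):
    """Shrink value toward zero by threshold; return 0.0 if it is within the threshold."""
    if value > threshold:
        return value - threshold
    if value < -threshold:
        return value + threshold
    return 0.0

def fit_lasso_1d(xs, ys, lam):
    """Lasso fit for one predictor with an unpenalized intercept.

    Minimizes sum((y - b0 - b1 * x)^2) + lam * |b1|.
    """
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    s_xy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    s_xx = sum((x - x_mean) ** 2 for x in xs)
    # Setting the derivative of the objective to zero gives this closed form.
    beta_1 = soft_threshold(s_xy, lam / 2) / s_xx
    beta_0 = y_mean - beta_1 * x_mean
    return beta_0, beta_1

# Made-up data lying exactly on y = 50 + 10x.
hours = [0, 1, 2, 3, 4, 5]
scores = [50, 60, 70, 80, 90, 100]

for lam in (0.0, 70.0, 400.0):
    b0, b1 = fit_lasso_1d(hours, scores, lam)
    print(f"lambda={lam}: intercept={b0:.2f}, slope={b1:.2f}")
```

For a large enough \( \lambda \) the slope becomes exactly 0, which is the feature-selection behavior described above. With several predictors, practical solvers apply the same thresholding idea to one coefficient at a time.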
4. Logistic Regression
Logistic Regression is used for binary classification tasks, where the goal is to predict the probability of an outcome belonging to one of two classes (e.g., "pass" or "fail"). Instead of a straight line, Logistic Regression fits an "S"-shaped curve called the sigmoid function, which outputs probabilities between 0 and 1.
Key Concept:
Logistic Regression estimates the probability that a given input belongs to a particular class using the sigmoid function. The equation is: \( P(y = 1 | x) = \frac{1}{1 + e^{-(\beta_0 + \beta_1 x)}} \)
How to Read: "The probability that y equals 1 given x is one over one plus e to the power of negative the quantity beta-zero plus beta-one times x."
Explanation of Notation:
\( y \) : The binary outcome (e.g., "pass" = 1, "fail" = 0).
\( x \) : The predictor variable (e.g., hours studied).
\( \beta_0 \) : The intercept term, representing the baseline log-odds of the outcome when \( x = 0 \) .
\( \beta_1 \) : The coefficient showing the effect of \( x \) on the log-odds of the outcome.
\( e \) : Euler’s number, approximately equal to 2.718, used for exponential functions.
How Logistic Regression Works:
Fit the Sigmoid Curve: Logistic Regression calculates \( \beta_0 \) and \( \beta_1 \) to fit a sigmoid curve that best separates the two classes.
Output Probabilities: The model then uses this curve to predict the probability of the outcome for new inputs.
Real-Life Example and Interpretation:
Suppose you’re using Logistic Regression to predict whether a student will pass an exam based on hours studied:
\( x \) = hours studied.
\( \beta_0 = -2 \) represents a low baseline probability of passing with no study time (about 12%, since \( 1 / (1 + e^{2}) \approx 0.12 \) ).
\( \beta_1 = 0.8 \) means each additional hour of study increases the log-odds of passing by 0.8.
Let’s predict the probability of passing for a student who studied for 3 hours.
Calculation Train of Thought:
Apply the Logistic Formula:
\( P(y = 1 | x = 3) = \frac{1}{1 + e^{-(\beta_0 + \beta_1 \cdot x)}} \)
Substituting \( \beta_0 = -2 \) , \( \beta_1 = 0.8 \) , and \( x = 3 \) :
\( P(y = 1 | x = 3) = \frac{1}{1 + e^{-(-2 + 0.8 \cdot 3)}} \)
Calculate Inside the Exponent:
\( -2 + 0.8 \cdot 3 = -2 + 2.4 = 0.4 \)
So, \( e^{-0.4} \approx 0.67 \) .
Complete the Calculation:
\( P(y = 1 | x = 3) = \frac{1}{1 + 0.67} = \frac{1}{1.67} \approx 0.60 \)
Output Interpretation:
The model predicts a 60% probability of passing for a student who studied for 3 hours. Logistic Regression interprets this result in terms of probability, so with 3 hours of study, the student has more than a 50% chance of passing.
Friendly Explanation:
Imagine the sigmoid function as a “probability switch” that starts near 0 when study hours are low and rises steeply toward 1 as study hours increase. With Logistic Regression, each hour of study pushes the switch further toward “pass,” helping you gauge how likely it is to succeed based on preparation.
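The worked calculation can be checked with a short Python sketch. The coefficients \( \beta_0 = -2 \) and \( \beta_1 = 0.8 \) come from the example above; the function name `predict_pass_probability` is just an illustrative choice.

```python
import math

def predict_pass_probability(hours, beta_0=-2.0, beta_1=0.8):
    """Sigmoid of the linear score beta_0 + beta_1 * hours."""
    z = beta_0 + beta_1 * hours  # log-odds of passing
    return 1.0 / (1.0 + math.exp(-z))

for hours in (0, 3, 6):
    probability = predict_pass_probability(hours)
    print(f"{hours} hours -> P(pass) = {probability:.3f}")
```

The value for 3 hours is about 0.60, matching the hand calculation, and the probabilities rise with study time while always staying between 0 and 1.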
5. Gradient Descent
Gradient Descent is an optimization algorithm used to minimize functions, most commonly applied to finding the best parameters in machine learning models, like the coefficients in Linear or Logistic Regression. The goal of Gradient Descent is to iteratively adjust parameters in the direction that reduces the model’s error.
Key Concept:
Gradient Descent finds the minimum of a function by taking steps proportional to the negative of the gradient (slope) of the function at the current point. In simple terms, it’s like rolling downhill toward the lowest point on a curve.
For a function \( f(\theta) \) , Gradient Descent updates the parameters as follows: \( \theta := \theta - \alpha \nabla f(\theta) \)
How to Read: "Theta is updated to theta minus alpha times the gradient of f at theta."
Explanation of Notation:
\( \theta \) : The model parameters being optimized (e.g., \( \beta_0 \) and \( \beta_1 \) in regression).
\( \alpha \) : The learning rate, controlling the size of each step.
\( \nabla f(\theta) \) : The gradient of the function \( f \) with respect to \( \theta \) , indicating the slope or direction of steepest ascent.
How Gradient Descent Works:
Initialize Parameters: Start with an initial guess for the parameters \( \theta \) .
Calculate Gradient: Compute the gradient \( \nabla f(\theta) \) , which tells us the slope of the function at the current parameters.
Update Parameters: Adjust \( \theta \) by taking a small step in the opposite direction of the gradient, controlled by \( \alpha \) .
Repeat: Continue updating \( \theta \) until the algorithm converges (i.e., further updates make only a minimal change).
Real-Life Example and Interpretation:
Suppose you’re adjusting the position of a ball on a bumpy hill so that it reaches the lowest point (minimum). Each move is based on how steep the slope is under the ball:
If the slope is steep, you’ll take a bigger step (more confident that you’re going downhill).
If the slope is shallow, you’ll take a smaller step to avoid overshooting.
Assume:
\( \alpha = 0.1 \) , a moderate learning rate for adjusting the steps.
Starting at position \( \theta = 10 \) on the hill.
Calculation Train of Thought:
Compute Gradient at Current Position:
Let’s say the gradient \( \nabla f(\theta) = 4 \) at \( \theta = 10 \) .
Update the Position Using Gradient Descent Formula:
\( \theta := \theta - \alpha \nabla f(\theta) \)
Substituting values:
\( \theta := 10 - 0.1 \cdot 4 \)
\( \theta = 10 - 0.4 = 9.6 \)
Repeat Until Convergence:
Each update will bring \( \theta \) closer to the minimum, taking smaller steps as the slope becomes shallower.
Output Interpretation:
After multiple iterations, \( \theta \) will settle near the minimum of the function, similar to the ball eventually resting at the bottom of the hill.
Friendly Explanation:
Imagine Gradient Descent as walking downhill in the dark. You feel the slope beneath your feet, taking careful steps in the steepest downhill direction. The learning rate \( \alpha \) is like the size of your steps: big enough to move quickly, but small enough to avoid overshooting. Gradually, you’ll reach the bottom!
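The update rule can be tried out on a concrete function. The text does not say which function the hill represents, so this sketch assumes the hypothetical \( f(\theta) = 0.2\,\theta^2 \), chosen because its gradient \( 0.4\,\theta \) equals 4 at \( \theta = 10 \), as in the example. The stopping tolerance and step limit are also illustrative choices.

```python
def gradient(theta):
    """Gradient of the assumed function f(theta) = 0.2 * theta**2."""
    return 0.4 * theta

def gradient_descent(theta, alpha=0.1, tolerance=1e-6, max_steps=10_000):
    """Repeat theta := theta - alpha * gradient(theta) until updates become tiny."""
    for _ in range(max_steps):
        update = alpha * gradient(theta)
        theta = theta - update
        if abs(update) < tolerance:  # converged: further updates barely move theta
            break
    return theta

first_step = 10.0 - 0.1 * gradient(10.0)  # the single update worked out above
theta_min = gradient_descent(10.0)
print(first_step, theta_min)
```

The first update reproduces the 9.6 from the hand calculation, and the loop ends with \( \theta \) very close to 0, the minimum of the assumed function.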
6. Support Vector Machines (SVM)
Support Vector Machines (SVM) is a supervised learning algorithm used primarily for classification tasks. SVM aims to find the optimal boundary (or hyperplane) that best separates data points of different classes with the maximum margin. This makes it a powerful tool for distinguishing between two groups, even in high-dimensional spaces.
Key Concept:
In SVM, the algorithm finds the hyperplane that maximizes the margin between the closest points of each class, known as support vectors. The goal is to create as much separation as possible between classes, so new data points can be classified accurately.
For a linear SVM, the decision boundary is given by: \( f(x) = \textbf{w} \cdot \textbf{x} + b = 0 \)
How to Read: "f of x equals w dot x plus b equals zero."
Explanation of Notation:
\( \textbf{w} \) : The weight vector, representing the orientation of the hyperplane.
\( \textbf{x} \) : The input vector (data points being classified).
\( b \) : The bias term, controlling the offset of the hyperplane from the origin.
Support Vectors: Data points that are closest to the hyperplane, defining the margin width.
How SVM Works:
Find Support Vectors: SVM identifies the data points nearest to the decision boundary. These are the support vectors.
Maximize Margin: The algorithm maximizes the margin between support vectors of different classes, creating a wide “buffer zone” for classification.
Compute Decision Boundary: SVM calculates the optimal hyperplane \( f(x) = 0 \) , which best separates the classes.
Real-Life Example and Interpretation:
Imagine you’re organizing books by genre (e.g., fiction vs. non-fiction) on a shelf, and you want a clear separation between them. The boundary line between genres should have enough space so that each genre’s books are easily distinguishable from the other.
Assume:
Fiction books are marked as class +1.
Non-Fiction books are marked as class -1.
The SVM finds the boundary line \( f(x) = \textbf{w} \cdot \textbf{x} + b = 0 \) that best separates these genres, with support vectors defining the boundary.
Calculation Train of Thought:
Identify Closest Points: Find the closest fiction and non-fiction books to the boundary line.
Compute Hyperplane and Margin: SVM adjusts the hyperplane’s position and orientation to maximize the distance (margin) from these books to the boundary.
Classify New Books: Once the boundary is established, new books are classified based on their position relative to this line.
Output Interpretation:
After training, the SVM produces a decision boundary that clearly separates the classes with the widest possible margin. For any new book, the model simply checks which side of the boundary it falls on to classify it.
Friendly Explanation:
Think of SVM as a “genre divider” for your bookshelf. The support vectors are the books closest to the boundary, defining how wide your divider can be without touching either genre. This boundary helps make sure new books can be easily sorted by genre without any confusion.
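Once a linear SVM has been trained, classifying a new point only requires checking the sign of \( \textbf{w} \cdot \textbf{x} + b \). The sketch below shows that step alone. The values of `w` and `b` are hypothetical stand-ins for a trained model, each book is described by two made-up numeric features, and the margin formula \( 2 / \lVert \textbf{w} \rVert \) assumes the usual scaling in which support vectors satisfy \( |\textbf{w} \cdot \textbf{x} + b| = 1 \).

```python
import math

# Hypothetical parameters; in practice they come from training an SVM.
w = (3.0, 4.0)
b = -5.0

def decision_value(x):
    """Compute f(x) = w . x + b."""
    return sum(w_j * x_j for w_j, x_j in zip(w, x)) + b

def classify(x):
    """Return +1 (fiction) or -1 (non-fiction) depending on the side of the boundary."""
    return 1 if decision_value(x) >= 0 else -1

# Width of the margin between the lines w . x + b = +1 and w . x + b = -1.
margin_width = 2.0 / math.sqrt(sum(w_j ** 2 for w_j in w))

print(classify((2.0, 1.0)), classify((0.0, 0.5)), margin_width)
```

Training itself (finding the `w` and `b` that maximize the margin) is an optimization problem that this sketch does not attempt.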
7. K-Nearest Neighbors (KNN)
K-Nearest Neighbors (KNN) is a simple and intuitive supervised learning algorithm used for classification and regression tasks. KNN classifies a new data point by finding the “K” closest labeled points in the training set and assigning it the most common label among these neighbors.
Key Concept:
KNN assigns a label to a new data point by identifying its K nearest neighbors (the closest data points in the training set) and choosing the most common label among them.
To find the distance between a new data point \( x \) and a training point \( x_i \) , KNN often uses the Euclidean distance formula: \( d(x, x_i) = \sqrt{\sum_{j=1}^p (x_j - x_{ij})^2} \)
How to Read: "The distance between x and x-sub-i equals the square root of the sum of the squared differences for each feature j from 1 to p."
Explanation of Notation:
\( x \) : The new data point we want to classify (e.g., a customer who spends $8 per visit and visits twice a week).
\( x_i \) : A data point in the training set (e.g., an existing customer with known spending and visit frequency).
\( d(x, x_i) \) : The Euclidean distance between \( x \) and \( x_i \) , representing how “close” the new customer is to each existing customer.
\( p \) : The number of features in each data point (e.g., 2 features: average spending and visit frequency).
How KNN Works:
Calculate Distances: For the new data point, calculate the distance to each point in the training set.
Identify Nearest Neighbors: Select the K points with the smallest distances.
Classify Based on Majority Vote: For classification, the most common label among the K nearest neighbors is assigned to the new point.
Real-Life Example and Interpretation:
Imagine you’re trying to predict whether a customer will buy a certain type of coffee based on their behavior:
K = 3 (we’ll consider the 3 closest customers in terms of similarity).
Training data includes past customers labeled as “buy” or “not buy,” with information on average spending and frequency of visits.
Assume:
New customer’s data: Spends around $8 per visit and visits twice a week.
Three closest neighbors: Two labeled “buy” and one labeled “not buy.”
Calculation Train of Thought:
Compute Distance for Each Neighbor:
Using the Euclidean distance formula, calculate the distance between the new customer’s behavior \( (x_1, x_2) = (8, 2) \) and each of the three closest training customers.
For example, if the first closest neighbor has values \( (x_{1i}, x_{2i}) = (10, 3) \) (spends $10 per visit, visits 3 times a week), we calculate: \( d(x, x_i) = \sqrt{(x_1 - x_{1i})^2 + (x_2 - x_{2i})^2} = \sqrt{(8 - 10)^2 + (2 - 3)^2} \)
Breaking it down: \( d(x, x_i) = \sqrt{(-2)^2 + (-1)^2} = \sqrt{4 + 1} = \sqrt{5} \approx 2.24 \)
Repeat this for the other two neighbors to determine which are the closest.
Select K Nearest Neighbors:
After calculating distances, identify the 3 closest customers. Suppose the three neighbors have distances of approximately 2.24, 1.5, and 3.
Classify Based on Majority Vote:
Since 2 out of the 3 nearest customers purchased the coffee, the model predicts this customer is likely to buy it.
Output Interpretation:
The model predicts that the new customer will buy the coffee based on similar customers’ purchasing behavior. KNN essentially says, “If you’re like the customers who bought coffee, you’re likely to buy too!”
Friendly Explanation:
Think of KNN as getting advice from similar shoppers. If the new customer asks, “Would I probably buy this coffee?” KNN finds the three customers who shop most like them and “votes” based on their behavior. If most similar shoppers bought the coffee, it’s a good guess that this customer will too.
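The three KNN steps (compute distances, pick the K nearest, take a majority vote) can be sketched as follows. The training customers are made up for illustration; three of them were chosen so that their distances to the new customer \( (8, 2) \) are about 2.24, 1.5, and 3, as in the example.

```python
import math
from collections import Counter

def euclidean_distance(a, b):
    """Square root of the summed squared differences across features."""
    return math.sqrt(sum((a_j - b_j) ** 2 for a_j, b_j in zip(a, b)))

def knn_predict(training_data, new_point, k=3):
    """Label new_point by majority vote among its k nearest training points."""
    by_distance = sorted(
        training_data,
        key=lambda item: euclidean_distance(new_point, item[0]),
    )
    nearest_labels = [label for _, label in by_distance[:k]]
    return Counter(nearest_labels).most_common(1)[0][0]

# Made-up customers: (average spend per visit, visits per week) and a label.
training_data = [
    ((10.0, 3.0), "buy"),
    ((8.0, 3.5), "buy"),
    ((11.0, 2.0), "not buy"),
    ((15.0, 1.0), "not buy"),
    ((2.0, 6.0), "not buy"),
]

new_customer = (8.0, 2.0)
print(knn_predict(training_data, new_customer, k=3))
```

Because spending and visit frequency are on different scales, real applications usually rescale the features first so that one of them does not dominate the distance.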
8. Principal Component Analysis (PCA)
Principal Component Analysis (PCA) is an unsupervised learning technique used for dimensionality reduction. It transforms a dataset with many features into a smaller set of “principal components” that retain as much information as possible. PCA is commonly used to simplify datasets, making them easier to analyze and visualize.
Key Concept:
PCA finds the principal components, which are directions in the feature space along which the data varies the most. Each component is a linear combination of the original features and captures the maximum variance in the data. The goal is to project data onto fewer dimensions without losing significant information.
For a dataset with features \( X_1, X_2, \dots, X_p \) , PCA finds principal components \( Z_1, Z_2, \dots, Z_k \) where \( k < p \) .
Mathematical Representation:
The principal components are obtained by finding the eigenvectors and eigenvalues of the covariance matrix of the dataset. Each principal component \( Z_i \) is given by: \( Z_i = \textbf{w}_i \cdot \textbf{X} \)
How to Read: "Z-sub-i equals w-sub-i dot X."
Explanation of Notation:
\( Z_i \) : The \( i \)-th principal component.
\( \textbf{w}_i \) : The eigenvector corresponding to the \( i \)-th largest eigenvalue, representing the direction of greatest remaining variance.
\( \textbf{X} \) : The original feature vector for a data point.
How PCA Works:
Center the Data: Subtract the mean from each feature. When features are on very different scales, also divide each by its standard deviation to standardize it.
Calculate Covariance Matrix: Find the covariance matrix to understand relationships between features.
Compute Eigenvectors and Eigenvalues: Identify the eigenvectors and eigenvalues of the covariance matrix, which represent the directions and magnitudes of the principal components.
Project onto Principal Components: Transform the original data to a new basis defined by the top \( k \) eigenvectors (principal components).
Real-Life Example and Interpretation:
Suppose you have a customer dataset for a retail store with four features:
\( X_1 \) : Customer age,
\( X_2 \) : Annual income,
\( X_3 \) : Purchase frequency,
\( X_4 \) : Average amount spent per visit.
Using PCA, you want to reduce this dataset from 4 features to 2 principal components that capture the main trends in customer behavior.
Calculation Train of Thought:
Center the Data: Subtract the mean of each feature from the dataset to center it. For example, if the mean of annual income ( \( X_2 \) ) is $40,000, subtract 40,000 from each income entry.
Compute Covariance Matrix: Calculate the covariance matrix of the centered data to understand how features vary together. For instance, the covariance between purchase frequency ( \( X_3 \) ) and amount spent ( \( X_4 \) ) might reveal if frequent shoppers also tend to spend more.
Calculate Eigenvectors and Eigenvalues:
Find the eigenvectors ( \( \textbf{w}_1, \textbf{w}_2, \dots, \textbf{w}_p \) ) and eigenvalues of this covariance matrix.
Suppose \( \textbf{w}_1 \) , the first eigenvector, represents a strong pattern involving income and amount spent, indicating these are primary factors in customer behavior.
Select and Project:
Choose the eigenvectors with the largest eigenvalues (let’s say \( \textbf{w}_1 \) and \( \textbf{w}_2 \) ) to capture the maximum variance.
Project each original customer data point ( \( \textbf{X} \) ) onto these two eigenvectors, calculating each principal component as \( Z_1 = \textbf{w}_1 \cdot \textbf{X} \) and \( Z_2 = \textbf{w}_2 \cdot \textbf{X} \) .
Output Interpretation:
The output is a transformed dataset with only 2 dimensions ( \( Z_1 \) and \( Z_2 \) ), retaining the most essential information. This reduced dataset reveals the primary factors influencing customer behavior, such as income and spending patterns, and is now easier to analyze or visualize.
Friendly Explanation:
Think of PCA as condensing a complex dataset into a quick summary. For customer data, PCA might tell you that spending habits and income are the key trends, letting you focus on these without losing important insights.
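The PCA steps (center, covariance matrix, leading eigenvector, projection) can be sketched on a tiny dataset using only the Python standard library. Several things here are assumptions made for illustration: the four two-feature data points are made up, only the first principal component is computed, and the eigenvector is found with power iteration, which is one simple method among several.

```python
import math

def center(rows):
    """Subtract each feature's mean from every row."""
    n = len(rows)
    means = [sum(col) / n for col in zip(*rows)]
    return [[v - m for v, m in zip(row, means)] for row in rows]

def covariance_matrix(centered):
    """Sample covariance matrix of already-centered data."""
    n, p = len(centered), len(centered[0])
    return [
        [sum(row[i] * row[j] for row in centered) / (n - 1) for j in range(p)]
        for i in range(p)
    ]

def first_component(cov, iterations=100):
    """Power iteration: converges to the eigenvector with the largest eigenvalue."""
    p = len(cov)
    w = [1.0] + [0.0] * (p - 1)  # starting guess
    for _ in range(iterations):
        w = [sum(cov[i][j] * w[j] for j in range(p)) for i in range(p)]
        norm = math.sqrt(sum(v * v for v in w))
        w = [v / norm for v in w]
    # The eigenvalue is the variance of the data along w.
    eigenvalue = sum(w[i] * sum(cov[i][j] * w[j] for j in range(p)) for i in range(p))
    return w, eigenvalue

data = [[1.0, 1.0], [2.0, 3.0], [3.0, 2.0], [4.0, 4.0]]  # made-up points
centered = center(data)
cov = covariance_matrix(centered)
w1, variance_1 = first_component(cov)

# Project each centered point onto w1 to get Z_1.
z1 = [sum(w_j * x_j for w_j, x_j in zip(w1, row)) for row in centered]
explained = variance_1 / sum(cov[i][i] for i in range(len(cov)))
print(w1, variance_1, explained)
```

For this data the first component points along the diagonal and captures 90% of the total variance, so a single number per point retains most of the information in the two original features.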
9. Naive Bayes Classification
Naive Bayes Classification is a probabilistic algorithm used primarily for classification tasks. It applies Bayes’ Theorem with the assumption that features are independent of each other, hence the term “naive.” This algorithm is effective for tasks like spam detection, where it classifies messages based on the probabilities of words appearing in spam vs. non-spam emails.
Key Concept:
Naive Bayes uses Bayes' Theorem to calculate the probability of a data point belonging to each class, then selects the class with the highest probability. Bayes’ Theorem states: \( P(C | X) = \frac{P(X | C) \cdot P(C)}{P(X)} \)
How to Read: "The probability of C given X equals the probability of X given C times the probability of C divided by the probability of X."
Explanation of Notation:
\( P(C | X) \) : The probability that a data point belongs to class \( C \) given the features \( X \) .
\( P(X | C) \) : The probability of observing features \( X \) given that the data point belongs to class \( C \) .
\( P(C) \) : The prior probability of class \( C \) (e.g., the overall probability of spam).
\( P(X) \) : The probability of observing the features \( X \) , often ignored since it is the same across classes.
How Naive Bayes Works:
Calculate Priors: Estimate \( P(C) \) for each class using the training data.
Compute Conditional Probabilities: For each feature, calculate \( P(X_i | C) \) under the naive assumption that features are independent.
Apply Bayes’ Theorem: Compute \( P(C | X) \) for each class and choose the class with the highest probability.
Real-Life Example and Interpretation:
Imagine you’re classifying emails as spam or not spam based on words in the email. Suppose you observe the word “free” in a new email and want to predict if this email is spam.
Assume:
Classes: Spam (S) and Not Spam (NS).
Prior Probabilities: \( P(\text{S}) = 0.2 \) (20% of emails are spam), \( P(\text{NS}) = 0.8 \) .
Likelihoods:
\( P(\text{free} | \text{S}) = 0.7 \) : 70% of spam emails contain “free.”
\( P(\text{free} | \text{NS}) = 0.1 \) : 10% of non-spam emails contain “free.”
Calculation Train of Thought:
Calculate Posterior for Spam (S):
Using Bayes’ Theorem:
\( P(\text{S} | \text{free}) = \frac{P(\text{free} | \text{S}) \cdot P(\text{S})}{P(\text{free})} \)
Plugging in values:
\( P(\text{S} | \text{free}) = \frac{0.7 \cdot 0.2}{P(\text{free})} = \frac{0.14}{P(\text{free})} \)
Calculate Posterior for Not Spam (NS):
Similarly,
\( P(\text{NS} | \text{free}) = \frac{P(\text{free} | \text{NS}) \cdot P(\text{NS})}{P(\text{free})} \)
Plugging in values:
\( P(\text{NS} | \text{free}) = \frac{0.1 \cdot 0.8}{P(\text{free})} = \frac{0.08}{P(\text{free})} \)
Compare Posteriors:
Both posteriors share the denominator \( P(\text{free}) = 0.14 + 0.08 = 0.22 \) , so \( P(\text{S} | \text{free}) = 0.14 / 0.22 \approx 0.64 \) and \( P(\text{NS} | \text{free}) = 0.08 / 0.22 \approx 0.36 \) . Since \( P(\text{S} | \text{free}) > P(\text{NS} | \text{free}) \) , the model predicts that the email is more likely to be spam.
Output Interpretation:
Naive Bayes predicts that the email is spam based on the likelihood of seeing the word “free” in spam vs. non-spam emails. Despite its simplicity, Naive Bayes is often effective for text classification.
Friendly Explanation:
Think of Naive Bayes as a “word detector” that looks for suspicious keywords to decide if an email is spam. Here, the word “free” raises a red flag, tipping the probability in favor of spam. With several words, the model combines the evidence by multiplying each word’s likelihood together with the prior, then picks the category with the higher resulting probability.
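The posterior comparison for the single word “free” can be reproduced in a few lines. The priors and likelihoods are the ones from the example; the dictionary layout and variable names are just one illustrative way to organize them.

```python
# Values taken from the example above.
priors = {"spam": 0.2, "not spam": 0.8}
likelihood_free = {"spam": 0.7, "not spam": 0.1}  # P("free" | class)

# Numerator of Bayes' Theorem for each class: P(free | C) * P(C).
unnormalized = {c: likelihood_free[c] * priors[c] for c in priors}

# P(free) is the sum of the numerators, so the posteriors add up to 1.
evidence = sum(unnormalized.values())
posteriors = {c: value / evidence for c, value in unnormalized.items()}

prediction = max(posteriors, key=posteriors.get)
print(posteriors, prediction)
```

Dividing by the evidence does not change which class wins, which is why \( P(X) \) is often skipped when only the predicted label is needed.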
10. Decision Trees
Decision Trees are a supervised learning algorithm used for both classification and regression. A decision tree splits data into branches, like a flowchart, with each decision node representing a question about a feature and each leaf node providing a final output or class label. Decision Trees are popular for their interpretability, as they mimic human decision-making.
Key Concept:
In a decision tree, data is split at each node based on a feature that maximizes the separation between classes. The algorithm selects the best split by using measures like Gini impurity or entropy to evaluate how well a split divides the data.
For classification, the goal is to minimize impurity at each node. One common metric is Gini impurity: \( G = 1 - \sum_{i=1}^k p_i^2 \)
How to Read: "G equals 1 minus the sum from i equals 1 to k of p-sub-i squared."
Explanation of Notation:
\( G \) : Gini impurity, measuring the “impurity” or randomness of data at a node.
\( k \) : The number of classes.
\( p_i \) : The probability of a data point belonging to class \( i \) at the node.
How Decision Trees Work:
Select the Best Split: At each node, evaluate all features and choose the split that reduces impurity the most.
Divide the Data: Split data into subsets based on the chosen feature.
Repeat Recursively: Continue splitting each subset until reaching a stopping criterion (e.g., maximum depth, minimum impurity).
Classify New Data: Use the constructed tree to make predictions by following the path down the tree based on the features of the new data point.
Real-Life Example and Interpretation:
Suppose a Decision Tree is used to predict if a customer will buy or not buy a product based on:
Age (e.g., under 30 or over 30)
Income (e.g., high or low)
Has Children (e.g., yes or no)
The tree starts with a root node that splits based on a feature (say, age) and branches out depending on further conditions.
Assume:
Under 30 customers are more likely to buy.
High-income customers over 30 are also likely to buy.
Low-income customers over 30 who don’t have children are unlikely to buy.
Calculation Train of Thought:
Calculate Gini Impurity for Each Split:
Suppose at the root, 50% of customers are buyers and 50% are non-buyers.
The Gini impurity for this node is: \( G = 1 - (0.5^2 + 0.5^2) = 1 - (0.25 + 0.25) = 1 - 0.5 = 0.5 \)
Evaluate Split by Feature:
For example, splitting on age (under 30 or over 30) could create two subsets. Calculate the impurity for each subset and weigh it by the subset’s size. The feature that reduces impurity the most will be chosen for the split.
Continue Splitting Until Impurity is Minimized:
The tree continues branching until it reaches a stopping point (like pure leaf nodes or maximum depth), forming paths for each customer type.
Output Interpretation:
The tree provides a flowchart-like structure for classifying new customers. A new customer’s path down the tree (e.g., age over 30, high income, has children) determines the final prediction.
Friendly Explanation:
Think of a Decision Tree as a series of questions guiding you to an answer. Imagine a store manager deciding if a customer will buy something: if they’re under 30, they might immediately be considered likely buyers. If they’re over 30, income and family status could help refine the prediction. Each step in the tree “zooms in” on the customer’s characteristics until the manager can make a confident guess.
11. Random Forests
Random Forests is an ensemble learning algorithm that builds multiple decision trees and combines their outputs to make more accurate and stable predictions. By creating a “forest” of trees, Random Forests reduces the risk of overfitting associated with individual decision trees, making it highly effective for both classification and regression tasks.
Key Concept:
Random Forests builds a set of decision trees on randomly selected subsets of the data. Each tree in the forest makes a prediction, and the algorithm aggregates these predictions:
For classification, Random Forests takes a majority vote.
For regression, it averages the outputs of all trees.
How Random Forests Works:
Bootstrap Sampling: From the original dataset, create multiple subsets (samples) by randomly selecting data points with replacement.
Train Decision Trees: Build a decision tree on each sample. During training, at each split, only a random subset of features is considered.
Aggregate Predictions: For a new data point, each tree makes a prediction. For classification, the most common prediction among the trees becomes the final output; for regression, the average prediction is used.
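The two sources of randomness in the steps above, bootstrap sampling and random feature selection, can be sketched with Python's standard `random` module. The function names, the stand-in rows, and the choice of 2 features per split are illustrative assumptions.

```python
import random


def bootstrap_sample(rows, rng):
    """Draw len(rows) rows with replacement: some repeat, some are left out."""
    return [rng.choice(rows) for _ in rows]


def random_feature_subset(features, k, rng):
    """Pick k distinct features to consider as split candidates."""
    return rng.sample(features, k)


rng = random.Random(0)          # fixed seed so reruns are repeatable
rows = list(range(10))          # stand-in for 10 training rows
features = ["age", "income", "has_children"]

# One bootstrap sample per tree in a 5-tree forest.
samples = [bootstrap_sample(rows, rng) for _ in range(5)]

# In a full implementation this draw would be repeated at every split of every tree.
candidates = random_feature_subset(features, 2, rng)
print(samples[0], candidates)
```

Because each tree sees a slightly different sample and a different set of candidate features, the trees make different mistakes, which is what makes combining them worthwhile.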
Example Calculation for Classification:
Suppose Random Forests is used to predict if a customer will buy or not buy a product. We create 5 decision trees, and each tree votes based on its training sample.
Assume the following tree predictions for a new customer:
Tree 1: buy
Tree 2: not buy
Tree 3: buy
Tree 4: buy
Tree 5: not buy
Since buy has the majority vote, Random Forests would predict buy for this customer.
Mathematical Representation:
For a classification problem with trees \( T_1, T_2, \dots, T_n \) and a data point \( x \) , the Random Forest prediction \( y \) is: \( y = \text{majority\_vote}(T_1(x), T_2(x), \dots, T_n(x)) \)
For regression, the prediction \( y \) is the average: \( y = \frac{1}{n} \sum_{i=1}^n T_i(x) \)
How to Read: "y equals the majority vote of T-sub-1 of x, T-sub-2 of x, up to T-sub-n of x" for classification, or "y equals one over n times the sum from i equals 1 to n of T-sub-i of x" for regression.
Explanation of Notation:
\( T_i \) : The \( i \)-th decision tree in the forest.
\( x \) : The data point being classified or predicted.
\( n \) : The total number of decision trees in the forest.
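Both formulas translate almost directly into code. In this sketch each tree \( T_i \) is represented as a plain Python function; the constant "trees" below are stand-ins that simply return a fixed answer, and the regression outputs are made-up values.

```python
from collections import Counter
from statistics import mean


def forest_classify(trees, x):
    """Majority vote over T_1(x), ..., T_n(x)."""
    votes = [tree(x) for tree in trees]
    return Counter(votes).most_common(1)[0][0]


def forest_regress(trees, x):
    """Average of T_1(x), ..., T_n(x)."""
    return mean(tree(x) for tree in trees)


# Stand-in trees that reproduce the five buy / not buy votes from the example.
votes = ["buy", "not buy", "buy", "buy", "not buy"]
trees = [lambda x, v=v: v for v in votes]
print(forest_classify(trees, x=None))  # buy

# Stand-in regression trees with hypothetical outputs.
reg_trees = [lambda x, v=v: v for v in (10.0, 12.0, 14.0)]
print(forest_regress(reg_trees, x=None))  # 12.0
```

Note that `most_common` breaks ties by whichever class was seen first, so using an odd number of trees for a two-class problem avoids ambiguous votes.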
Real-Life Example and Interpretation:
Imagine a credit card company using Random Forests to predict if a new applicant is high risk or low risk for credit. Because each tree is trained on a different subset of applicant data, the forest sees a broader range of patterns in financial behavior, and the quirks and errors of any single tree tend to average out.
Assume:
Five trees are trained to predict risk level.
The following tree predictions are made for a new applicant:
Tree 1: high risk
Tree 2: low risk
Tree 3: low risk
Tree 4: high risk
Tree 5: low risk
Calculation Train of Thought:
Count Votes for Each Class:
High risk: 2 votes
Low risk: 3 votes
Aggregate Predictions:
Since low risk has the majority vote, the Random Forests model would classify this applicant as low risk.
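The vote count above can be reproduced with `collections.Counter`. The vote share printed at the end is an extra illustration, a rough indication of how strongly the trees agree rather than a calibrated probability.

```python
from collections import Counter

# The five tree predictions for the new applicant.
predictions = ["high risk", "low risk", "low risk", "high risk", "low risk"]

votes = Counter(predictions)
winner, count = votes.most_common(1)[0]

print(dict(votes))                       # {'high risk': 2, 'low risk': 3}
print(winner, count / len(predictions))  # low risk 0.6
```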
Output Interpretation:
The final output of Random Forests is a robust prediction that leverages multiple decision trees to improve accuracy. This approach reduces the risk of errors that might arise from relying on a single decision tree.
Friendly Explanation:
Think of Random Forests as consulting multiple advisors (trees) on a decision. Each advisor looks at different parts of the data and gives their vote. The Random Forests model then goes with the majority opinion, creating a well-rounded prediction that’s less likely to be swayed by any single opinion.
Written by