Day 11: Random Forests – Wisdom of the Crowd in Machine Learning

Saket Khopkar
5 min read

Welcome to Day 11 of your 30-Day AI Journey! After learning about Decision Trees, today we dive into Random Forests, one of the most powerful and widely used machine learning algorithms.

Short and Simple: A Random Forest is basically a bunch of Decision Trees working together.

💡
One tree might make a mistake, but if many trees vote together, we get a strong, accurate result. It's like asking 100 friends for advice instead of just one!

Suppose you're unsure whether to watch a movie.
You ask 5 friends:

  • Friend A: Says “Yes”

  • Friend B: Says “No”

  • Friend C: Says “Yes”

  • Friend D: Says “Yes”

  • Friend E: Says “No”

Majority says “Yes”, so you go watch the movie. Majority wins; that's basically how a Random Forest functions.
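
Just to make the voting idea concrete, here is a tiny sketch in plain Python (the votes are made up for illustration):

from collections import Counter

# Opinions from 5 friends (in a Random Forest, these would be the trees)
votes = ["Yes", "No", "Yes", "Yes", "No"]

# Tally the votes and pick the majority
decision, count = Counter(votes).most_common(1)[0]
print(f"Decision: {decision} ({count} out of {len(votes)} votes)")  # Decision: Yes (3 out of 5 votes)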


How a Random Forest Works

You will hear this term quite frequently now: “Bootstrap Sampling”. As we know, a Random Forest is made up of many Decision Trees, and bootstrap sampling is the first step in building the algorithm.

Step 1: Bootstrap Sampling

Bootstrap sampling means each tree gets its very own training data: we create multiple subsets of the original dataset by sampling with replacement. Instead of giving all trees the same data, we give each tree a random sample of it.

Think of it like giving 10 friends slightly different reviews about a phone. Each friend has their own opinion, based on their experience.

In this case, the following things might happen:

  • Some reviews repeat. (Multiple people may suggest the same thing)

  • Some are left out. (We might ignore someone’s suggestion)

So each friend (tree) has their own set of reviews to decide from.
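
Here is a minimal sketch of bootstrap sampling with NumPy (the list of reviews is invented for illustration); notice how some items repeat and some never appear:

import numpy as np

# Original "dataset": one review per person (made-up examples)
reviews = np.array(["great battery", "poor camera", "too pricey", "nice design", "fast charging"])

rng = np.random.default_rng(42)

# Sample with replacement: same size as the original, but items can repeat
# and some can be left out entirely -- one such sample per tree.
bootstrap_sample = rng.choice(reviews, size=len(reviews), replace=True)
print(bootstrap_sample)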

Step 2: Random Feature Selection

Each tree uses a random set of features at every split (makes trees different!).

Let us continue our example. Let’s say the phone has 5 features:

  • Battery

  • Camera

  • Storage

  • Price

  • Design

But each friend (tree) only looks at 2 or 3 random features, like Battery & Price or Camera & Design.
This keeps their opinions different and prevents them from copying each other.
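
As a rough sketch (using Python's random module just for illustration), picking a random subset of features could look like this; in scikit-learn, the max_features parameter controls how many features each split considers:

import random

features = ["Battery", "Camera", "Storage", "Price", "Design"]

random.seed(1)
# Each friend (tree) only gets to look at a random subset of features at a split
subset_for_this_split = random.sample(features, k=2)
print(subset_for_this_split)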

Step 3: Growing Trees

For each subset, grow a decision tree.

Using the info they got, each friend (tree) comes up with their own decision rules, like:

  • “If the price is low and battery is good → Buy it.”

  • “If the camera is poor → Skip it.”

These rules build a tree-like path from question to answer.
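
To see Steps 1-3 together, here is a minimal hand-rolled sketch using scikit-learn's DecisionTreeClassifier (the tiny Age/Salary data is made up; in practice, RandomForestClassifier does all of this for you internally):

import numpy as np
from sklearn.tree import DecisionTreeClassifier

# A tiny made-up dataset: [Age, Salary] -> Bought_Car
X = np.array([[22, 25000], [30, 60000], [40, 50000], [55, 85000], [18, 20000], [60, 95000]])
y = np.array([0, 1, 0, 1, 0, 1])

rng = np.random.default_rng(42)
trees = []
for _ in range(5):
    idx = rng.integers(0, len(X), size=len(X))        # bootstrap sample (with replacement)
    t = DecisionTreeClassifier(max_features="sqrt",   # random feature subset at each split
                               random_state=0)
    t.fit(X[idx], y[idx])                             # grow one tree on its own sample
    trees.append(t)

print(f"Grew {len(trees)} trees")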

Step 4: Prediction

Putting it simply:

Now all the trees (friends) give their final opinions.

  • If most say yes, the forest says yes.

  • If most say no, the forest says no.

The forest can handle classification as well as regression. For classification, we simply count the YES and NO votes and go with the majority. For regression (like predicting a price), we take the average of all the trees’ predictions.
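
As a quick sketch of the two ways of combining (the individual tree outputs below are made up), classification takes a majority vote while regression averages the numbers:

import numpy as np

# Made-up outputs from 5 trees
class_votes = np.array([1, 0, 1, 1, 0])                # classification: Buy (1) or Not Buy (0)
price_preds = np.array([10.2, 9.8, 10.5, 10.1, 9.9])   # regression: e.g. predicted prices

# Classification: the class with the most votes wins
forest_class = np.bincount(class_votes).argmax()

# Regression: average the predictions of all trees
forest_price = price_preds.mean()

print("Classification result:", forest_class)          # 1 -> Buy
print("Regression result:", round(float(forest_price), 2))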


Time to Code

Let us have a look at a practical example.

# Data handling and visualization
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Machine learning
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report

# Let's simulate some data (You can later replace this with your own CSV)
data = {
    'Age': [22, 30, 35, 40, 50, 60, 18, 27, 55, 45],
    'Salary': [25000, 60000, 70000, 50000, 80000, 95000, 20000, 48000, 85000, 72000],
    'Bought_Car': [0, 1, 1, 0, 1, 1, 0, 0, 1, 1]  # 1 = Bought, 0 = Didn't Buy
}

df = pd.DataFrame(data)
df.head()

As the data can be very big, I have used the head() function to show only the top rows.

# Split into features (X) and target (y)
X = df[['Age', 'Salary']]
y = df['Bought_Car']

# 70% training, 30% testing
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

print("Training Set Size:", X_train.shape)
print("Test Set Size:", X_test.shape)

# Create the model
rf = RandomForestClassifier(
    n_estimators=100,       # Number of trees in the forest
    max_depth=3,            # Limit depth to prevent overfitting
    criterion='gini',       # Split quality measure: 'gini' or 'entropy'
    random_state=42
)

# Train the model
rf.fit(X_train, y_train)
# Predict on test data
y_pred = rf.predict(X_test)

# Accuracy
print("Accuracy:", accuracy_score(y_test, y_pred))

# Confusion Matrix
conf_matrix = confusion_matrix(y_test, y_pred)
sns.heatmap(conf_matrix, annot=True, cmap='Blues', fmt='d')
plt.title("Confusion Matrix")
plt.xlabel("Predicted")
plt.ylabel("Actual")
plt.show()

# Classification Report
print("Classification Report:\n", classification_report(y_test, y_pred)

Again about 66% accuracy: 2 predictions correct, 1 wrong (our test set has only 3 samples, so treat this number as a rough indication).

# Get feature importances
importances = rf.feature_importances_
features = X.columns

# Visualize importance
sns.barplot(x=importances, y=features)
plt.title("Feature Importance in Random Forest")
plt.xlabel("Importance")
plt.ylabel("Features")
plt.show()

What should be the conclusion after seeing this graph? If Salary is more important, the model uses it more often to make decisions. This helps us understand what the model is learning from the data.

This step is completely optional, but you can visualize one tree from the forest to see how it works internally. Like this:

from sklearn import tree

# Visualize the first tree
plt.figure(figsize=(15,10))
tree.plot_tree(rf.estimators_[0],
               feature_names=X.columns,
               class_names=["Not Buy", "Buy"],
               filled=True)
plt.title("One Tree from the Random Forest")
plt.show()

Time to test:

# Let's test: A 33-year-old with 68k salary
sample = pd.DataFrame({'Age': [33], 'Salary': [68000]})
prediction = rf.predict(sample)
print("Prediction (1=Buy, 0=Not Buy):", prediction[0])

We get the answer as: Prediction (1=Buy, 0=Not Buy): 1

You may tweak the numbers in the above code and see the results for yourself. Here are some more parameters you can tune to experiment a bit:

  • n_estimators: Number of trees (more = better, but slower)

  • max_depth: Limits how deep each tree can go (prevents overfitting)

  • min_samples_split: Minimum samples needed to split a node

  • criterion: How trees decide splits ("gini" or "entropy")

  • random_state: For reproducible results
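
If you want to try these parameters systematically instead of one at a time, here is a small sketch using scikit-learn's GridSearchCV. It assumes the X_train and y_train from the code above; the grid values are just examples, and cv=2 is used only because our toy dataset is tiny:

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# A small grid over the parameters listed above (values are just examples)
param_grid = {
    'n_estimators': [50, 100, 200],
    'max_depth': [2, 3, 5],
    'min_samples_split': [2, 4],
    'criterion': ['gini', 'entropy'],
}

grid = GridSearchCV(RandomForestClassifier(random_state=42),
                    param_grid, cv=2, scoring='accuracy')
grid.fit(X_train, y_train)

print("Best parameters:", grid.best_params_)
print("Best cross-validated accuracy:", grid.best_score_)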

Closing Statement

Random Forest is robust, flexible, and handles overfitting better than single decision trees. It uses many different trees trained on random subsets of data and features. You can see which features are most important to the model.

You now have a full working implementation with training, testing, evaluation, and visualization covered in the above code. I would encourage you to play around with the parameters so you get a better feel for the concepts we studied in this blog post.

Great then, time to sign off for Day 11. Ciao!!

