From Branches to Forests: Decision Trees and Random Forests in Python

In an era where data guides our decisions, the ability to derive meaningful insights from complex datasets is crucial. Machine learning offers powerful tools to achieve this, and among these, Decision Trees and Random Forests stand out for their unique combination of interpretability and effectiveness. Whether it’s filtering spam emails, predicting stock movements, or diagnosing medical conditions, these algorithms provide structured and reliable decision-making mechanisms.
Decision Trees capture the essence of human decision-making by breaking down complex problems into a series of simpler, sequential choices. Their tree-like structure makes it easy to visualize and understand how a model arrives at a particular decision. However, while Decision Trees are intuitive, they often overfit the data, making them less reliable on unseen samples.
To overcome this limitation, Random Forests enter the scene. By combining the outputs of multiple Decision Trees, Random Forests harness the wisdom of crowds to produce more robust and accurate predictions. They reduce overfitting, improve generalization, and handle large datasets with ease. This blend of simplicity, interpretability, and performance makes them a popular choice in machine learning applications.
Mathematical Background of Decision Trees
Concept of Decision Trees
Decision Trees are hierarchical structures used for making decisions based on data features. Each node in a decision tree represents a test on a feature, and each branch represents an outcome of that test. The process can be summarized as:
Splitting Nodes: At each node, the dataset is split based on the value of a feature.
Leaf Nodes: These represent the final output:
For classification tasks, the leaf nodes represent class labels.
For regression tasks, the leaf nodes represent numerical values.
A decision tree works by repeatedly partitioning the dataset into subsets that are as homogeneous as possible (i.e., contain similar outcomes).
Information Gain and Entropy
Entropy
Entropy measures the impurity or uncertainty in a dataset. The higher the entropy, the more uncertain the dataset.
The Entropy \(H(S)\) for a dataset \(S\) is defined as:
$$H(S) = -\sum_{i=1}^n p_i \log_2 p_i$$
Where:
\(p_i\) is the probability of class \(i\).
\(n\) is the total number of classes.
Example:
Suppose a dataset \(S\) has 10 instances, where 6 are Positive (P) and 4 are Negative (N). The entropy of \(S\) is:
$$H(S) = -\left(\frac{6}{10} \log_2 \frac{6}{10} + \frac{4}{10} \log_2 \frac{4}{10}\right) $$
$$H(S) = -(0.6 \log_2 0.6 + 0.4 \log_2 0.4) \approx 0.97$$
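As a quick check, this value can be reproduced in Python. The short sketch below is purely illustrative (the entropy helper is defined here, not imported from any library) and computes \(H(S)\) from raw class counts:
import numpy as np
# Entropy H(S) from a list of class counts
def entropy(counts):
    counts = np.array(counts, dtype=float)
    p = counts / counts.sum()   # class probabilities p_i
    p = p[p > 0]                # skip empty classes (0 * log 0 is treated as 0)
    return -np.sum(p * np.log2(p))
print(entropy([6, 4]))  # ~0.97 for 6 Positive and 4 Negative instances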
Information Gain
Information Gain (IG) measures the reduction in entropy when a dataset \(S\) is split based on a feature \(A\). It helps in selecting the best feature to split the data.
The Information Gain \(IG(S, A)\) is defined as:
$$IG(S, A) = H(S) - \sum_{v \in \text{Values}(A)} \frac{|S_v|}{|S|} H(S_v)$$
Where:
\(H(S)\) is the entropy of the original dataset.
\(S_v\) is the subset of \(S\) where feature \(A\) has value \(v\).
\(\frac{|S_v|}{|S|}\) is the proportion of \(S\) belonging to \(S_v\).
\(H(S_v)\) is the entropy of the subset \(S_v\).
Example:
Suppose we have the following dataset with the feature "Weather" and the target "Play Tennis":
| Weather | Play Tennis |
| --- | --- |
| Sunny | Yes |
| Sunny | No |
| Overcast | Yes |
| Rainy | Yes |
| Rainy | No |
To compute the Information Gain for the "Weather" feature:
1. Calculate \(H(S)\) (the overall entropy).
2. Split by "Weather" and calculate \(H(S_v)\) for each subset.
3. Compute the Information Gain \(IG(S, \text{Weather})\).
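Working through these steps for the five rows above (3 Yes, 2 No overall):
$$H(S) = -\left(\frac{3}{5} \log_2 \frac{3}{5} + \frac{2}{5} \log_2 \frac{2}{5}\right) \approx 0.97$$
$$H(S_{\text{Sunny}}) = 1, \quad H(S_{\text{Overcast}}) = 0, \quad H(S_{\text{Rainy}}) = 1$$
$$IG(S, \text{Weather}) = 0.97 - \left(\frac{2}{5} \times 1 + \frac{1}{5} \times 0 + \frac{2}{5} \times 1\right) \approx 0.17$$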
Gini Index
The Gini Index (also known as Gini Impurity) is an alternative to entropy for measuring the impurity of a dataset. Lower Gini values indicate purer subsets.
The Gini Index \(G(S)\) is defined as:
$$G(S) = 1 - \sum_{i=1}^n p_i^2$$
Where:
- \(p_i\) is the probability of class \(i\).
Example:
Consider a dataset with 6 Positive (P) and 4 Negative (N) instances. The Gini Index is:
$$G(S) = 1 - \left(\left(\frac{6}{10}\right)^2 + \left(\frac{4}{10}\right)^2\right)$$
$$ G(S) = 1 - (0.36 + 0.16) = 0.48$$
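The same quick check works for the Gini Index. This sketch (again with an illustrative helper function, not a library call) reproduces the value above:
import numpy as np
# Gini Index G(S) from a list of class counts
def gini(counts):
    p = np.array(counts, dtype=float) / np.sum(counts)  # class probabilities p_i
    return 1 - np.sum(p ** 2)
print(gini([6, 4]))  # 0.48 for 6 Positive and 4 Negative instances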
Example Calculation
Let’s demonstrate how to calculate Entropy and Information Gain for a dataset.
Dataset: "Weather" and "Play Tennis"
| Weather | Play Tennis |
| --- | --- |
| Sunny | No |
| Sunny | No |
| Overcast | Yes |
| Rainy | Yes |
| Rainy | Yes |
| Rainy | No |
| Overcast | Yes |
| Sunny | Yes |
| Sunny | Yes |
| Rainy | No |
| Sunny | Yes |
| Overcast | Yes |
| Overcast | Yes |
| Rainy | No |
Calculate \(H(S)\) (overall entropy):
- 9 Yes and 5 No outcomes.
$$H(S) = -\left(\frac{9}{14} \log_2 \frac{9}{14} + \frac{5}{14} \log_2 \frac{5}{14}\right)$$
$$ H(S) \approx 0.94$$
- Split by "Weather" and compute the entropy for each subset.
Sunny: 2 No, 3 Yes \(H(S_{\text{Sunny}}) = -\left(\frac{2}{5} \log_2 \frac{2}{5} + \frac{3}{5} \log_2 \frac{3}{5}\right) \approx 0.97\)
Overcast: 4 Yes \(H(S_{\text{Overcast}}) = 0\) (pure subset)
Rainy: 3 No, 2 Yes \(H(S_{\text{Rainy}}) = -\left(\frac{3}{5} \log_2 \frac{3}{5} + \frac{2}{5} \log_2 \frac{2}{5}\right) \approx 0.97\)
- Calculate Information Gain for the "Weather" feature:
$$IG(S, \text{Weather}) = 0.94 - \left(\frac{5}{14} \times 0.97 + \frac{4}{14} \times 0 + \frac{5}{14} \times 0.97\right) $$
$$IG(S, \text{Weather}) \approx 0.25$$
This demonstrates that "Weather" provides a significant reduction in entropy and is a good feature for splitting.
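To connect the hand calculation to code, here is a small self-contained sketch (written just for this example; the entropy_from_labels helper is not a library function) that computes the same Information Gain with pandas:
import numpy as np
import pandas as pd
# Entropy of a Series of class labels
def entropy_from_labels(labels):
    p = labels.value_counts(normalize=True).to_numpy()
    return -np.sum(p * np.log2(p))
# The 14-row Weather / Play Tennis dataset from the table above
df = pd.DataFrame({
    "Weather": ["Sunny", "Sunny", "Overcast", "Rainy", "Rainy", "Rainy", "Overcast",
                "Sunny", "Sunny", "Rainy", "Sunny", "Overcast", "Overcast", "Rainy"],
    "PlayTennis": ["No", "No", "Yes", "Yes", "Yes", "No", "Yes",
                   "Yes", "Yes", "No", "Yes", "Yes", "Yes", "No"],
})
h_s = entropy_from_labels(df["PlayTennis"])  # overall entropy, ~0.94
# Weighted entropy of the subsets created by splitting on Weather
weighted = sum(
    len(subset) / len(df) * entropy_from_labels(subset["PlayTennis"])
    for _, subset in df.groupby("Weather")
)
print("IG(S, Weather) =", h_s - weighted)  # ~0.25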
Building Decision Trees in Python
a. Setting Up the Environment
First, ensure you have the necessary libraries installed. You can install them using pip:
pip install pandas numpy matplotlib seaborn scikit-learn
Next, import the required libraries:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, plot_tree
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report
b. Loading a Dataset
We'll use the Iris dataset for this example. It is a classic dataset for classification tasks.
# Load the Iris dataset
data = load_iris()
X = data.data # Feature matrix
y = data.target # Target vector
# Display dataset information
print("Feature Names:", data.feature_names)
print("Target Names:", data.target_names)
print("Shape of X:", X.shape)
print("Shape of y:", y.shape)
c. Training a Decision Tree
We'll split the data into training and testing sets, then train a Decision Tree classifier using the entropy criterion.
# Split data into training and testing sets (80% train, 20% test)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Create and train the Decision Tree classifier
dt = DecisionTreeClassifier(criterion='entropy', random_state=42)
dt.fit(X_train, y_train)
d. Model Evaluation
Evaluate the trained Decision Tree by predicting the test set and calculating key metrics:
# Make predictions on the test set
y_pred = dt.predict(X_test)
# Calculate and print accuracy
print("Accuracy:", accuracy_score(y_test, y_pred))
# Display confusion matrix
print("Confusion Matrix:")
print(confusion_matrix(y_test, y_pred))
# Display classification report
print("Classification Report:")
print(classification_report(y_test, y_pred))
e. Visualization of Decision Tree
Visualize the structure of the trained Decision Tree to understand the decision-making process.
# Plot the Decision Tree
plt.figure(figsize=(20, 10))
plot_tree(dt, feature_names=data.feature_names, class_names=data.target_names, filled=True)
plt.show()
f. Testing on a New Unseen Value
To understand how the Decision Tree behaves with new, unseen data, let's manually input a new sample and predict its class.
Prepare a New Sample:
A new sample should have the same number of features as the training data. For the Iris dataset, each sample has 4 features:
- Sepal length (cm)
- Sepal width (cm)
- Petal length (cm)
- Petal width (cm)
Predict the Class:
Use the trained Decision Tree to predict the class label for the new sample.
# Example of a new unseen sample (replace these values as desired)
new_sample = np.array([[5.1, 3.5, 1.4, 0.2]]) # Sample representing a setosa-like flower
# Predict the class for the new sample
predicted_class = dt.predict(new_sample)
# Display the prediction
print("Predicted Class:", data.target_names[predicted_class[0]])
Explanation of the Code:
- The new sample is provided as a 2D NumPy array (required by scikit-learn).
- The predict method returns the class index (e.g., 0, 1, or 2).
- The class index is mapped to the corresponding class name using data.target_names.
Example Output
Predicted Class: setosa
Sample Output
Accuracy: The fraction of test samples the model classifies correctly.
Confusion Matrix: Shows how well the model classifies each class.
Classification Report: Provides precision, recall, and F1-score for each class.
Decision Tree Plot: A visual diagram representing the decision-making process.
Introduction to Random Forests
What is a Random Forest?
A Random Forest is a powerful ensemble learning algorithm that constructs multiple decision trees during training and combines their outputs to make robust predictions. It addresses the limitations of individual decision trees, particularly their tendency to overfit the training data.
In essence:
Random Forest = An ensemble of decision trees trained on different subsets of data.
The predictions from all the trees are combined to form a final decision, which enhances generalization and reduces overfitting.
The idea behind Random Forests is that the combined "wisdom" of many trees leads to more accurate and stable predictions than relying on a single tree.
Mathematical Background
Bootstrap Aggregating (Bagging)
Bagging is the key principle behind Random Forests. It stands for Bootstrap Aggregating and involves the following steps:
Random Sampling with Replacement (Bootstrapping):
From the original dataset \(S\) of size \(n\), multiple bootstrap samples \(S_1, S_2, \dots, S_k\) are created, each of size \(n\).
Each bootstrap sample is drawn with replacement, meaning the same data point can appear multiple times in a single sample.
Training Multiple Decision Trees:
- A decision tree is trained independently on each bootstrap sample \(S_i\).
Combining the Outputs (Aggregation):
For classification tasks: The final prediction is determined by majority voting among the trees. \(\hat{y} = \text{mode}\left(y_1, y_2, \dots, y_k\right) \)
For regression tasks: The final prediction is the average of the predictions. \(\hat{y} = \frac{1}{k} \sum_{i=1}^k y_i\)
By aggregating multiple predictions, bagging reduces the variance of the model, leading to better generalization on unseen data.
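As a rough illustration of bagging itself (a minimal sketch, not how scikit-learn implements Random Forests internally; it reuses the X_train and y_train arrays from the Iris example above and assumes integer class labels), the bootstrap-and-vote idea can be written directly:
import numpy as np
from sklearn.tree import DecisionTreeClassifier
def bagging_predict(X_train, y_train, X_new, k=25, seed=0):
    # Train k trees on bootstrap samples and combine them by majority vote
    rng = np.random.default_rng(seed)
    n = len(X_train)
    votes = []
    for _ in range(k):
        idx = rng.integers(0, n, size=n)  # sample n rows with replacement
        tree = DecisionTreeClassifier(random_state=0).fit(X_train[idx], y_train[idx])
        votes.append(tree.predict(X_new))
    votes = np.array(votes)  # shape: (k, number of new samples)
    # Majority vote: most frequent class label in each column
    return np.apply_along_axis(lambda col: np.bincount(col).argmax(), 0, votes)
# Example: predictions for the Iris test set from the earlier split
# print(bagging_predict(X_train, y_train, X_test))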
Random Feature Selection
In addition to bootstrapping, Random Forests introduce randomness during tree construction through random feature selection. This further reduces overfitting by ensuring that the trees are decorrelated.
The process works as follows:
Feature Subsets at Each Split:
Instead of considering all available features at each node, a random subset of features \(m\) (where \(m < \text{total features}\)) is considered.
The best split is chosen based on this subset.
Why Random Feature Selection?
Helps in reducing the correlation between individual trees.
Increases diversity among the trees, which improves the overall ensemble performance.
Typical Values for \(m\):
For classification: \(m = \sqrt{d} \) (where \(d\) is the total number of features).
For regression: \(m = \frac{d}{3}\).
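In scikit-learn, these subset sizes correspond to the max_features parameter, which accepts "sqrt", "log2", an integer, or a float fraction of the features. For example:
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor
# Classification: consider sqrt(d) features at each split
clf = RandomForestClassifier(n_estimators=100, max_features="sqrt", random_state=42)
# Regression: consider roughly d/3 features at each split (given as a fraction)
reg = RandomForestRegressor(n_estimators=100, max_features=1/3, random_state=42)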
Implementing Random Forests in Python
In this section, we will implement a Random Forest Classifier using the Iris dataset. We'll go through the steps of training the model, evaluating it, displaying feature importance, and testing it on new, unseen data.
a. Training a Random Forest
Let's train a Random Forest classifier with 100 decision trees (n_estimators=100) on the Iris dataset.
# Import the RandomForestClassifier
from sklearn.ensemble import RandomForestClassifier
# Create the Random Forest classifier
rf = RandomForestClassifier(n_estimators=100, random_state=42)
# Train the Random Forest on the training data
rf.fit(X_train, y_train)
b. Model Evaluation
Evaluate the trained Random Forest on the test set using accuracy, confusion matrix, and a classification report.
# Make predictions on the test set
y_pred_rf = rf.predict(X_test)
# Print accuracy
print("Accuracy:", accuracy_score(y_test, y_pred_rf))
# Display confusion matrix
print("Confusion Matrix:")
print(confusion_matrix(y_test, y_pred_rf))
# Display classification report
print("Classification Report:")
print(classification_report(y_test, y_pred_rf))
Sample Output
Accuracy: 1.0
Confusion Matrix:
[[10 0 0]
[ 0 9 0]
[ 0 0 11]]
Classification Report:
precision recall f1-score support
0 1.00 1.00 1.00 10
1 1.00 1.00 1.00 9
2 1.00 1.00 1.00 11
accuracy 1.00 30
macro avg 1.00 1.00 1.00 30
weighted avg 1.00 1.00 1.00 30
c. Feature Importance
Random Forests can provide insights into which features are the most important for making decisions. Let’s visualize the feature importances.
# Get feature importances from the Random Forest model
importances = rf.feature_importances_
feature_names = data.feature_names
# Plot the feature importances
plt.figure(figsize=(10, 5))
sns.barplot(x=importances, y=feature_names)
plt.xlabel("Feature Importance")
plt.title("Feature Importance in Random Forest")
plt.show()
Explanation:
High Importance: Features that contribute the most to the model's predictions.
Low Importance: Features that have less impact on the decision-making process.
d. Predicting on New Unseen Data
Now, let's see how the trained Random Forest performs on new, unseen data.
Prepare a New Sample: Create a new sample with the same number of features as the Iris dataset.
Make a Prediction: Use the predict method to classify the new sample.
# New unseen sample (replace values as desired)
new_sample = np.array([[5.8, 2.7, 4.1, 1.0]]) # Example with 4 features
# Predict the class for the new sample
predicted_class_rf = rf.predict(new_sample)
# Display the predicted class
print("Predicted Class:", data.target_names[predicted_class_rf[0]])
Example Output:
Predicted Class: versicolor
Comparing Decision Trees and Random Forests
Pros and Cons
| Criteria | Decision Trees | Random Forests |
| --- | --- | --- |
| Simplicity | Simple, intuitive, and easy to interpret. | Complex due to the ensemble of multiple trees. |
| Interpretability | Highly interpretable; easy to visualize and explain. | Harder to interpret because of multiple trees. |
| Overfitting | Prone to overfitting, especially with deep trees. | Reduces overfitting by combining multiple trees. |
| Performance | Performs well on small or simple datasets. | Better performance on large, complex datasets. |
| Training Time | Faster to train. | Slower to train due to multiple trees. |
| Robustness | Sensitive to noise and changes in the data. | Robust to noise and outliers due to averaging predictions. |
| Generalization | May not generalize well on unseen data. | Generalizes better by reducing variance. |
When to Use Which
Use Decision Trees When:
You need a simple, quick, and interpretable model.
The dataset is small or the relationships between features are straightforward.
Interpretability is critical (e.g., when explaining decisions to non-technical stakeholders).
Use Random Forests When:
You need a model that offers higher accuracy and better generalization.
The dataset is large or has complex patterns.
Overfitting is a concern, and you need a model robust to noise and outliers.
Conclusion
Key Takeaways
In this blog, we've explored the foundations and practical implementation of Decision Trees and Random Forests in Python. Here's a recap of the key concepts:
Understanding Decision Trees:
How Decision Trees split data using metrics like Information Gain (based on Entropy) and Gini Index.
Their intuitive and interpretable structure, making them valuable for simple decision-making tasks.
Random Forests:
How Random Forests improve upon Decision Trees by employing Bootstrap Aggregating (Bagging) and Random Feature Selection to reduce overfitting and increase robustness.
Their ability to generalize better by combining multiple decision trees.
Practical Implementation:
Step-by-step implementation of both Decision Trees and Random Forests in Python.
Visualization of Decision Trees and understanding model evaluation through metrics like accuracy, confusion matrix, and classification report.
Next Steps
To deepen your understanding, consider the following:
Hyperparameter Tuning:
Experiment with hyperparameters like max_depth, min_samples_split, and n_estimators to optimize the performance of your models (a grid-search sketch follows the dataset-loading code below).
Explore More Datasets:
Try implementing Decision Trees and Random Forests on other popular datasets:
- California Housing Dataset (for Regression)
- Breast Cancer Dataset (for Classification)
Below is the Python code to load these datasets using scikit-learn:
Loading the California Housing Dataset (Regression)
from sklearn.datasets import fetch_california_housing
# Load the California Housing dataset
california_data = fetch_california_housing()
X = california_data.data
y = california_data.target
# Display dataset information
print("Feature Names:", california_data.feature_names)
print("Shape of X:", X.shape)
print("Shape of y:", y.shape)
Loading the Breast Cancer Dataset (Classification)
from sklearn.datasets import load_breast_cancer
# Load the Breast Cancer dataset
cancer_data = load_breast_cancer()
X = cancer_data.data
y = cancer_data.target
# Display dataset information
print("Feature Names:", cancer_data.feature_names)
print("Target Names:", cancer_data.target_names)
print("Shape of X:", X.shape)
print("Shape of y:", y.shape)
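For the hyperparameter tuning mentioned above, a small grid search is a natural starting point. The sketch below uses scikit-learn's GridSearchCV with an illustrative parameter grid (adjust the values for your own data) and reuses the Iris training split from the earlier sections:
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV
# Illustrative grid of candidate hyperparameter values
param_grid = {
    "n_estimators": [50, 100, 200],
    "max_depth": [None, 3, 5, 10],
    "min_samples_split": [2, 5, 10],
}
grid = GridSearchCV(
    RandomForestClassifier(random_state=42),
    param_grid,
    cv=5,                 # 5-fold cross-validation
    scoring="accuracy",
)
grid.fit(X_train, y_train)  # the Iris training split from earlier
print("Best parameters:", grid.best_params_)
print("Best cross-validated accuracy:", grid.best_score_)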
Final Thoughts
By mastering Decision Trees and Random Forests, you're equipped with powerful tools to tackle both classification and regression problems. Keep experimenting, visualizing, and fine-tuning your models to understand the nuances of machine learning better. Happy coding!