Implementing Random Forest Using Scikit-learn


The Random Forest algorithm is an ensemble learning method primarily used for classification and regression tasks. It operates by constructing a multitude of decision trees during training and outputs the mode of the classes (for classification) or the mean prediction (for regression) of the individual trees.
The process begins by creating various subsets of the training data through a technique known as bootstrap sampling, where random samples (with replacement) are drawn from the dataset. For each of these subsets, a decision tree is built. Unlike a standard decision tree that considers all features when making splits, Random Forest introduces an additional layer of randomness by only selecting a random subset of features at each split, which helps to enhance the diversity among the trees.
This diversity among the trees reduces the risk of overfitting, which is a common problem in single decision trees. After all trees are constructed, the final output is determined through majority voting for classification tasks or averaging for regression tasks. Random Forest is valued for its robustness, high accuracy, and ability to handle large, high-dimensional datasets while maintaining computational efficiency. Additionally, it provides insights into feature importance, allowing for better understanding and interpretability of the model's predictions. To better understand Random Forest, we first need to understand the Decision Tree algorithm.
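To make the bootstrap-plus-random-feature-subset idea concrete, here is a minimal, self-contained sketch of a "toy forest" built by hand on synthetic data. This is not the implementation used later in the article; the dataset, the 25 trees, and max_features='sqrt' are purely illustrative choices.

import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Toy data: 500 samples, 10 features (values chosen only for illustration).
X, y = make_classification(n_samples=500, n_features=10, random_state=0)

rng = np.random.default_rng(0)
trees = []
for _ in range(25):  # a small toy forest of 25 trees
    # Bootstrap sampling: draw rows with replacement.
    rows = rng.integers(0, len(X), size=len(X))
    # max_features='sqrt' makes each split consider only a random subset of features.
    tree = DecisionTreeClassifier(max_features='sqrt',
                                  random_state=int(rng.integers(1_000_000)))
    tree.fit(X[rows], y[rows])
    trees.append(tree)

# Majority voting across the trees (classification).
votes = np.stack([t.predict(X) for t in trees])
forest_pred = (votes.mean(axis=0) >= 0.5).astype(int)
print("toy forest training accuracy:", (forest_pred == y).mean())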
The Decision Tree Algorithm in Machine Learning
The Decision Tree algorithm is a popular machine learning method used for classification and regression tasks. It models decisions in a tree-like structure, where each internal node represents a feature (or attribute), each branch corresponds to a decision rule, and each leaf node represents an outcome (or class label). The decision tree is constructed through an iterative process where the goal is to partition the input space in a way that maximizes the homogeneity of the resulting subsets.
Key concepts involved in building a decision tree include:
Entropy: Entropy is a measure of the disorder or uncertainty in a set of data. In the context of decision trees, it quantifies the impurity or randomness of the class labels in a dataset. The entropy \( H(S) \) for a set \( S \) with class labels is given by:
\( H(S) = - \sum_{i=1}^{c} p_i \log_2 p_i \)
where \( p_i \) is the proportion of class \( i \) in the dataset and \( c \) is the total number of classes. Lower entropy indicates that the data is more pure and homogeneous, while higher entropy signifies more mixed data.
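As a quick illustration, the entropy formula translates directly into a small Python function (an illustrative sketch, not part of the pipeline built later):

import numpy as np

def entropy(labels):
    # Shannon entropy H(S) of a collection of class labels, in bits.
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

print(entropy(['Yes'] * 6 + ['No'] * 4))  # ~0.971 for 6 Yes / 4 No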
Information Gain: Information Gain measures the reduction in entropy after a dataset is split based on a particular feature. It helps to identify which feature best separates the classes. Informally, it is the entropy of the parent minus the weighted average entropy of the children. The Information Gain \( IG(S, A) \) for a feature \( A \) is calculated as:
\( IG(S, A) = H(S) - \sum_{v \in Values(A)} \frac{|S_v|}{|S|} H(S_v) \)
where \( S_v \) is the subset of \( S \) where feature \( A \) takes on value \( v \). The feature with the highest Information Gain is chosen for the split, as it provides the clearest separation of classes.
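The same definition can be written as a short helper function (an illustrative sketch; feature values and labels are passed as plain lists or arrays):

import numpy as np

def entropy(labels):
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def information_gain(feature_values, labels):
    # IG(S, A) = H(S) minus the weighted average of H(S_v) over the values v of A.
    feature_values, labels = np.asarray(feature_values), np.asarray(labels)
    weighted = sum(
        (feature_values == v).mean() * entropy(labels[feature_values == v])
        for v in np.unique(feature_values))
    return entropy(labels) - weighted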
Gini Impurity: Gini Impurity is an alternative metric for measuring the quality of a split in a decision tree. It assesses how often a randomly chosen element would be incorrectly labeled if it were labeled at random according to the distribution of labels in the subset. The Gini Impurity \( Gini(S) \) for a set \( S \) is calculated as:
\(Gini(S) = 1 - \sum_{i=1}^{c} p_i^2 \)
Like entropy, a lower Gini Impurity indicates a more homogeneous subset. Decision trees can be constructed using Gini Impurity as the criterion for splitting nodes, typically yielding faster results compared to entropy.
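Gini Impurity is just as compact in code (again, an illustrative sketch):

import numpy as np

def gini(labels):
    # Gini impurity 1 - sum(p_i^2) of a collection of class labels.
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

print(gini(['Yes'] * 6 + ['No'] * 4))  # 1 - (0.6**2 + 0.4**2) = 0.48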
Decision Trees utilize Entropy or Gini Impurity as criteria to decide on the best features to split the dataset, aiming to create a model that accurately represents the underlying patterns of the data while fostering interpretability and ease of use.
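In Scikit-learn, this choice is exposed through the criterion parameter of DecisionTreeClassifier. A minimal example on the built-in iris dataset (chosen here purely for illustration):

from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# 'gini' is the default splitting criterion; 'entropy' uses information gain instead.
tree_gini = DecisionTreeClassifier(criterion='gini', random_state=0).fit(X, y)
tree_entropy = DecisionTreeClassifier(criterion='entropy', random_state=0).fit(X, y)
print(tree_gini.get_depth(), tree_entropy.get_depth())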
Step-by-Step Calculation of Information Gain
Calculate the Entropy of the Whole Dataset: Let's say we have a dataset of 10 instances with the following classes: 6 positive instances (Yes) and 4 negative instances (No). The formula for entropy (H) is given by:
\( H(S) = - \sum (p_i \cdot \log_2(p_i)) \)
Here, \( p_i \) is the proportion of each class in the dataset.
For our dataset:
Proportion of Yes: \( p_Y = 6/10 = 0.6 \)
Proportion of No: \( p_N = 4/10 = 0.4 \)
Substituting these values into the entropy formula:
\(H(S) = - (0.6 \cdot \log_2(0.6) + 0.4 \cdot \log_2(0.4)) \approx 0.971 \)
Split the Dataset based on a Feature: Suppose we have a feature called "Weather" with three possible outcomes: Sunny, Rainy, and Overcast.
Let's say it splits our dataset into:
Sunny: 3 Yes, 1 No (4 instances)
Rainy: 2 Yes, 2 No (4 instances)
Overcast: 1 Yes, 1 No (2 instances)
Calculate the Entropy for Each Subset: For each subset, we calculate the entropy.
Sunny:
Proportion of Yes: \( p_Y = 3/4 = 0.75 \)
Proportion of No: \( p_N = 1/4 = 0.25 \)
\( H(Sunny) = - (0.75 \cdot \log_2(0.75) + 0.25 \cdot \log_2(0.25)) \approx 0.811 \)
Rainy:
Proportion of Yes: \( p_Y = 2/4 = 0.5 \)
Proportion of No: \( p_N = 2/4 = 0.5 \)
\( H(Rainy) = - (0.5 \cdot \log_2(0.5) + 0.5 \cdot \log_2(0.5)) = 1.0 \)
Overcast:
Proportion of Yes: \( p_Y = 1/2 = 0.5 \)
Proportion of No: \( p_N = 1/2 = 0.5 \)
\( H(Overcast) = 1.0 \)
Calculate the Weighted Average Entropy of Subsets: Now, we need to find the weighted average entropy based on the size of each subset:
\( H(Feature) = \frac{4}{10} \cdot H(Sunny) + \frac{4}{10} \cdot H(Rainy) + \frac{2}{10} \cdot H(Overcast) = \frac{4}{10} \cdot 0.811 + \frac{4}{10} \cdot 1.0 + \frac{2}{10} \cdot 1.0 \approx 0.925 \)
Calculate Information Gain: Finally, we compute the Information Gain by subtracting the weighted average entropy of the feature from the original entropy:
\( IG = H(S) - H(Feature) = 0.971 - 0.925 \approx 0.046 \)
The Information Gain tells us how much information about the classification of the dataset is provided by the "Weather" feature. In this case, an Information Gain of approximately 0.046 indicates that the "Weather" feature provides some useful information about the class labels. The IG is calculated for every candidate feature, and the feature with the highest gain is chosen as the root of the tree (and, recursively, for each subsequent split).
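We can sanity-check the worked example numerically with a few lines of Python (the Yes/No counts mirror the example above):

import numpy as np

def H(pos, neg):
    # Entropy of a two-class subset given its Yes/No counts.
    p = np.array([pos, neg]) / (pos + neg)
    return -np.sum(p * np.log2(p))

H_S = H(6, 4)                                              # whole dataset
H_weather = (4/10) * H(3, 1) + (4/10) * H(2, 2) + (2/10) * H(1, 1)
print(round(H_S, 3), round(H_weather, 3), round(H_S - H_weather, 3))
# prints: 0.971 0.925 0.046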
Now let's implement Random Forest on a sample dataset using the Scikit-learn library.
Problem Domain
To address the challenge of predicting whether the price of a New York City Airbnb listing will be above or below the average price, we utilize a tabular dataset containing information about various Airbnb listings in the city. The dataset is available to download from
https://www.kaggle.com/datasets/dgomonov/new-york-city-airbnb-open-data
Or it can be loaded from
https://raw.githubusercontent.com/lmassaron/tabular_datasets/master/AB_NYC_2019.csv
Let's load all the required libraries.
import numpy as np
import pandas as pd
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score, make_scorer
from sklearn.pipeline import Pipeline
from sklearn.model_selection import KFold, cross_validate
from sklearn.ensemble import RandomForestClassifier
Now let's load the data and take a look at the first few rows.
data = pd.read_csv("https://raw.githubusercontent.com/lmassaron/tabular_datasets/master/AB_NYC_2019.csv")
data.head()
index | id | name | host_id | host_name | neighbourhood_group | neighbourhood | latitude | longitude | room_type | price | minimum_nights | number_of_reviews | last_review | reviews_per_month | calculated_host_listings_count | availability_365 |
0 | 2539 | Clean & quiet apt home by the park | 2787 | John | Brooklyn | Kensington | 40.64749 | -73.97237 | Private room | 149 | 1 | 9 | 2018-10-19 | 0.21 | 6 | 365 |
1 | 2595 | Skylit Midtown Castle | 2845 | Jennifer | Manhattan | Midtown | 40.75362 | -73.98377 | Entire home/apt | 225 | 1 | 45 | 2019-05-21 | 0.38 | 2 | 355 |
2 | 3647 | THE VILLAGE OF HARLEM....NEW YORK ! | 4632 | Elisabeth | Manhattan | Harlem | 40.80902 | -73.9419 | Private room | 150 | 3 | 0 | NaN | NaN | 1 | 365 |
3 | 3831 | Cozy Entire Floor of Brownstone | 4869 | LisaRoxanne | Brooklyn | Clinton Hill | 40.68514 | -73.95976 | Entire home/apt | 89 | 1 | 270 | 2019-07-05 | 4.64 | 1 | 194 |
4 | 5022 | Entire Apt: Spacious Studio/Loft by central park | 7192 | Laura | Manhattan | East Harlem | 40.79851 | -73.94399 | Entire home/apt | 80 | 10 | 9 | 2018-11-19 | 0.1 | 1 | 0 |
Next, we define a list of features to be excluded from the analysis, such as unique identifiers and free-text features:
excluding_list = ['price', 'id', 'latitude', 'longitude', 'host_id',
'last_review', 'name', 'host_name']
Let's take a look at the categorical features and how many unique values each one has.
categorical = ['neighbourhood_group', 'neighbourhood', 'room_type']
data[categorical].nunique()
neighbourhood_group 5
neighbourhood 221
room_type 3
dtype: int64
One-hot encoding all categorical features would create many columns filled mostly with zeros, which can cause problems during model training. So we split them into two lists: low-cardinality categorical features to be one-hot encoded and high-cardinality categorical features to be ordinally encoded. 'low_card_categorical' is the subset of categorical features with few unique values, which will be one-hot encoded; 'high_card_categorical' is the subset with many unique values, which will be encoded using an ordinal encoding.
low_card_categorical = ['neighbourhood_group', 'room_type']
high_card_categorical = ['neighbourhood']
'continuous' is a list of the continuous numerical features; these are kept as numbers and passed through to the model (with missing values imputed) rather than encoded.
continuous = ['minimum_nights', 'number_of_reviews', 'reviews_per_month',
'calculated_host_listings_count', 'availability_365']
The shape of the whole dataset looks like this:
data.shape
(48895, 16)
We create a binary target, target_median, for classification purposes by splitting the listings at the median price. It is important to note that target_median is a balanced binary target, allowing us to safely use accuracy as an effective performance measurement: counting the values shows an almost equal number of cases in the positive and negative classes.
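The exact code that builds target_median is not shown in this article; a reasonable construction, assuming the target marks listings priced above the median (which matches the near 50/50 split below), would be:

# Assumed construction: 1 if the listing price is above the median price, else 0.
target_median = (data['price'] > data['price'].median()).astype(int)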
target_median.value_counts()
price
0    24472
1    24423
Name: count, dtype: int64
In the context of the Scikit-learn library, a transformer object is a component that is designed to perform data transformation tasks as part of a machine learning pipeline. These objects are crucial for preprocessing data, including steps such as normalization, encoding categorical variables, or reducing dimensionality.
In the next step, we will develop a series of transformers designed to preprocess the data, ensuring it is well-prepared for the analysis required for this project. These transformers will help clean, organize, and transform the raw data into a format that facilitates more accurate and insightful analysis.
categorical_onehot_encoding: This transformer is designed to perform one-hot encoding on low-cardinality categorical features. It converts categorical variables into a format that can be provided to machine learning algorithms, effectively representing each category as a binary vector.
categorical_ord_encoding: This transformer is tailored for high-cardinality categorical features and employs ordinal encoding. It maps each unique category to an integer, which keeps the number of columns small; tree-based models such as Random Forest can split on these integer codes even when the categories have no meaningful order.
numeric_passthrough: This transformer handles the continuous numerical features. It is a SimpleImputer that fills any missing values with zero and otherwise passes the features through to the next stage of the pipeline unchanged, preserving the integrity of the numerical data.
categorical_onehot_encoding = OneHotEncoder(handle_unknown='ignore')
categorical_ord_encoding = OrdinalEncoder(handle_unknown="use_encoded_value",
                                          unknown_value=np.nan)
numeric_passthrough = SimpleImputer(strategy="constant", fill_value=0)
The code creates a ColumnTransformer object that is designed to manage different types of features in a dataset by applying specific transformations tailored to each subset. It one-hot encodes the low-cardinality categorical features, ordinally encodes the high-cardinality 'neighbourhood' feature, and passes the continuous numerical features through with only zero imputation for missing values, so that every retained column ends up in a format machine learning algorithms can use.
The transformer is configured to exclude any features that are not explicitly included in the defined transformation steps, thus maintaining a clean and relevant set of output features. Additionally, the output feature names will be concise and clear, making it easier to interpret the results. To ensure the output is always in a usable format, the sparse_threshold parameter is set to zero, guaranteeing that the transformer will return dense arrays, regardless of the input data's sparsity.
column_transform = ColumnTransformer(
[('low_card_categories', categorical_onehot_encoding, low_card_categorical),
('high_card_categories', categorical_ord_encoding, high_card_categorical),
('numeric', numeric_passthrough, continuous),
],
remainder='drop',
verbose_feature_names_out=False,
sparse_threshold=0.0)
K-fold cross-validation is a powerful technique used to evaluate the performance of a machine learning model. It involves dividing the available training dataset into k distinct partitions or "folds." The process begins by training the model k times, where in each iteration, the model is trained on k-1 of these partitions while reserving the one remaining partition as a testing set. This means that each fold gets the opportunity to serve as the validation set once, allowing for a comprehensive assessment of the model’s performance.
Once all k models have been trained and evaluated, we calculate the average of the performance scores obtained from each fold. Additionally, we assess the standard deviation of these scores to gauge the consistency of the model’s performance across the different subsets of data. This statistical approach not only provides a more reliable estimate of how the model is likely to perform on unseen data but also quantifies the uncertainty surrounding this estimate, giving insights into the model's robustness and generalizability.
We are setting up a RandomForestClassifier, a popular ensemble learning method for classification tasks. In this implementation, we are utilizing 300 estimators, which means the model will build 300 individual decision trees to enhance overall predictive accuracy and robustness. Additionally, we have specified that the minimum number of samples required to be in a leaf node is 3. This parameter helps to prevent overfitting by ensuring that each leaf has a sufficient number of samples, promoting generalization in our model's predictions.
accuracy = make_scorer(accuracy_score)
cv = KFold(5, shuffle=True, random_state=0)
model = RandomForestClassifier(n_estimators=300,
min_samples_leaf=3,
random_state=0)
Before assembling the final pipeline, the column transformer is redefined in a simpler form that keeps only the one-hot encoded low-cardinality categories and the continuous numeric features; because remainder='drop', the high-cardinality 'neighbourhood' column is left out of this model.
column_transform = ColumnTransformer(
[('categories', categorical_onehot_encoding, low_card_categorical),
('numeric', numeric_passthrough, continuous)],
remainder='drop',
verbose_feature_names_out=False,
sparse_threshold=0.0)
We then build a Scikit-learn pipeline that applies the column transformations to the dataset and feeds the result into the Random Forest classifier. The pipeline allows for seamless integration of the preprocessing steps, such as encoding categorical variables and imputing numerical features, with the training of the Random Forest model, streamlining the machine learning workflow.
model_pipeline = Pipeline(
[('processing', column_transform),
('modeling', model)])
In our analysis, we utilize Scikit-learn's cross_validate function to perform a comprehensive five-fold cross-validation. This method involves segmenting our dataset into five distinct subsets or "folds." For each iteration, we train the model on four of these folds while using the remaining fold as a validation set. This process is repeated until each fold has been used as the validation set once. Throughout this procedure, we calculate and record the accuracy scores for each fold, allowing us to assess the performance of our defined machine learning pipeline more robustly. By averaging these accuracy scores across all five folds, we obtain a reliable estimate of the model's overall effectiveness.
cv_scores = cross_validate(estimator=model_pipeline,
X=data,
y=target_median,
scoring=accuracy,
cv=cv,
return_train_score=True,
return_estimator=True)
We retrieve the mean and standard deviation of the accuracy scores from cross-validation, along with the average fit and scoring times.
mean_cv = np.mean(cv_scores['test_score'])
std_cv = np.std(cv_scores['test_score'])
fit_time = np.mean(cv_scores['fit_time'])
score_time = np.mean(cv_scores['score_time'])
print(f"{mean_cv:0.3f} ({std_cv:0.3f})",
f"fit: {fit_time:0.2f} secs pred: {score_time:0.2f} secs")
0.826 (0.004) fit: 13.86 secs pred: 0.58 secs
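Because return_estimator=True was passed to cross_validate, the fitted pipelines are available in cv_scores['estimator'], so we can also look at the feature importances mentioned at the start of this article. The following sketch, assuming a recent Scikit-learn version that supports get_feature_names_out on all the transformers used, averages the importances across the five fitted models:

# Average feature importances across the five fitted pipelines.
importances = np.mean(
    [est.named_steps['modeling'].feature_importances_
     for est in cv_scores['estimator']], axis=0)
feature_names = cv_scores['estimator'][0].named_steps['processing'].get_feature_names_out()
print(pd.Series(importances, index=feature_names).sort_values(ascending=False))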
We have successfully implemented the Random Forest algorithm using Scikit-learn.