decision tree ensembles and the random forest algorithm

Sanika Nandpure
2 min read

Decision trees on their own can be quite sensitive to small changes in the data. One solution is to build multiple decision trees: each tree produces an output, and the value outputted by the majority of the trees becomes the final prediction. Building multiple trees makes the model more robust to changes in the data and increases accuracy.

But how do we come up with these different decision trees if the process for creating a tree is the same?

sampling with replacement

We construct multiple random training sets using random sampling with replacement. In other words, we pick an example from the training set randomly, but then place it back in the training set before drawing again. This means that we can have duplicate examples in our modified training sets.
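To make the sampling step concrete, here is a minimal sketch in Python (NumPy is assumed, and the toy dataset is purely illustrative): drawing m indices with replacement means some examples can appear more than once while others are left out entirely.

```python
import numpy as np

rng = np.random.default_rng(0)

# toy training set of m = 5 examples (features X, labels y)
X = np.array([[1.0], [2.0], [3.0], [4.0], [5.0]])
y = np.array([0, 0, 1, 1, 1])
m = len(X)

# draw m indices with replacement; some indices may repeat, others may never appear
idx = rng.integers(0, m, size=m)
X_boot, y_boot = X[idx], y[idx]
```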

Given a training set of size m, we generate B new training sets of size m using sampling with replacement. We then train a decision tree on each new dataset, giving B distinct trees that can be bundled together into a tree ensemble. Larger B generally works better, but it comes with higher computation cost, and past a certain point increasing B yields diminishing returns. This approach is often called bagged decision trees.

A better version of the bagged decision tree algorithm is the random forest algorithm. The key observation is that even with the sampling-with-replacement procedure, you often end up with very similar decision trees that split on the same features. The random forest algorithm therefore makes one modification to the bagged decision tree algorithm to further randomize the feature choice: at each node, if n features are available, pick a random subset of k < n features (a common choice is k ≈ √n) and allow the algorithm to split only on features from that subset. This reduces the chance of producing many near-identical trees.
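The whole procedure can be sketched in a few lines. This is a rough illustration rather than a reference implementation: it assumes scikit-learn's DecisionTreeClassifier, whose max_features argument restricts each split to a random subset of features, and it uses a simple majority vote over integer class labels for prediction.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def random_forest_fit(X, y, B=100, random_state=0):
    """Train B trees, each on a bootstrap sample of size m,
    with each split restricted to a random subset of features."""
    rng = np.random.default_rng(random_state)
    m = len(X)
    trees = []
    for _ in range(B):
        idx = rng.integers(0, m, size=m)              # sample m examples with replacement
        tree = DecisionTreeClassifier(
            max_features="sqrt",                      # consider k ~ sqrt(n) features per split
            random_state=int(rng.integers(1_000_000)),
        )
        tree.fit(X[idx], y[idx])
        trees.append(tree)
    return trees

def random_forest_predict(trees, X):
    """Majority vote over the trees' predictions (assumes integer class labels)."""
    votes = np.stack([t.predict(X) for t in trees])   # shape (B, n_samples)
    return np.apply_along_axis(
        lambda col: np.bincount(col.astype(int)).argmax(), 0, votes
    )
```

Setting max_features=None instead would give plain bagged decision trees, since every split could then consider all n features.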

A decision tree ensemble created using the random forest algorithm is more robust to changes in the data because the training procedure has already explored many different variations of the dataset, so an additional change does not affect the output as much.
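In practice you rarely write this loop yourself. For example, scikit-learn's RandomForestClassifier bundles the bootstrap sampling, feature subsetting, and voting, with B exposed as n_estimators and k as max_features. A brief usage sketch on synthetic data (all parameter values here are illustrative):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# B = 200 trees; each split considers a random subset of sqrt(20) ≈ 4 features
forest = RandomForestClassifier(n_estimators=200, max_features="sqrt", random_state=0)
forest.fit(X_train, y_train)
print(forest.score(X_test, y_test))
```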
