Random Forest and the Math Behind It

kirubel Awoke

Random Forest is an ensemble machine learning method that combines \(K\) decision trees \(\{ h_k \}_{k=1}^{K}\), where each tree is trained on:

  1. A bootstrap sample from the dataset (bagging),

  2. A random subset of features at each split.

In Random Forests, we construct a large number \(K\) of decision trees \(h_1(x), h_2(x), \dots, h_K(x)\). Each tree makes a prediction for an input sample \(X\).

The ensemble prediction is made by majority vote.

Each prediction \(h_k(X) \in Y\) is a class label.
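As a minimal sketch of how the majority vote works (the names below are illustrative, and a scikit-learn-style `.predict` interface is assumed for each tree):

```python
from collections import Counter

def majority_vote(trees, x):
    """Predict a class label for one sample x by majority vote over the trees.

    Assumes each tree exposes a .predict method (illustrative interface only).
    """
    votes = [tree.predict([x])[0] for tree in trees]  # one label per tree
    return Counter(votes).most_common(1)[0][0]        # most frequent label wins
```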

Margin Function

In the context of ensemble classifiers like Random Forests, the margin function quantifies the confidence of the ensemble in its predictions.

For an input-output pair \((X, Y)\), computing the margin requires two quantities: a measure of the collective accuracy of the \(K\) classifiers on the true class, and the vote share of the strongest (most threatening) incorrect class.

The indicator-based accuracy function used in Random Forests is:

\(\text{mg}(X, Y) = \frac{1}{K}\sum_{k=1}^K I(h_k(X) = Y)\) ( 3.0 )

Components

\(mg(X, Y)\): the value of the function for inputs \(X\) and \(Y\), where:

X is the input (often representing a feature vector in classification or a data point in a model).

Y is the true label or the target output.

Summation over \(k\): the sum runs from \(k = 1\) to \(K\), meaning that \(K\) functions (or classifiers) \(h_k\) are involved.

\(h_k​(X)\): The function \(h_k\)​ represents the output of the \(k\)-th model (or classifier) when applied to the input X. This could be a decision function (tree) or a classifier output.

\(I(h_k(X) = Y)\): the term \(I\) is the indicator function, which returns:

1 if \(h_k(X) = Y\) (i.e., if the prediction of the \(k\)-th classifier matches the true label \(Y\)),

0 if \(h_k(X) \neq Y\) (i.e., if the prediction does not match the true label).
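In code, the true-class vote fraction of equation (3.0) could be computed roughly as follows (a sketch; `predictions` and `y_true` are hypothetical names for the K per-tree predictions and the true label):

```python
import numpy as np

def true_class_vote_fraction(predictions, y_true):
    """Equation (3.0): (1/K) * sum_k I(h_k(X) = Y) for a single sample."""
    predictions = np.asarray(predictions)
    return float(np.mean(predictions == y_true))  # mean of the 0/1 indicators

# Example: 3 of 4 trees predict the true label 1 -> 0.75
# true_class_vote_fraction([1, 0, 1, 1], 1)
```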

The second quantity needed to compute the margin function for a Random Forest is the vote share of the strongest incorrect class:

\(\max_{j \neq Y} \left( \frac{1}{K} \sum_{k=1}^K I(h_k(X) = j) \right)\) ( 3.1 )

Components

K: Total number of classifiers (e.g., decision trees in Random Forest).

\(h_k​(X)\): Prediction of the \(k\)-th classifier for input X.

\(I(h_k(X) = j)\): indicator function:

= 1 if classifier \(k\) predicts class \(j\)

= 0 otherwise

\(\frac{1}{K}\sum_{k=1}^{K} I(h_k(X) = j)\): fraction of classifiers that predicted class \(j\).

\(j \neq Y\): we only consider classes that are not the true label.

Based on the two equations (3.0) and (3.1), we compute the margin function:

\(\text{mg}(X, Y) = \frac{1}{K}\sum_{k=1}^K I(h_k(X) = Y) - \max_{j \neq Y} \left( \frac{1}{K} \sum_{k=1}^K I(h_k(X) = j) \right)\) ( 3.2 )

The margin measures the difference between the proportion of votes for the correct class and the highest proportion of votes for any incorrect class.

If \(mg(X, Y) > 0\), the majority voted for the correct class.

If \(mg(X, Y) < 0\), the majority voted for the wrong class (misclassification).

If \(mg(X, Y) = 0\), there is a tie or uncertainty: the prediction could go either way.
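Putting the two terms together, equation (3.2) could be sketched in code like this (illustrative names; `predictions` holds the K per-tree predictions for one sample):

```python
import numpy as np

def margin(predictions, y_true):
    """Equation (3.2): vote fraction for the true class minus the largest
    vote fraction among the incorrect classes."""
    predictions = np.asarray(predictions)
    correct = np.mean(predictions == y_true)                       # term (3.0)
    wrong = [c for c in np.unique(predictions) if c != y_true]
    worst_wrong = max((np.mean(predictions == c) for c in wrong),
                      default=0.0)                                 # term (3.1)
    return float(correct - worst_wrong)

# 55 of 100 trees vote for the true class 1, 30 for the strongest wrong class:
# margin([1]*55 + [0]*30 + [2]*10 + [3]*5, 1) -> 0.25
```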

Generalization Error

In supervised learning, the generalization error is the probability that the model makes a mistake on unseen data.

We assume \((X, Y) \sim P(X, Y)\), meaning our data pairs are random draws from a fixed but unknown joint distribution governing the problem we are solving.

First, let us see what \((X, Y) \sim P(X, Y)\) means with an example before delving into the generalization error.

Random Forests for Early Alzheimer's Detection via MRI

Design a Random Forest model that can detect early-stage Alzheimer’s disease from brain MRI scans using supervised learning. This model is to be deployed in clinical settings globally, and it must generalize robustly beyond the training hospitals.

We have the input:

  • \(X \in \mathbb{R}^{n \times d}\): vectorized brain MRI scan features (e.g., extracted texture, HOG, intensity histograms, etc.)

and the output:

\(Y \in \{0, 1, 2, 3\}\): Alzheimer's stage (0 = healthy, 1 = mild, 2 = moderate, 3 = severe)

Assumption:

\((X, Y) \sim P(X, Y)\)

This assumption says:

There exists a true but unknown statistical law P(X,Y) governing how brain scans correspond to Alzheimer’s stages across the real human population.

Definition of Generalization Error

\(PE^* = P_{X,Y}(mg(X, Y) < 0)\)

Thus, \(PE^*\) captures the expected error of the forest when deployed.
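On a held-out set, \(PE^*\) could be estimated as the fraction of samples with a negative margin. A rough sketch, reusing the `margin()` helper sketched above (array names are illustrative):

```python
import numpy as np

def estimate_generalization_error(per_tree_predictions, y_true):
    """Estimate PE* = P(mg(X, Y) < 0) on a held-out set.

    per_tree_predictions has shape (n_samples, K): the K tree predictions for
    each sample; y_true has shape (n_samples,).
    """
    margins = np.array([margin(preds, y)
                        for preds, y in zip(per_tree_predictions, y_true)])
    return float(np.mean(margins < 0))  # fraction of samples with a losing vote
```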

Example: MRI Alzheimer’s Stage Classification via Random Forest

We are classifying brain MRI scans into 4 stages:

  • Class 0: No dementia (Healthy)

  • Class 1: Mild dementia

  • Class 2: Moderate dementia

  • Class 3: Severe dementia

We trained a Random Forest of 100 decision trees, and each one was trained on a different subset of the training data via bagging, i.e., bootstrap sampling.

Each of the 100 trees makes a prediction. Counting how many trees vote for each class:

| Alzheimer Stage | Number of Trees Voting |
| --- | --- |
| ✅ Class 1 (true label \(Y = 1\)) | 55 |
| ❌ Class 0 | 30 |
| ❌ Class 2 | 10 |
| ❌ Class 3 | 5 |

Step 1: Compute the Margin

Recall the margin function from equation (3.2). For our test MRI scan:

  • K=100

  • Y=1 (true class = mild dementia)

  • Votes for Class 1: 55

  • Most popular wrong class: Class 0 (30 votes)

Then:

\(mg(X, Y=1) = \frac{55}{100} - \max\left(\frac{30}{100}, \frac{10}{100}, \frac{5}{100}\right) = 0.55 - 0.30 = 0.25\)

Positive margin → Correct classification → This patient does not contribute to generalization error \(PE^*\)

A negative margin works the same way: the strongest incorrect class receives more votes than the true class, so \(mg\) becomes negative and the sample contributes to the generalization error \(PE^*\).
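For instance, with a hypothetical vote split of 40 trees for the true class \(Y = 1\) and 45 trees for class 0 (counts invented purely for illustration):

\(mg(X, Y=1) = \frac{40}{100} - \max\left(\frac{45}{100}, \frac{10}{100}, \frac{5}{100}\right) = 0.40 - 0.45 = -0.05 < 0\)

so the majority vote misclassifies this sample and it is counted in \(PE^*\).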

Convergence of Random Forests

Leo Breiman, the creator of Random Forests, proved that as the number of trees \(K \to \infty\), the generalization error of the ensemble converges to a limit, thanks to the law of large numbers applied in the space of hypotheses.

As we increase the number of trees \(K \to \infty\), the forest's prediction converges to a limit function:

\(\hat{f}_\infty(x) = \lim_{K \to \infty} \frac{1}{K} \sum_{k=1}^K h_k(x)\)

This ensemble limit becomes stable and deterministic due to the law of large numbers — averaging over many independent trees reduces variance.
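A minimal sketch of this stabilisation, using scikit-learn on a synthetic dataset (not MRI data; the data and all parameter choices below are illustrative assumptions): as the number of trees grows, the test error settles around a fixed value rather than drifting.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic 4-class problem standing in for the MRI features.
X, y = make_classification(n_samples=2000, n_features=20, n_informative=8,
                           n_classes=4, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

for n_trees in (1, 10, 50, 100, 500):
    forest = RandomForestClassifier(n_estimators=n_trees, random_state=0)
    forest.fit(X_train, y_train)
    error = 1.0 - forest.score(X_test, y_test)  # empirical generalization error
    print(f"{n_trees:4d} trees -> test error {error:.3f}")
```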

The Convergence Theorem

The theorem says:

\(PE^* \to \mathbb{P}_{X,Y} \left( \mathbb{P}_{\Theta} (h(X, \Theta) = Y) - \max_{j \neq Y} \mathbb{P}_{\Theta} (h(X, \Theta) = j) < 0 \right)\)

Let’s deconstruct this expression term by term.

Components:

\((X, Y) \sim P(X, Y)\): random pair drawn from the joint data distribution.

\(h(X, \Theta)\): a single decision tree classifier, parameterized by a random vector \(\Theta\).

\(P_{\Theta}(h(X, \Theta) = j)\): the probability that a randomly chosen tree predicts class \(j\) for input \(X\). This probability is taken over all possible trees.

\(P_{X,Y}(\cdot)\): probability taken over the entire data distribution.

This is a probabilistic expectation or measure defined over the joint distribution of inputs X and outputs Y, which, in our domain, corresponds to:

X: MRI scan of a patient’s brain

Y: Ground-truth clinical stage of Alzheimer's (e.g., normal, mild cognitive impairment, moderate dementia, etc.)

It's the joint probability distribution over the input space \(X\) (images) and output space \(Y\) (labels).

It describes the true, unknown data-generating process from which all our samples are drawn.

It is not the training set; it is the entire universe of all possible (MRI, diagnosis) pairs.

Strength and Correlation

1. Strength \(s\): "Expected Margin"

The strength of a Random Forest is defined as the expected margin over the true distribution \(P_{X,Y}\):

\(s = E_{X,Y}[mr(X, Y)]\)

Where:

\(mr(X, Y)\): the margin function for sample \((X, Y)\) (in Breiman's formulation, \(mr\) is the margin computed from the tree-level probabilities \(P_\Theta\), i.e., the limit of \(mg\) as the number of trees grows).

The margin quantifies how much more the correct class is voted for than the strongest incorrect class — a proxy for classification confidence.
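A rough sketch of how \(s\) could be estimated on a held-out set, using the per-tree predictions of a fitted scikit-learn forest and the `margin()` helper sketched earlier (assumes class labels are already encoded as 0..n_classes-1, as in the synthetic example above):

```python
import numpy as np

def estimate_strength(forest, X_test, y_test):
    """Estimate s = E[mr(X, Y)] as the average margin over a held-out set."""
    # forest.estimators_ are the individual fitted trees; each predicts class
    # indices, which match the labels here because they are already 0-based.
    per_tree = np.column_stack([tree.predict(X_test) for tree in forest.estimators_])
    margins = [margin(preds, y) for preds, y in zip(per_tree, y_test)]
    return float(np.mean(margins))
```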

2. Correlation \(\bar{\rho}\): "Tree Agreement on Mistakes"

Correlation measures the statistical dependency between the errors of different trees.

Let :

\(h_i​(X)\): prediction from tree \(i\)

Define the error indicator \(Z_i(X,Y) = \mathbb{1}[h_i(X) \neq Y]\).

Then the correlation between the errors of trees \(i\) and \(j\) is:

\(\rho_{ij} = \text{Corr}(Z_i(X,Y), Z_j(X,Y))\)
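A sketch of estimating the average \(\rho_{ij}\) empirically from the same per-tree predictions (again assuming a fitted scikit-learn forest and 0-based integer labels; all names are illustrative):

```python
import numpy as np

def mean_error_correlation(forest, X_test, y_test):
    """Estimate the mean pairwise correlation between the trees' error
    indicators Z_i = 1[h_i(X) != Y] on a held-out set."""
    per_tree = np.column_stack([tree.predict(X_test) for tree in forest.estimators_])
    Z = (per_tree != np.asarray(y_test)[:, None]).astype(float)  # shape (n, K)
    corr = np.corrcoef(Z, rowvar=False)       # K x K correlation matrix
    k = corr.shape[0]
    off_diag = corr[~np.eye(k, dtype=bool)]   # all rho_ij with i != j
    return float(np.nanmean(off_diag))        # NaNs arise if a tree never errs
```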
