Let's bring the squad out


Because one detective isn’t always enough.
A single decision tree is like a lone detective on a case. It works well when the clues are clean and the pattern is clear.
But real-world data? It’s messy. Noisy. Conflicting. One suspect lied about their alibi. Another left prints at the scene but wasn’t involved. The patterns blur. And suddenly, our brilliant detective starts making bad calls - overfitting to edge cases, or flipping their logic over one misleading clue.
That’s when we bring in the squad.
Random Forests don’t rely on one investigator. They bring a team. A diverse set of decision trees - each trained slightly differently, each with its own perspective on the evidence. One tree might focus on motive. Another might care more about fingerprints. A third might ignore both and look at the timeline.
Individually, they might be inconsistent. But together? They vote. And that collective judgment is usually far more accurate - and far more stable - than any single tree could ever be.
It’s not just about better accuracy. It’s about better reasoning under uncertainty.
Inside the Random Forest
So how do you go from a lone detective to a full-blown investigative task force?
You don’t just clone the same tree over and over. That would give you a hundred versions of the same bias - louder, not smarter. The real strength of a Random Forest lies in diversity. Each tree brings its own take on the evidence.
Shuffle the case files
Each tree in the forest doesn’t see the entire dataset. Instead, it gets a random sample, drawn with replacement. That means some examples show up more than once. Some don’t show up at all.
This is called bootstrapping. And it ensures that each tree sees a slightly different view of the world - which helps the forest as a whole avoid overfitting to any one version of the truth.
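If you want to see what that actually looks like, here's a tiny sketch of bootstrapping in plain NumPy. The numbers are made up, and scikit-learn handles this for you automatically (bootstrap=True is the default), so this is purely for intuition.
# A minimal sketch of bootstrapping: sample row indices with replacement
import numpy as np
rng = np.random.default_rng(0)
n_rows = 10  # pretend the case file has 10 suspects
sampled = rng.integers(0, n_rows, size=n_rows)  # drawn with replacement
print("rows this tree sees:", sorted(sampled))
print("rows it never sees:", sorted(set(range(n_rows)) - set(sampled)))
Some rows repeat, some never show up at all - and that's exactly the point.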
Change the lens
When a decision tree tries to split the data - say, to separate guilty from innocent - it looks for the best question to ask based on the features it has.
But in a Random Forest, each tree doesn’t get to look at all the features when making a split.
Instead, at every split point, the tree is given a random subset of the total features, and it must choose the best split from only those. For example, if your dataset has 10 features, the tree might be allowed to choose from just 3 of them at a time.
This process is called feature bagging. It forces each tree to make slightly different decisions - even if they’ve seen similar training data.
So one tree might split first on “alibi confirmed,” while another might start with “fingerprint found.” The result is a diverse forest - where each tree builds its own logic path, and overfitting is less likely to spread across the whole model.
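In scikit-learn, this knob is the max_features parameter (for classification, recent versions default to roughly the square root of the number of features). Here's a rough, standalone sketch of the idea - the clue names below are invented for illustration:
# Rough sketch of feature bagging: each split only gets to ask about
# a random handful of clues (names here are made up)
import numpy as np
rng = np.random.default_rng(1)
all_clues = ["motive", "alibi_confirmed", "fingerprint_found", "timeline_gap", "eyewitness", "cctv"]
def clues_for_this_split(k=3):
    # the tree may only ask questions about these k clues at this split
    return list(rng.choice(all_clues, size=k, replace=False))
print(clues_for_this_split())  # a different random trio every call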
Let the team vote
Once all the trees are trained, predictions are made by passing the new input through every tree in the forest.
For classification: each tree casts a vote for a class label. The majority wins.
For regression: the predictions are averaged.
Because the trees are trained differently and split differently, they often disagree on edge cases. But when combined, their individual errors tend to cancel each other out.
This ensemble strategy leads to a final prediction that is more accurate, more stable, and more resistant to overfitting than any individual decision tree.
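Here's a toy illustration of the two combination rules, with made-up tree outputs - nothing model-specific, just the arithmetic of the vote and the average:
# Toy illustration of combining tree outputs (numbers are made up)
import numpy as np
class_votes = [1, 0, 1, 1, 0]          # classification: five trees vote guilty / not guilty
majority = max(set(class_votes), key=class_votes.count)
print("majority verdict:", majority)    # -> 1
value_estimates = [12.0, 15.5, 13.2]    # regression: three trees' numeric guesses
print("averaged estimate:", np.mean(value_estimates))
Worth knowing: scikit-learn's RandomForestClassifier actually averages the trees' predicted class probabilities rather than counting hard votes, but the intuition - many noisy opinions, one steadier answer - is the same.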
Enter: The Squad
One more robbery. Many suspects. We’re scaling up the investigation.
Each row in our dataset is a suspect. Each column is a clue: were they seen near the scene? Did they have a confirmed alibi? Were their fingerprints found?
Our goal is to train a Random Forest to predict whether someone is guilty - not by relying on a single chain of reasoning, but by bringing in 100 slightly-different trees, each trained to investigate the case from its own angle.
# Step 1: Import the libraries
import pandas as pd
import numpy as np
from sklearn.ensemble import RandomForestClassifier
# Step 2: Simulate a robbery investigation dataset
np.random.seed(42)
num_samples = 500
data = pd.DataFrame({
    "seen_near_scene": np.random.randint(0, 2, num_samples),
    "has_prior_convictions": np.random.randint(0, 2, num_samples),
    "alibi_confirmed": np.random.randint(0, 2, num_samples),
    "fingerprints_found": np.random.randint(0, 2, num_samples),
    "matches_description": np.random.randint(0, 2, num_samples),
    "caught_on_cctv": np.random.randint(0, 2, num_samples),
})
# Step 3: Define who's guilty: no alibi, seen near scene, plus some hard evidence
data["guilty"] = (
    (data["seen_near_scene"] == 1)
    & (data["alibi_confirmed"] == 0)
    & ((data["fingerprints_found"] == 1) | (data["caught_on_cctv"] == 1))
).astype(int)
# Step 4: Separate clues (features) from the outcome (guilty or not)
X = data.drop("guilty", axis=1)
y = data["guilty"]
# Step 5: Train the investigative task force
forest = RandomForestClassifier(n_estimators=100, random_state=42)
forest.fit(X, y)
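With the squad trained, we can put a new suspect in front of it. A short continuation of the code above - the suspect's clue values here are made up for illustration:
# Step 6: Question a new suspect (clue values are made up for illustration)
new_suspect = pd.DataFrame([{
    "seen_near_scene": 1,
    "has_prior_convictions": 0,
    "alibi_confirmed": 0,
    "fingerprints_found": 1,
    "matches_description": 1,
    "caught_on_cctv": 0,
}])
print("verdict:", forest.predict(new_suspect)[0])                     # 1 = guilty, 0 = not guilty
print("class probabilities:", forest.predict_proba(new_suspect)[0])   # averaged across the trees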
When the Squad Gets It Wrong
Random Forests are powerful. They reduce overfitting, handle noisy data better, and rarely fall for the same trap twice. But they’re not perfect.
Sometimes, even a hundred detectives can talk each other into the wrong conclusion.
They still overfit - just less
Random Forests are far more stable than a single decision tree, but overfitting can still happen - especially if the trees are deep and the dataset is small or noisy.
Each tree in the forest might overfit a little, and while averaging helps, it can’t fix everything. If all your trees are memorizing noise, the forest will too - just more politely.
To avoid this, you’ll still want to set limits:
max_depth: how far each tree is allowed to dig
min_samples_split: the smallest group a node can try to split
max_features: how many clues a tree gets to consider at each decision point
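In code, those guardrails are just constructor arguments. For example (the specific numbers here are arbitrary, not tuned recommendations):
# A more constrained squad - same forest, with limits on how much each tree can memorize
from sklearn.ensemble import RandomForestClassifier
forest = RandomForestClassifier(
    n_estimators=100,
    max_depth=5,            # each tree can only dig five questions deep
    min_samples_split=10,   # don't try to split groups smaller than ten suspects
    max_features="sqrt",    # each split sees only ~sqrt(n_features) of the clues
    random_state=42,
)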
They're harder to interpret (and that can matter)
A single decision tree is easy to follow - you can trace exactly how it reached a decision: one question at a time.
A Random Forest? Not really.
Since it’s made up of dozens (or hundreds) of different trees, each trained on different slices of data and considering different features, there’s no single clear path you can point to for a specific decision. The final output is just the result of a vote.
And in many cases, that’s fine.
But in domains where decisions need to be explained - like healthcare, finance, or criminal justice - this becomes a real limitation. If a model denies someone a loan or predicts high medical risk, people will want to know why.
Random Forests can tell you which features mattered on average - but not why this prediction happened for this input.
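For the "on average" part, a fitted forest does expose impurity-based feature importances. A quick look, assuming the forest and X from the investigation code above:
# Which clues the forest leaned on overall (impurity-based importances)
importances = pd.Series(forest.feature_importances_, index=X.columns)
print(importances.sort_values(ascending=False))
That tells you what the squad tends to rely on across all cases - still not why it flagged one particular suspect.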
So if you're working in a setting where interpretability is part of the job, a Random Forest might not be your best lead.
They get heavy fast
Training 100 trees, each with their own bootstrapped data and random feature splits? That’s computationally expensive.
Random Forests are slower to train and can be memory-hungry - especially with large datasets or many features. They’re still faster than deep learning models, but slower than simpler ones like logistic regression or a single tree.
You get power - but you pay for it.
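One practical lever: the trees are independent, so training parallelizes nicely. Asking scikit-learn to use every CPU core is a one-argument change to the constructor we used earlier:
# Spread tree training across all available CPU cores
from sklearn.ensemble import RandomForestClassifier
forest = RandomForestClassifier(n_estimators=100, n_jobs=-1, random_state=42)
It doesn't make the forest any lighter - just faster to build.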
They're still greedy
Each tree builds itself using the best split at every step - no looking ahead. That means the forest is still made up of trees that are locally smart, but globally short-sighted.
While randomness and voting help avoid really bad outcomes, the forest still isn’t guaranteed to find the absolute best way to structure the decision space.
It just finds a way that works well enough.
Final Verdict
Random Forests take everything that makes decision trees appealing - clear reasoning, simple logic, no math PhD required - and scale it into something much stronger.
They handle messy data. They shrug off noise. And they don’t panic when one clue doesn’t line up. Instead, they look at the case from a hundred different angles and make a call as a team.
Not perfect. But smart. And far more grounded than any one detective going solo.
Of course, it’s not all upside. You lose some clarity. You add some bulk. And you can’t always follow how they got there.
But if you want a model that just works - one that’s fast, stable, and rarely falls for the wrong lead - the forest’s a good place to start.
They may not always be right. But they rarely all agree for the wrong reasons.