Decision Trees

garv aggarwal

A Decision Tree is a supervised machine learning algorithm used for classification and regression tasks. It works like a flowchart to make decisions based on input features by splitting data into subsets based on feature values.
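
As a quick, concrete illustration, here is a minimal sketch of fitting a decision tree with scikit-learn (the library and dataset are chosen only for illustration, not taken from this article); the fitted tree encodes exactly the flowchart of feature-based splits described above.

```python
# Minimal sketch: train a decision tree classifier with scikit-learn.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# Each internal node splits on one feature value; each leaf holds a predicted class.
clf = DecisionTreeClassifier(max_depth=3, random_state=42)
clf.fit(X_train, y_train)
print("Test accuracy:", clf.score(X_test, y_test))
```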

Terminologies :

  • Root Node:
    The topmost node that represents the entire dataset. It is split into subsets based on a feature.

  • Decision Nodes:
    Internal nodes where decisions are made to split further.

  • Leaf Nodes (Terminal Nodes):
    Nodes that represent the final output or prediction.

  • Branches:
    The outcome of a decision, leading to another node.

Advantages :

  • Minimal data preparation is required.

  • Intuitive and easy to understand.

  • The cost of using the tree for inference is logarithmic in the number of data points used to train the tree.

Disadvantages :

  • Overfitting is a major problem when training decision trees.

  • Prone to errors for imbalanced datasets.

Entropy :

Entropy is a measure of disorder; you can also think of it as a measure of purity/impurity. The purer (more homogeneous) the data, the lower the entropy, so a good split is one that reduces the entropy, i.e. the randomness, in the data (a small numerical sketch follows the list below). For data with c classes it is defined as:

$$Entropy = -\sum_{i=1}^{c} p_i \log_2 p_i$$

Where $p_i$ is simply the frequentist probability of an element/class $i$ in our data.

  • The greater the uncertainty, the higher the entropy.

  • For a 2-class problem, the minimum entropy is 0 and the maximum is 1. Entropy is 0 when all observations belong to one class, and 1 when the observations are split exactly 50-50 between the two classes.

  • For a problem with more than 2 classes, the minimum entropy is still 0, but the maximum can exceed 1.

  • Both log base 2 and log base e can be used to calculate entropy.
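
The bullet points above can be checked numerically. Below is a small sketch of a base-2 entropy function in plain NumPy (the function name and examples are illustrative, not from any particular library):

```python
import numpy as np

def entropy(labels):
    """Entropy of a label array, using log base 2."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

print(entropy(["yes", "yes", "no", "no"]))    # 1.0  (50-50 split, max for 2 classes)
print(entropy(["yes", "yes", "yes", "yes"]))  # 0.0  (all observations in one class)
print(entropy(["a", "b", "c", "d"]))          # 2.0  (with 4 classes, entropy exceeds 1)
```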

Entropy vs Probability :
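
For a two-class problem, entropy is 0 when the probability of one class is 0 or 1, and it peaks at 1 when the two classes are equally likely (p = 0.5). The sketch below (using matplotlib purely for illustration) traces that curve:

```python
import numpy as np
import matplotlib.pyplot as plt

# Entropy of a two-class distribution as a function of the probability of class 1.
p = np.linspace(0.001, 0.999, 200)
H = -(p * np.log2(p) + (1 - p) * np.log2(1 - p))

plt.plot(p, H)
plt.xlabel("Probability of class 1 (p)")
plt.ylabel("Entropy")
plt.title("Entropy vs probability for a two-class problem")
plt.show()
```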

Information Gain :

Information Gain (IG) is a key concept in Decision Tree algorithms: it is used to decide which attribute to split the data on at each step while building the tree. It measures how much "information" a feature gives us about the class, and it is based on entropy, which measures the impurity or uncertainty in a dataset. For a split of a dataset D into subsets Di by an attribute A:

$$IG(D, A) = Entropy(D) - \sum_{i} \frac{|D_i|}{|D|}\, Entropy(D_i)$$

Where:

  • D = parent dataset

  • Di = subset after split on an attribute

  • ∣Di∣/∣D∣ = weight of subset
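
A small sketch of computing information gain from the parent labels and the label groups produced by a split (the helper names are illustrative, and the base-2 entropy defined earlier is assumed):

```python
import numpy as np

def entropy(labels):
    """Base-2 entropy of a label array."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def information_gain(parent_labels, child_label_groups):
    """IG = entropy(parent) - weighted sum of child entropies."""
    n = len(parent_labels)
    weighted_children = sum(len(g) / n * entropy(g) for g in child_label_groups)
    return entropy(parent_labels) - weighted_children

# Splitting a mixed node into two pure children gives the maximum gain (1.0 here).
parent = np.array([0, 0, 1, 1])
print(information_gain(parent, [np.array([0, 0]), np.array([1, 1])]))  # 1.0
```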

Gini impurity :

Gini Impurity is another metric (besides entropy) used to decide the best attribute to split the data at each node in a Decision Tree, especially in algorithms like CART (Classification and Regression Trees). Gini Impurity measures the probability that a randomly chosen sample would be incorrectly classified if it was labeled according to the class distribution in the dataset.

$$Gini(D) = 1 - \sum_{i=1}^{c} p_i^{2}$$

Where:

  • D = dataset at a node

  • c = number of classes

  • Pi = proportion of samples belonging to class i
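
A minimal sketch of the same formula in code (an illustrative NumPy helper, not a library function):

```python
import numpy as np

def gini_impurity(labels):
    """Gini(D) = 1 - sum_i p_i^2 over the classes present at the node."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

print(gini_impurity([0, 0, 1, 1]))   # 0.5  (maximum impurity for 2 classes)
print(gini_impurity([0, 0, 0, 0]))   # 0.0  (pure node)
```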
