Data Balancing in ML using SMOTE

kirubel Awoke
3 min read

As we saw earlier, one of the methods for data balancing in ML is naive random over-sampling: new samples are generated for the underrepresented classes by randomly duplicating samples from the minority classes until all classes are equally represented.
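For reference, here is a minimal sketch of this naive approach using imbalanced-learn's RandomOverSampler. The dataset below is artificial and its parameters are arbitrary, chosen only to illustrate the class counts before and after resampling:

```python
from collections import Counter

from sklearn.datasets import make_classification
from imblearn.over_sampling import RandomOverSampler

# Build an artificial ~90/10 imbalanced binary dataset (values chosen arbitrarily).
X, y = make_classification(
    n_samples=1000, n_features=4, n_informative=3, n_redundant=1,
    weights=[0.9, 0.1], random_state=42,
)
print("before:", Counter(y))      # roughly {0: 900, 1: 100}

# Randomly duplicate minority samples until the classes are balanced.
ros = RandomOverSampler(random_state=42)
X_res, y_res = ros.fit_resample(X, y)
print("after: ", Counter(y_res))  # both classes now have the majority count
```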

Now we will look at another, more robust method called the Synthetic Minority Oversampling Technique (SMOTE).

SMOTE is an over-sampling technique that synthetically generates new minority-class samples rather than duplicating existing ones. Instead of duplication, SMOTE interpolates between existing minority samples.

Mathematical Formulation

Let:

  • \( x_i \in \mathbb{R}^d \) : a sample from the minority class.

  • \( x_{\text{nn}} \) : one of the \( k \) -nearest neighbors of \( x_i \) (also from the minority class).

  • \( \lambda \sim \mathcal{U}(0,1) \) : a random scalar from a uniform distribution over \( [0,1] \) .

A new synthetic sample \( x_{\text{new}} \) is generated via linear interpolation:

$$x_{\text{new}} = x_i + \lambda (x_{\text{nn}} - x_i)$$

The expression generates new points along the line segment between a minority point \(x_i\) and one of its neighbors \(x_{\text{nn}}\), i.e. interpolated synthetic data, rather than duplicates as in RandomOverSampler.

The parameter \( k \) defines the number of neighbors used for interpolation.

More neighbors give more diversity in the synthetic samples; however, a large \(k\) increases the risk that newly generated synthetic samples fall into another cluster or class. This is why SMOTE interpolates within local minority-class neighborhoods.
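To make the formula concrete, here is a minimal sketch of the interpolation step itself using NumPy and scikit-learn's NearestNeighbors. This is not imbalanced-learn's internal code; names such as `X_min` and `n_new` are purely illustrative:

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(0)

# X_min: minority-class samples only (toy 2-D data for illustration).
X_min = rng.normal(loc=0.0, scale=1.0, size=(20, 2))
k = 5        # number of nearest neighbors considered
n_new = 10   # number of synthetic samples to generate

# k + 1 because the nearest neighbor of a point in X_min is the point itself.
nn = NearestNeighbors(n_neighbors=k + 1).fit(X_min)
_, neighbor_idx = nn.kneighbors(X_min)
neighbor_idx = neighbor_idx[:, 1:]  # drop the self-neighbor

synthetic = []
for _ in range(n_new):
    i = rng.integers(len(X_min))        # pick a minority sample x_i
    j = rng.choice(neighbor_idx[i])     # pick one of its k minority neighbors x_nn
    lam = rng.uniform(0.0, 1.0)         # lambda ~ U(0, 1)
    synthetic.append(X_min[i] + lam * (X_min[j] - X_min[i]))  # x_new = x_i + lambda (x_nn - x_i)

X_synthetic = np.vstack(synthetic)
print(X_synthetic.shape)  # (10, 2)
```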

\(\lambda \sim \mathcal{U}(0,1)\) - Random Interpolation Coefficient

This is a scalar drawn from the continuous uniform distribution.

$$\lambda \sim \mathcal{U}(0,1), \quad \lambda \in [0,1]$$

It controls how far between \( x_i \) and \( x_{\text{nn}} \) the new synthetic point is placed.

\(\lambda = 0 \Rightarrow x_{\text{new}} = x_i\) (no interpolation)

\(\lambda = 1 \Rightarrow x_{\text{new}} = x_{\text{nn}}\) (full interpolation)

\(\lambda = 0.5 \Rightarrow\) midpoint between \(x_i\) and \(x_{\text{nn}}\)
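For example, assuming \(x_i = (0, 0)\) and \(x_{\text{nn}} = (2, 4)\), a draw of \(\lambda = 0.5\) gives

$$x_{\text{new}} = (0, 0) + 0.5\,\big((2, 4) - (0, 0)\big) = (1, 2),$$

the midpoint of the segment.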

Geometric interpretation

The vector \(x_{\text{nn}} - x_i\) points from \(x_i\) toward a neighbor. Scaling it by \(\lambda \in [0,1]\) gives a point along the line segment joining them. Hence:

\(x_{\text{new}} \in \text{Convex Hull}(x_i, x_{\text{nn}})\)

This preserves the local topology of the minority class and generates synthetic data in a controlled region of feature space.

SMOTE has three additional variants:

BorderlineSMOTE, SVMSMOTE, KMeansSMOTE
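In imbalanced-learn these variants are drop-in replacements for the plain SMOTE sampler. A rough usage sketch follows; the dataset is again artificial, and the exact resampled counts depend on it:

```python
from collections import Counter

from sklearn.datasets import make_classification
from imblearn.over_sampling import SMOTE, BorderlineSMOTE, SVMSMOTE, KMeansSMOTE

# Artificial ~90/10 imbalanced dataset, for illustration only.
X, y = make_classification(
    n_samples=1000, n_features=4, n_informative=3, n_redundant=1,
    weights=[0.9, 0.1], random_state=42,
)

samplers = [
    SMOTE(k_neighbors=5, random_state=42),
    BorderlineSMOTE(k_neighbors=5, random_state=42),
    SVMSMOTE(k_neighbors=5, random_state=42),
]
# KMeansSMOTE is used the same way, but it clusters the data first and may need
# its cluster_balance_threshold tuned on small or very skewed datasets.
kmeans_smote = KMeansSMOTE(random_state=42)

for sampler in samplers:
    X_res, y_res = sampler.fit_resample(X, y)
    print(type(sampler).__name__, Counter(y_res))
```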

BorderlineSMOTE

It improves upon standard SMOTE by focusing synthetic sample generation on the borderline minority samples (those near the class boundary), where the risk of misclassification is high.

Idea of BorderlineSMOTE

Let \(x_i \in \mathbb{R}^d\) be a minority-class sample. Its neighborhood is analyzed using k-nearest neighbors (k-NN).

BorderlineSMOTE classifies each minority sample into one of three types:

1. Safe: most neighbors are minority.

2. Danger: about half or more of the neighbors are majority, i.e. the sample lies near the decision boundary.

3. Noise: all of the neighbors are majority, suggesting a mislabeled sample or an outlier.

BorderlineSMOTE uses “danger” points to generate new synthetic data.

Algorithm: BorderlineSMOTE Step-by-Step

Let:

\(D_{\text{min}}\) = set of minority-class samples

\(k\) = number of neighbors

Step 1: Identify “Danger” Samples

Classification of Minority Samples:
For each \( x_i \in D_{\text{min}} \):

  1. Find its \( k \) -nearest neighbors.

  2. Count majority-class neighbors \( m \) .

  3. Classify \( x_i \) as:

    • Safe: \( m < \frac{k}{2} \)

    • Danger: \( \frac{k}{2} \leq m < k \)

    • Noise: \( m = k \)

Only the danger samples are retained for oversampling.
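A minimal sketch of this classification step with scikit-learn's NearestNeighbors (illustrative code, not imbalanced-learn's internals):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.neighbors import NearestNeighbors

# Artificial imbalanced dataset; class 1 is the minority.
X, y = make_classification(
    n_samples=1000, n_features=4, n_informative=3, n_redundant=1,
    weights=[0.9, 0.1], random_state=42,
)
minority_label = 1
X_min = X[y == minority_label]
k = 5

# Neighbors are searched in the FULL dataset; k + 1 because the closest
# neighbor of each minority point is the point itself.
nn = NearestNeighbors(n_neighbors=k + 1).fit(X)
_, idx = nn.kneighbors(X_min)
idx = idx[:, 1:]                              # drop the self-neighbor
m = (y[idx] != minority_label).sum(axis=1)    # majority neighbors per minority sample

safe = X_min[m < k / 2]
danger = X_min[(m >= k / 2) & (m < k)]
noise = X_min[m == k]
print(len(safe), len(danger), len(noise))
```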

Step 2: Generate Synthetic Samples

For each danger sample \(x_i\):

Choose one of its minority-class neighbors \(x_{\text{nn}}\).

Generate a synthetic sample as:

Synthetic Sample Generation (for Danger points only):

$$x_{\text{new}} = x_i + \lambda(x_{\text{nn}} - x_i)$$
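Continuing the sketch from Step 1, the generation step restricted to danger points might look like this; the helper name `generate_from_danger` and the toy points are made up for illustration:

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def generate_from_danger(danger, X_min, k=5, n_new=20, seed=0):
    """Interpolate new samples between danger points and their minority-class neighbors."""
    rng = np.random.default_rng(seed)
    # k + 1 neighbors: danger points belong to X_min, so the nearest neighbor is the point itself.
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X_min)
    _, idx = nn.kneighbors(danger)
    idx = idx[:, 1:]
    synthetic = []
    for _ in range(n_new):
        i = rng.integers(len(danger))          # pick a danger sample x_i
        x_nn = X_min[rng.choice(idx[i])]       # one of its minority neighbors
        lam = rng.uniform()                    # lambda ~ U(0, 1)
        synthetic.append(danger[i] + lam * (x_nn - danger[i]))  # x_new = x_i + lambda (x_nn - x_i)
    return np.vstack(synthetic)

# Toy example: five hand-written minority points, two of them marked "danger".
X_min = np.array([[0.0, 0.0], [0.2, 0.1], [1.0, 1.0], [1.1, 0.9], [2.0, 2.0]])
danger = np.array([[1.0, 1.0], [1.1, 0.9]])
print(generate_from_danger(danger, X_min, k=3, n_new=5))
```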
