Data Balancing in Machine Learning

kirubel Awoke

Imbalanced classification refers to a scenario where the target classes do not have equal representation. For example, in medical diagnosis:

  • Class 0 (Rare Disease): 50 samples

  • Class 1 (Common Case): 5000 samples

Why Does Data Imbalance Matter?

  • Bias in Model Learning: Models tend to learn from the majority class and might ignore or misclassify instances from the minority class.

  • Skewed Performance Metrics: With imbalanced classes, accuracy can be a misleading metric. A model that always predicts the majority class may appear to have high accuracy, but it could fail at detecting the minority class (see the short example after this list).

  • Suboptimal Generalization: An imbalanced dataset often leads to poor model generalization, especially in rare event detection, anomaly detection, or classifying edge cases.
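As a quick, worked illustration of that accuracy trap, here is a minimal sketch. It mirrors the 50 / 5000 medical example above; the variable names are illustrative, and scikit-learn's DummyClassifier stands in for a model that always predicts the majority class.

import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, recall_score

# 50 rare-disease samples (class 0) vs. 5000 common cases (class 1)
y_example = np.array([0] * 50 + [1] * 5000)
X_example = np.zeros((len(y_example), 1))   # features are irrelevant for this illustration

# A "model" that always predicts the majority class
majority_clf = DummyClassifier(strategy="most_frequent").fit(X_example, y_example)
y_pred = majority_clf.predict(X_example)

print(accuracy_score(y_example, y_pred))             # ~0.99 -- looks impressive
print(recall_score(y_example, y_pred, pos_label=0))  # 0.0  -- every rare case is missed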

Over-sampling

Naive random over-sampling

One way to fight this issue is to generate new samples for the under-represented classes. The most naive strategy is to generate new samples by randomly sampling, with replacement, from the currently available samples.

RandomOverSampler

Purpose:

To correct class imbalance by randomly duplicating examples in the minority class until all classes are equally represented (or to a desired ratio).

It performs naive over-sampling, meaning:

  • No new data is synthesized.

  • It simply duplicates existing samples from the minority class, randomly and with replacement (illustrated in the short sketch after this list).
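Before applying it to a realistic dataset, here is a minimal toy sketch of the API (the variable names and random_state are illustrative). The default sampling_strategy='auto' equalizes all classes; for binary problems a float such as 0.5 requests a 1:2 minority-to-majority ratio instead.

import numpy as np
from collections import Counter
from imblearn.over_sampling import RandomOverSampler

X_toy = np.arange(12).reshape(-1, 1)        # 12 samples, 1 feature
y_toy = np.array([0] * 2 + [1] * 10)        # 2 minority vs. 10 majority samples

ros = RandomOverSampler(random_state=0)     # sampling_strategy='auto' by default
X_res, y_res = ros.fit_resample(X_toy, y_toy)

print(Counter(y_toy))   # Counter({1: 10, 0: 2})
print(Counter(y_res))   # Counter({1: 10, 0: 10}) -- minority duplicated, not synthesized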

We will use make_classification, a data-simulation utility in scikit-learn that generates a synthetic classification dataset, allowing controlled experimentation.

Let us first look at the data before applying RandomOverSampler.
Before resampling: there are very few data points from the minority classes.

from sklearn.datasets import make_classification
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
from collections import Counter
from imblearn.over_sampling import RandomOverSampler

# Step 1: Generate an imbalanced dataset
X, y = make_classification(n_samples=1000,
                           n_features=3,          # 3 total features
                           n_informative=3,       # all 3 informative
                           n_redundant=0,
                           n_repeated=0,
                           n_classes=3,
                           n_clusters_per_class=2,
                           class_sep=0.9,
                           weights=[0.01, 0.05, 0.94],  # Heavy class imbalance
                           random_state=42)

print(f"\n📊 Before Resampling: {sorted(Counter(y).items())}\n")

# Step 2: Plot before resampling
df_before = pd.DataFrame(X[:, :2], columns=['Feature 1', 'Feature 2'])  # Only first 2 features for 2D plotting
df_before['Label'] = y

plt.figure(figsize=(10, 6))
sns.scatterplot(data=df_before, x='Feature 1', y='Feature 2', hue='Label',
                palette='Set1', s=60, edgecolor='k', alpha=0.8)
plt.title('💡 Before Resampling (Imbalanced Data)')
plt.grid(True)
plt.legend(title='Class')
plt.tight_layout()
plt.show()
📊 Before Resampling: [(np.int64(0), 12), (np.int64(1), 55), (np.int64(2), 933)]

After resampling: The minority classes are oversampled (by duplicating points), but since the exact points are repeated, the visual appearance doesn't change significantly. The class distribution is now more balanced:
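The resampling step itself is a single fit_resample call. A minimal sketch, reusing X, y and the RandomOverSampler import from the code above (the random_state value here is an assumption):

# Step 3: Apply random over-sampling to balance all classes
ros = RandomOverSampler(random_state=42)
X_resampled, y_resampled = ros.fit_resample(X, y)

print(f"After Resampling: {sorted(Counter(y_resampled).items())}")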

After Resampling: [(np.int64(0), 933), (np.int64(1), 933), (np.int64(2), 933)]

Let us visualize the logistic regression model’s decision boundary before and after resampling, to see how resampling influences a classification model like Logistic Regression when facing class imbalance.

In classification, a decision boundary is the surface in feature space where the model’s prediction transitions from one class to another.

For a 3-class classifier, the boundary divides the 2D feature space into three colored regions — one for each class.

It shows how the model separates different classes based on input features.

import numpy as np
from sklearn.linear_model import LogisticRegression

# Train the Logistic Regression model with all 3 features
clf_original = LogisticRegression()
clf_original.fit(X, y)

# To make predictions on a 2D plane for visualization, create a meshgrid
# over the first two features
xx, yy = np.meshgrid(np.linspace(X[:, 0].min(), X[:, 0].max(), 100),
                     np.linspace(X[:, 1].min(), X[:, 1].max(), 100))

# The model expects all 3 features, so fill the 3rd feature with zeros
Z_original = clf_original.predict(np.c_[xx.ravel(), yy.ravel(), np.zeros_like(xx.ravel())])

# Reshape the prediction result to match the meshgrid
Z_original = Z_original.reshape(xx.shape)

# Visualize the decision boundary
plt.figure(figsize=(10, 6))
sns.scatterplot(x=X[:, 0], y=X[:, 1], hue=y, palette="Set1", edgecolor="k", s=60)
plt.contourf(xx, yy, Z_original, alpha=0.3, cmap='coolwarm')
plt.title('Logistic Regression (Original Data)')
plt.show()

  • In the imbalanced dataset, the decision boundary is likely to be skewed towards the dominant class.

  • Minority classes have fewer samples, so the decision boundary might be biased, not properly separating the minority class from others.

  • The dominant class skews the decision regions heavily in its favor.

  • Rare classes are often underrepresented or ignored in boundary formation.

After Resampling:

  • After using RandomOverSampler, we have balanced the class distribution.

  • The model now has more data for the minority classes, and the decision boundary adjusts accordingly.

  • The boundary tends to be fairer and more accurate in classifying all classes, including the minority ones (see the sketch below).
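For completeness, here is a minimal sketch of re-fitting the model on the balanced data and re-drawing the boundary. It assumes the X_resampled and y_resampled arrays from the resampling step and the xx, yy meshgrid created above.

# Train the same model on the resampled (balanced) data
clf_resampled = LogisticRegression()
clf_resampled.fit(X_resampled, y_resampled)

# Predict over the same meshgrid (dummy zeros for the 3rd feature, as before)
Z_resampled = clf_resampled.predict(np.c_[xx.ravel(), yy.ravel(), np.zeros_like(xx.ravel())])
Z_resampled = Z_resampled.reshape(xx.shape)

# Visualize the decision boundary after resampling
plt.figure(figsize=(10, 6))
sns.scatterplot(x=X_resampled[:, 0], y=X_resampled[:, 1], hue=y_resampled,
                palette="Set1", edgecolor="k", s=60)
plt.contourf(xx, yy, Z_resampled, alpha=0.3, cmap='coolwarm')
plt.title('Logistic Regression (Resampled Data)')
plt.show()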

In the next blog post we will look at SMOTE (Synthetic Minority Over-sampling Technique) in depth, with a detailed explanation.
