Data Balancing in Machine Learning

Imbalanced classification refers to a scenario where the target classes do not have equal representation. For example, in medical diagnosis:
Class 0 (Rare Disease): 50 samples
Class 1 (Common Case): 5000 samples
Why Does Data Imbalance Matter?
Bias in Model Learning: Models tend to learn from the majority class and might ignore or misclassify instances from the minority class.
Skewed Performance Metrics: With imbalanced classes, accuracy can be a misleading metric. A model that always predicts the majority class may appear to have high accuracy, yet completely fail at detecting the minority class (see the sketch after this list).
Suboptimal Generalization: An imbalanced dataset often leads to poor model generalization, especially in rare event detection, anomaly detection, or classifying edge cases.
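To make the accuracy pitfall concrete, here is a minimal sketch. The 95/5 split and the always-majority baseline (scikit-learn's DummyClassifier) are illustrative assumptions, not the dataset used later in this post:
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, recall_score

# Illustrative 95/5 binary imbalance
X_demo, y_demo = make_classification(n_samples=1000, weights=[0.95, 0.05], random_state=42)

# A baseline that always predicts the majority class
baseline = DummyClassifier(strategy='most_frequent')
baseline.fit(X_demo, y_demo)
y_pred = baseline.predict(X_demo)

print(f"Accuracy: {accuracy_score(y_demo, y_pred):.2f}")       # ~0.95 -- looks impressive
print(f"Minority recall: {recall_score(y_demo, y_pred):.2f}")  # 0.00 -- detects nothing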
Over-sampling
Naive random over-sampling
One way to fight this issue is to generate new samples for the under-represented classes. The most naive strategy is to generate new samples by randomly sampling, with replacement, from the currently available samples.
RandomOverSampler
Purpose:
To correct class imbalance by randomly duplicating examples in the minority class until all classes are equally represented (or to a desired ratio).
It performs naive over-sampling, meaning:
No new data is synthesized.
It simply duplicates existing samples from the minority class randomly with replacement.
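In code, the whole workflow is a single fit_resample call. Here is a minimal sketch on a tiny hand-made dataset (the toy arrays and variable names are illustrative):
import numpy as np
from collections import Counter
from imblearn.over_sampling import RandomOverSampler

# Tiny illustrative dataset: 4 majority samples (class 0) vs 2 minority samples (class 1)
X_toy = np.array([[1.0], [1.1], [0.9], [1.2], [5.0], [5.2]])
y_toy = np.array([0, 0, 0, 0, 1, 1])

ros = RandomOverSampler(random_state=42)
X_res, y_res = ros.fit_resample(X_toy, y_toy)

print(Counter(y_toy))  # Counter({0: 4, 1: 2})
print(Counter(y_res))  # Counter({0: 4, 1: 4}) -- minority rows duplicated verbatim
Note that every row of X_res is one of the original six rows or an exact copy of a minority row; nothing new is synthesized.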
We will use make_classification, a data-simulation utility function in scikit-learn that generates a synthetic classification dataset, allowing controlled experimentation.
Let us first look at the output before applying RandomOverSampler to the data.
Before resampling: there are very few data points from the minority classes.
from sklearn.datasets import make_classification
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
from collections import Counter
from imblearn.over_sampling import RandomOverSampler

# Step 1: Generate an imbalanced dataset
X, y = make_classification(n_samples=1000,
                           n_features=3,        # 3 total features
                           n_informative=3,     # all 3 informative
                           n_redundant=0,
                           n_repeated=0,
                           n_classes=3,
                           n_clusters_per_class=2,
                           class_sep=0.9,
                           weights=[0.01, 0.05, 0.94],  # heavy class imbalance
                           random_state=42)

print(f"\n📊 Before Resampling: {sorted(Counter(y).items())}\n")

# Step 2: Plot before resampling
df_before = pd.DataFrame(X[:, :2], columns=['Feature 1', 'Feature 2'])  # only first 2 features for 2D plotting
df_before['Label'] = y

plt.figure(figsize=(10, 6))
sns.scatterplot(data=df_before, x='Feature 1', y='Feature 2', hue='Label',
                palette='Set1', s=60, edgecolor='k', alpha=0.8)
plt.title('💡 Before Resampling (Imbalanced Data)')
plt.grid(True)
plt.legend(title='Class')
plt.tight_layout()
plt.show()
📊 Before Resampling: [(np.int64(0), 12), (np.int64(1), 55), (np.int64(2), 933)]
After resampling: the minority classes are oversampled by duplicating existing points, so while the class distribution becomes balanced, the visual appearance of the scatter plot barely changes, since the duplicated points sit exactly on top of the originals.
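Here is a sketch of the resampling step that produces the counts below, continuing from the code above (the names ros, X_resampled, and y_resampled are our choices):
# Step 3: Apply RandomOverSampler (duplicates minority samples with replacement)
ros = RandomOverSampler(random_state=42)
X_resampled, y_resampled = ros.fit_resample(X, y)

print(f"\nAfter Resampling: {sorted(Counter(y_resampled).items())}\n")

# Step 4: Plot after resampling -- duplicated points overlap exactly, so the plot looks similar
df_after = pd.DataFrame(X_resampled[:, :2], columns=['Feature 1', 'Feature 2'])
df_after['Label'] = y_resampled

plt.figure(figsize=(10, 6))
sns.scatterplot(data=df_after, x='Feature 1', y='Feature 2', hue='Label',
                palette='Set1', s=60, edgecolor='k', alpha=0.8)
plt.title('💡 After Resampling (Balanced Data)')
plt.grid(True)
plt.legend(title='Class')
plt.tight_layout()
plt.show()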
After Resampling: [(np.int64(0), 933), (np.int64(1), 933), (np.int64(2), 933)]
Let us visualize the difference in the logistic regression model's decision boundary before and after resampling, to see how resampling influences the decision boundary of a classification model when it faces class imbalance.
In classification, a decision boundary is the surface in feature space where the model's prediction transitions from one class to another.
For a 3-class classifier, the boundaries divide the 2D feature space into three colored regions, one for each class.
They show how the model separates the different classes based on the input features.
# Train the Logistic Regression model with all 3 features
import numpy as np
from sklearn.linear_model import LogisticRegression

clf_original = LogisticRegression()
clf_original.fit(X, y)

# To visualize predictions on a 2D plane, build a meshgrid over the first two features
xx, yy = np.meshgrid(np.linspace(X[:, 0].min(), X[:, 0].max(), 100),
                     np.linspace(X[:, 1].min(), X[:, 1].max(), 100))

# The model expects all 3 features, so fill the 3rd feature with zeros as a dummy value
Z_original = clf_original.predict(np.c_[xx.ravel(), yy.ravel(), np.zeros_like(xx.ravel())])

# Reshape the prediction result back to the grid shape
Z_original = Z_original.reshape(xx.shape)

# Visualize the decision boundary
plt.figure(figsize=(10, 6))
sns.scatterplot(x=X[:, 0], y=X[:, 1], hue=y, palette="Set1", edgecolor="k", s=60)
plt.contourf(xx, yy, Z_original, alpha=0.3, cmap='coolwarm')
plt.title('Logistic Regression (Original Data)')
plt.show()
In the imbalanced dataset, the decision boundary is likely to be skewed towards the dominant class.
Minority classes have fewer samples, so the decision boundary may be biased and fail to separate the minority classes cleanly from the others.
The dominant class skews the decision regions heavily in its favor.
Rare classes are often underrepresented or ignored in boundary formation.
After Resampling:
After using RandomOverSampler, we have balanced the class distribution.
The model now has more data for the minority classes, and the decision boundary adjusts accordingly.
The boundary becomes fairer and more accurate at classifying all classes, including the minority ones, as the sketch below illustrates.
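Here is a sketch of the after-resampling counterpart of the plot above (it assumes X_resampled and y_resampled from the resampling step, and reuses the meshgrid; clf_resampled is our name):
# Retrain Logistic Regression on the balanced data
clf_resampled = LogisticRegression()
clf_resampled.fit(X_resampled, y_resampled)

# Reuse the same meshgrid; the duplicated points lie in the same range as the originals
Z_resampled = clf_resampled.predict(np.c_[xx.ravel(), yy.ravel(), np.zeros_like(xx.ravel())])
Z_resampled = Z_resampled.reshape(xx.shape)

plt.figure(figsize=(10, 6))
sns.scatterplot(x=X_resampled[:, 0], y=X_resampled[:, 1], hue=y_resampled,
                palette='Set1', edgecolor='k', s=60)
plt.contourf(xx, yy, Z_resampled, alpha=0.3, cmap='coolwarm')
plt.title('Logistic Regression (Resampled Data)')
plt.show()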
In the next blog post, we will look at SMOTE (Synthetic Minority Over-sampling Technique) in depth, with a detailed explanation.