🧼 Data Preprocessing in Machine Learning: Clean Data, Better Models

Table of contents
- 🧠 Introduction
- 🧱 Why Preprocessing Matters
- 🧼 Common Preprocessing Steps
- 1. 🧩 Handling Missing Values
- 2. 🔡 Encoding Categorical Variables
- 3. ⚖️ Feature Scaling
- 4. 🚨 Outlier Detection and Removal
- 5. 🧪 Train-Test Split
- 🔁 Feature Transformation (Optional)
- ⚖️ Handling Imbalanced Data (Optional)
- 🧠 Correlation Matrix and Feature Selection (Optional)
- 🛠️ Preprocessing with Pipelines (Optional)
- 🧪 Python Code Example
- 🧩 Final Thoughts
- 📬 Subscribe

“80% of machine learning is data preparation — not model tuning.”
— Tilak Savani
🧠 Introduction
Before you feed data into a machine learning model, you must clean and prepare it. This step is called data preprocessing.
It ensures your model performs well, generalizes to new data, and gives trustworthy results.
Even the most powerful algorithms like Random Forest or Neural Networks can’t save a bad dataset.
🧱 Why Preprocessing Matters
✅ Reduces noise and inconsistency
✅ Improves model performance and accuracy
✅ Prevents bias due to missing or skewed data
✅ Enables learning from different feature types (text, numbers, categories)
🧼 Common Preprocessing Steps
1. 🧩 Handling Missing Values
Real-world data is rarely perfect. You’ll often encounter NaN, blanks, or NULLs.
Options:
- Remove rows or columns with too many missing values
- Fill with:
  - Mean / Median (numerical)
  - Mode (categorical)
  - Custom values or interpolation
# Fill missing ages with the column mean (assign back instead of relying on inplace)
df['age'] = df['age'].fillna(df['age'].mean())
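If you prefer to do the same thing inside scikit-learn, SimpleImputer covers the mean/median/mode strategies. A minimal sketch, assuming numeric columns named 'age' and 'salary':
from sklearn.impute import SimpleImputer
imputer = SimpleImputer(strategy='median')  # or 'mean', 'most_frequent', 'constant'
df[['age', 'salary']] = imputer.fit_transform(df[['age', 'salary']])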
2. 🔡 Encoding Categorical Variables
Most ML models can’t work with raw strings, so categories have to be converted into numbers.
Techniques:
- Label Encoding: assigns 0, 1, 2, ... to categories (works for ordinal data)
- One-Hot Encoding: creates a separate binary column per category (best for nominal data)
# One-hot encoding: replace the 'gender' column with binary dummy columns
df = pd.get_dummies(df, columns=['gender'], drop_first=True)
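For ordinal data, scikit-learn’s OrdinalEncoder lets you spell out the category order explicitly. A minimal sketch, assuming a hypothetical 'size' column with ordered levels:
from sklearn.preprocessing import OrdinalEncoder
encoder = OrdinalEncoder(categories=[['small', 'medium', 'large']])  # explicit order -> 0, 1, 2
df[['size']] = encoder.fit_transform(df[['size']])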
3. ⚖️ Feature Scaling
Features should be on similar scales, especially for models like SVM, KNN, or Logistic Regression.
Common methods:
- Min-Max Scaling (0 to 1)
- Standardization (mean = 0, std = 1)
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
scaled = scaler.fit_transform(df[['height', 'weight']])
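For Min-Max Scaling, the equivalent sketch (same assumed 'height' and 'weight' columns) uses MinMaxScaler:
from sklearn.preprocessing import MinMaxScaler
minmax = MinMaxScaler()  # rescales each feature to the [0, 1] range
scaled_01 = minmax.fit_transform(df[['height', 'weight']])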
4. 🚨 Outlier Detection and Removal
Outliers can skew your model’s predictions and performance.
How to handle:
- Z-Score: drop values more than ±3 standard deviations from the mean
- IQR Method: drop anything below Q1 - 1.5×IQR or above Q3 + 1.5×IQR
import numpy as np
from scipy import stats
# Keep rows whose salary is within 3 standard deviations of the mean
df = df[np.abs(stats.zscore(df['salary'])) < 3]
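The IQR method from the list above needs only plain pandas. A minimal sketch on the same 'salary' column:
# Keep only rows whose salary falls inside the 1.5 * IQR fences
q1, q3 = df['salary'].quantile(0.25), df['salary'].quantile(0.75)
iqr = q3 - q1
df = df[(df['salary'] >= q1 - 1.5 * iqr) & (df['salary'] <= q3 + 1.5 * iqr)]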
5. 🧪 Train-Test Split
Before training, split your dataset to evaluate generalization.
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
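For classification problems with uneven class counts (see the imbalanced-data section below), passing stratify=y keeps the class proportions similar in both splits:
# Stratified split: train and test sets get roughly the same class ratios as y
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)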
🔁 Feature Transformation (Optional)
In real-world data, some features are highly skewed. Applying transformations like log, square root, or Box-Cox can make them more normally distributed — improving model performance.
import numpy as np
df['income'] = np.log1p(df['income'])  # log transform to reduce skew
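Box-Cox and the closely related Yeo-Johnson transform are available through scikit-learn’s PowerTransformer. A hedged sketch on the same assumed 'income' column (Box-Cox needs strictly positive values, Yeo-Johnson does not):
from sklearn.preprocessing import PowerTransformer
pt = PowerTransformer(method='yeo-johnson')  # use method='box-cox' only for strictly positive data
df[['income']] = pt.fit_transform(df[['income']])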
⚖️ Handling Imbalanced Data (Optional)
If one class dominates your dataset (e.g., 90% class A, 10% class B), most models will just predict the majority class.
Use SMOTE, undersampling, or class weights to fix this.
from imblearn.over_sampling import SMOTE
sm = SMOTE(random_state=42)
X_res, y_res = sm.fit_resample(X_train, y_train)
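Class weights are the lighter-weight alternative mentioned above and need no resampling at all; many scikit-learn estimators accept class_weight='balanced'. A minimal sketch:
from sklearn.linear_model import LogisticRegression
# 'balanced' reweights each class inversely to its frequency in y_train
clf = LogisticRegression(class_weight='balanced', max_iter=1000)
clf.fit(X_train, y_train)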
🧠 Correlation Matrix and Feature Selection (Optional)
Use a heatmap to spot and remove highly correlated features.
This reduces multicollinearity and improves model interpretability.
import seaborn as sns
import matplotlib.pyplot as plt
corr = df.corr(numeric_only=True)  # correlations between numeric features only
sns.heatmap(corr, annot=True, cmap='coolwarm')
plt.title("Feature Correlation Matrix")
plt.show()
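Once you have spotted highly correlated pairs, you can drop one feature from each pair programmatically. A minimal sketch, assuming a 0.9 threshold (tune it for your dataset):
import numpy as np
corr_abs = df.corr(numeric_only=True).abs()
upper = corr_abs.where(np.triu(np.ones(corr_abs.shape, dtype=bool), k=1))  # keep upper triangle only
to_drop = [col for col in upper.columns if (upper[col] > 0.9).any()]
df = df.drop(columns=to_drop)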
🛠️ Preprocessing with Pipelines (Optional)
Instead of applying every transformation by hand, wrap the steps into a scikit-learn Pipeline. It’s cleaner, repeatable, and production-ready.
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('model', LogisticRegression())
])
pipeline.fit(X_train, y_train)
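Pipelines shine when combined with ColumnTransformer, so numeric and categorical columns each get their own treatment inside one object. A hedged sketch where the column names ('age', 'salary', 'gender') are placeholders for your own data:
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.linear_model import LogisticRegression

numeric_features = ['age', 'salary']
categorical_features = ['gender']

preprocess = ColumnTransformer([
    # numeric columns: impute missing values, then standardize
    ('num', Pipeline([('impute', SimpleImputer(strategy='median')),
                      ('scale', StandardScaler())]), numeric_features),
    # categorical columns: one-hot encode, ignoring unseen categories at predict time
    ('cat', OneHotEncoder(handle_unknown='ignore'), categorical_features),
])

full_pipeline = Pipeline([
    ('preprocess', preprocess),
    ('model', LogisticRegression(max_iter=1000)),
])
full_pipeline.fit(X_train, y_train)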
🧪 Python Code Example
Here’s a mini pipeline putting all preprocessing steps together:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
# Load data
df = pd.read_csv('data.csv')
# Handle missing values
df.fillna(df.mean(numeric_only=True), inplace=True)
# Encode categorical variables
df = pd.get_dummies(df, drop_first=True)
# Remove outliers (example: salary)
from scipy import stats
df = df[(np.abs(stats.zscore(df['salary'])) < 3)]
# Split features/target
X = df.drop('target', axis=1)
y = df['target']
# Scale features
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
# Train-test split
X_train, X_test, y_train, y_test = train_test_split(X_scaled, y, test_size=0.2, random_state=42)
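One caveat worth noting: fitting the scaler on the full dataset before splitting lets information from the test rows influence the scaling parameters. A hedged, leakage-free variant splits first and fits the scaler on the training set only:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # learn mean/std from the training data only
X_test_scaled = scaler.transform(X_test)        # reuse those statistics on the test data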
🧩 Final Thoughts
Without proper preprocessing, your models may:
❌ Overfit
❌ Miss patterns
❌ Give misleading results
Data preprocessing is not optional — it’s essential.
“Better data beats a better algorithm.”
— Tilak Savani
📬 Subscribe
If you found this helpful, follow me on Hashnode for more machine learning blogs that simplify complex concepts with math, code, and real-world examples.