🧼 Data Preprocessing in Machine Learning: Clean Data, Better Models

Tilak Savani
4 min read

“80% of machine learning is data preparation — not model tuning.”
Tilak Savani



🧠 Introduction

Before you feed data into a machine learning model, you must clean and prepare it. This step is called data preprocessing.
It ensures your model performs well, generalizes to new data, and gives trustworthy results.

Even the most powerful algorithms like Random Forest or Neural Networks can’t save a bad dataset.


🧱 Why Preprocessing Matters

✅ Reduces noise and inconsistency
✅ Improves model performance and accuracy
✅ Prevents bias due to missing or skewed data
✅ Enables learning from different feature types (text, numbers, categories)


🧼 Common Preprocessing Steps

1. 🧩 Handling Missing Values

Real-world data is rarely perfect. You’ll often encounter NaN, blanks, or NULLs.

Options:

  • Remove rows or columns with too many missing values

  • Fill with:

    • Mean / Median (numerical)

    • Mode (categorical)

    • Custom values or interpolation

df['age'] = df['age'].fillna(df['age'].mean())  # mean fill for a numeric column
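
The other fill strategies follow the same pattern. Here's a minimal sketch for mode fill and interpolation, assuming hypothetical 'city' (categorical) and 'temperature' (ordered numeric) columns:

df['city'] = df['city'].fillna(df['city'].mode()[0])  # mode fill for a categorical column
df['temperature'] = df['temperature'].interpolate()   # linear interpolation for ordered data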

2. 🔡 Encoding Categorical Variables

Most ML models can't work with strings directly. Convert categories into numbers:

Techniques:

  • Label Encoding: Assigns 0, 1, 2... to categories
    (works for ordinal data)

  • One-Hot Encoding: Creates separate binary columns
    (best for nominal data)

# One-hot encoding: replaces 'gender' with binary indicator columns
df = pd.get_dummies(df, columns=['gender'], drop_first=True)
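
For ordinal data, an explicit mapping is often safer than an automatic label encoder, because you control the order yourself. A minimal sketch, assuming a hypothetical 'size' column:

# Ordinal encoding with an explicit, meaningful order
size_order = {'small': 0, 'medium': 1, 'large': 2}
df['size'] = df['size'].map(size_order)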

3. ⚖️ Feature Scaling

Features should be on similar scales, especially for distance- and gradient-based models like SVM, KNN, or Logistic Regression.

Common methods:

  • Min-Max Scaling (0 to 1)

  • Standardization (mean = 0, std = 1)

from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
scaled = scaler.fit_transform(df[['height', 'weight']])
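
Min-Max scaling uses the same fit/transform pattern. A quick sketch with the same columns:

from sklearn.preprocessing import MinMaxScaler

# Rescale each feature to the [0, 1] range
minmax = MinMaxScaler()
scaled_01 = minmax.fit_transform(df[['height', 'weight']])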

4. 🚨 Outlier Detection and Removal

Outliers can skew your model’s predictions and performance.

How to handle:

  • Z-Score: Values beyond ±3 standard deviations

  • IQR Method: Anything outside Q1 - 1.5×IQR or Q3 + 1.5×IQR

import numpy as np
from scipy import stats

# Keep rows whose salary z-score is within ±3 standard deviations
df = df[np.abs(stats.zscore(df['salary'])) < 3]
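
The IQR method can be written just as compactly. A minimal sketch on the same 'salary' column:

# IQR method: keep rows inside [Q1 - 1.5×IQR, Q3 + 1.5×IQR]
q1, q3 = df['salary'].quantile([0.25, 0.75])
iqr = q3 - q1
df = df[df['salary'].between(q1 - 1.5 * iqr, q3 + 1.5 * iqr)]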

5. 🧪 Train-Test Split

Before training, split your dataset to evaluate generalization.

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
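
For classification, it's usually worth preserving the class proportions in both splits. Same function, one extra argument:

# stratify=y keeps the class ratio identical in train and test sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)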

🔁 Feature Transformation (Optional)

In real-world data, some features are highly skewed. Applying transformations like log, square root, or Box-Cox can make them more normally distributed — improving model performance.

import numpy as np

df['income'] = np.log1p(df['income'])  # log transform to reduce skew
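
Box-Cox, mentioned above, fits its transformation parameter from the data; note that it requires strictly positive values. A minimal sketch, applied to the raw (untransformed) column:

from scipy import stats

# Returns the transformed values and the fitted lambda parameter
df['income'], fitted_lambda = stats.boxcox(df['income'])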

⚖️ Handling Imbalanced Data (Optional)

If one class dominates your dataset (e.g., 90% class A, 10% class B), most models will just predict the majority class.
Use SMOTE, undersampling, or class weights to fix this.

from imblearn.over_sampling import SMOTE
sm = SMOTE(random_state=42)
X_res, y_res = sm.fit_resample(X_train, y_train)
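
Class weights are the lightest-touch option of the three: most scikit-learn classifiers accept a class_weight parameter, so no resampling is needed at all. A minimal sketch:

from sklearn.linear_model import LogisticRegression

# 'balanced' weights each class inversely to its frequency
clf = LogisticRegression(class_weight='balanced')
clf.fit(X_train, y_train)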

🧠 Correlation Matrix and Feature Selection (Optional)

Use a heatmap to spot and remove highly correlated features.
This reduces multicollinearity and improves model interpretability.

import seaborn as sns
import matplotlib.pyplot as plt

corr = df.corr(numeric_only=True)  # correlations between numeric columns
sns.heatmap(corr, annot=True, cmap='coolwarm')
plt.title("Feature Correlation Matrix")
plt.show()
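
The heatmap helps you spot the problem pairs; dropping one feature from each pair can also be automated. A minimal sketch using a 0.9 cutoff (the threshold is a judgment call, not a fixed rule):

import numpy as np

# Look only at the upper triangle so each pair is considered once
upper = corr.abs().where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
to_drop = [col for col in upper.columns if (upper[col] > 0.9).any()]
df = df.drop(columns=to_drop)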

🛠️ Preprocessing with Pipelines (Optional)

Instead of manually transforming each step, wrap everything into a scikit-learn pipeline. It’s cleaner, repeatable, and production-ready.

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('model', LogisticRegression())
])

pipeline.fit(X_train, y_train)
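
Once fitted, the pipeline applies scaling and prediction together, so a step can't be forgotten at inference time. For datasets that mix numeric and categorical columns, a ColumnTransformer can route each column type to its own preprocessing. A minimal sketch (the column names are hypothetical placeholders):

from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Numeric columns get scaled; categorical columns get one-hot encoded
preprocess = ColumnTransformer([
    ('num', StandardScaler(), ['height', 'weight']),
    ('cat', OneHotEncoder(drop='first'), ['gender']),
])

pipeline = Pipeline([
    ('prep', preprocess),
    ('model', LogisticRegression())
])

pipeline.fit(X_train, y_train)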

🧪 Python Code Example

Here’s a mini pipeline putting all preprocessing steps together:

import pandas as pd
import numpy as np
from scipy import stats
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Load data
df = pd.read_csv('data.csv')

# Handle missing values
df.fillna(df.mean(numeric_only=True), inplace=True)

# Encode categorical variables
df = pd.get_dummies(df, drop_first=True)

# Remove outliers (example: salary)
df = df[np.abs(stats.zscore(df['salary'])) < 3]

# Split features/target
X = df.drop('target', axis=1)
y = df['target']

# Train-test split (before scaling, so test statistics don't leak into training)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Scale features: fit on the training set only, then apply the same transform to both
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

🧩 Final Thoughts

Without proper preprocessing, your models may:

❌ Overfit
❌ Miss patterns
❌ Give misleading results

Data preprocessing is not optional — it’s essential.

“Better data beats a better algorithm.”
Tilak Savani


📬 Subscribe

If you found this helpful, follow me on Hashnode for more machine learning blogs that simplify complex concepts with math, code, and real-world examples.
