Applying StandardScaler to Numerical Features and OneHotEncoder to Categorical Features

Ashok Vanga

When working with real-world datasets, we often have:

  • Numerical features (e.g., Age, Salary)

  • Categorical features (e.g., Gender, Country)

To improve model performance, we need to:

  • Standardize numerical features, because scale-sensitive algorithms such as Logistic Regression and SVM can be dominated by features with large ranges.

  • Encode categorical features as numbers, because most machine learning models can't process raw text categories.

The given code prepares data by applying Standardization to numerical features and One-Hot Encoding to categorical features.

The goal is to:
1️⃣ Transform numerical features using StandardScaler (scaling data to a mean of 0 and standard deviation of 1).
2️⃣ Convert categorical features using OneHotEncoder (turning each text category into its own 0/1 indicator column).
3️⃣ Combine both transformations using ColumnTransformer for a seamless data preprocessing pipeline.


Breaking Down the Code

from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.compose import ColumnTransformer

🔹 OneHotEncoder: Converts categorical values into a binary matrix with one column per category, ordered by sorted category (e.g., "Blue", "Green", "Red" → [1,0,0], [0,1,0], [0,0,1]).
🔹 StandardScaler: Scales each numerical feature by subtracting its mean and dividing by its standard deviation.
🔹 ColumnTransformer: Applies a different transformer to each group of columns and concatenates the results into a single feature matrix.
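To make the two transformations concrete, here is a minimal sketch on toy data. The colors and ages arrays are made up for illustration and are not part of the original code. Note that scikit-learn orders the one-hot columns by sorted category, so the first column here corresponds to "Blue".

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical toy data (one column each), for illustration only.
colors = np.array([["Red"], ["Blue"], ["Green"], ["Blue"]])
ages = np.array([[20.0], [30.0], [40.0], [50.0]])

# One-hot encode: the default output is a sparse matrix, so convert it
# to a dense array to make it easy to print.
encoder = OneHotEncoder()
encoded = encoder.fit_transform(colors).toarray()
print(encoder.categories_[0])  # learned categories, in sorted order
print(encoded)                 # one row per sample, one column per category

# Standardize: subtract the column mean, divide by the column standard deviation.
scaler = StandardScaler()
scaled = scaler.fit_transform(ages)
print(scaled.ravel())
```

After scaling, the column has a mean of 0 and a standard deviation of 1 (StandardScaler uses the population standard deviation, which matches NumPy's default std).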


Creating Transformers

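numeric_transformer = StandardScaler()
oh_transformer = OneHotEncoder()

  • numeric_transformer = StandardScaler(): Standardizes numerical features using the mean and standard deviation it learns when fitted.

  • oh_transformer = OneHotEncoder(): Encodes categorical features. By default it returns a sparse matrix and raises an error if it later sees a category that was not present during fitting.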
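A transformer learns its parameters in fit and applies them in transform. The sketch below illustrates this with StandardScaler; the train and new arrays are hypothetical values chosen for illustration.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical values, for illustration only.
train = np.array([[10.0], [20.0], [30.0]])
new = np.array([[40.0]])

numeric_transformer = StandardScaler()
numeric_transformer.fit(train)        # learns mean and std from train only

print(numeric_transformer.mean_)      # per-column mean learned during fit
print(numeric_transformer.scale_)     # per-column std learned during fit

# New data is scaled with the statistics learned from train,
# not with its own mean and std.
scaled_new = numeric_transformer.transform(new)
print(scaled_new)
```

Fitting on the training data and only transforming later data keeps the same scaling at training and prediction time.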


Applying ColumnTransformer

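# cat_features and num_features are assumed to be lists of column names
# (categorical and numerical, respectively) defined earlier.
preprocessor = ColumnTransformer(
    [
        ("OneHotEncoder", oh_transformer, cat_features),
        ("StandardScaler", numeric_transformer, num_features),
    ]
)

  • "OneHotEncoder", oh_transformer, cat_features: Applies OneHotEncoder to the categorical columns.

  • "StandardScaler", numeric_transformer, num_features: Applies StandardScaler to the numerical columns.

  • ColumnTransformer applies each transformer only to its own columns and concatenates the outputs side by side, in the order the transformers are listed. Columns not assigned to any transformer are dropped by default (remainder="drop").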

The preprocessor object is now ready to:

  • Transform categorical features into numerical vectors

  • Standardize numerical features for better model performance

  • Be used in a Machine Learning pipeline for consistent preprocessing
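As a sketch of the pipeline use mentioned above, the preprocessor can be chained with a model using scikit-learn's Pipeline. Everything specific here is an assumption for illustration: the data, the "Purchased" target column, and the choice of LogisticRegression as the model.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical training data with a made-up binary target.
df = pd.DataFrame({
    "Age": [25, 32, 47, 51, 38, 29, 44, 36],
    "Salary": [40000, 52000, 88000, 61000, 75000, 43000, 90000, 58000],
    "Country": ["IN", "US", "IN", "DE", "US", "IN", "DE", "US"],
    "Purchased": [0, 0, 1, 1, 1, 0, 1, 0],
})
num_features = ["Age", "Salary"]
cat_features = ["Country"]

preprocessor = ColumnTransformer(
    [
        ("OneHotEncoder", OneHotEncoder(), cat_features),
        ("StandardScaler", StandardScaler(), num_features),
    ]
)

# The pipeline fits the preprocessor and the model together, and reuses
# the fitted preprocessing automatically at prediction time.
model = Pipeline(
    [
        ("preprocessor", preprocessor),
        ("classifier", LogisticRegression()),
    ]
)
model.fit(df[num_features + cat_features], df["Purchased"])

# Hypothetical new rows; they go through the same fitted preprocessing.
new_rows = pd.DataFrame({
    "Age": [30, 48],
    "Salary": [45000, 85000],
    "Country": ["US", "IN"],
})
preds = model.predict(new_rows)
print(preds)
```

Wrapping both steps in one object means the raw DataFrame can be passed to fit and predict directly, without calling the preprocessor by hand.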


🚀 Final Takeaways

  • ColumnTransformer makes it easy to handle mixed data types (categorical + numerical).

  • OneHotEncoder is useful for encoding categorical variables for models.

  • StandardScaler ensures numerical features are on the same scale.

  • This approach keeps preprocessing in a single reusable object, which makes it easy to apply the same steps consistently in ML pipelines.
