Applying StandardScaler to numerical features and OneHotEncoder to categorical features.

When working with real-world datasets, we often have:
Numerical features (e.g., Age, Salary)
Categorical features (e.g., Gender, Country)
To improve model performance, we need to:
✔ Standardize numerical features (for algorithms sensitive to scale, such as Logistic Regression and SVM).
✔ Encode categorical features into a format models can understand (most machine learning models can't process raw text categories).
The given code prepares data by applying Standardization to numerical features and One-Hot Encoding to categorical features.
The goal is to:
1️⃣ Transform numerical features using StandardScaler (scaling data to a mean of 0 and standard deviation of 1).
2️⃣ Convert categorical features using OneHotEncoder (turning each text category into its own binary 0/1 column).
3️⃣ Combine both transformations using ColumnTransformer for a seamless data preprocessing pipeline.
Breaking Down the Code
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.compose import ColumnTransformer
🔹 OneHotEncoder: Converts categorical values into a binary matrix (e.g., "Red", "Blue", "Green" → [1,0,0], [0,1,0], [0,0,1]). Note that scikit-learn sorts the categories, so in practice the columns would be ordered Blue, Green, Red.
🔹 StandardScaler: Scales numerical features by subtracting the mean and dividing by the standard deviation.
🔹 ColumnTransformer: Applies each transformer to its own subset of columns and concatenates the results into a single feature matrix.
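To make the first two building blocks concrete, here is a minimal sketch that runs each one on its own. The toy arrays (colors, ages) are made up for illustration and are not part of the original dataset.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical toy data, one column each.
colors = np.array([["Red"], ["Blue"], ["Green"]])
ages = np.array([[20.0], [30.0], [40.0]])

# OneHotEncoder returns a sparse matrix by default; toarray() makes it dense.
encoder = OneHotEncoder()
encoded = encoder.fit_transform(colors).toarray()

# Categories are sorted, so the column order is Blue, Green, Red.
print(encoder.categories_[0])
print(encoded)

# StandardScaler computes (x - mean) / std per column,
# using the population standard deviation (same default as np.std).
scaler = StandardScaler()
scaled = scaler.fit_transform(ages)
manual = (ages - ages.mean(axis=0)) / ages.std(axis=0)
print(scaled.ravel())
print(np.allclose(scaled, manual))
```

The last line compares the scaler's output with the formula written out by hand, which is a quick way to convince yourself what "standardizing" actually does.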
Creating Transformers
numeric_transformer = StandardScaler()
oh_transformer = OneHotEncoder()
numeric_transformer = StandardScaler(): Standardizes numerical features.
oh_transformer = OneHotEncoder(): Encodes categorical features.
Applying ColumnTransformer
# cat_features / num_features hold the names of the categorical and
# numerical columns (they are not defined in this snippet).
preprocessor = ColumnTransformer(
    [
        ("OneHotEncoder", oh_transformer, cat_features),
        ("StandardScaler", numeric_transformer, num_features),
    ]
)
("OneHotEncoder", oh_transformer, cat_features): Applies OneHotEncoder to categorical columns.
("StandardScaler", numeric_transformer, num_features): Applies StandardScaler to numerical columns.
ColumnTransformer applies each transformation to its own columns in a single step and combines the output.
The preprocessor object is now ready to:
✔ Transform categorical features into numerical vectors
✔ Standardize numerical features for better model performance
✔ Be used in a Machine Learning pipeline for consistent preprocessing
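One possible way to plug the preprocessor into a pipeline is sketched below. The data, the target y, the choice of LogisticRegression, and the handle_unknown="ignore" setting are all assumptions added for this example rather than part of the original code.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical training data and binary target.
train = pd.DataFrame({
    "Age": [25, 32, 47, 51, 38, 29],
    "Salary": [40000, 52000, 81000, 90000, 60000, 45000],
    "Country": ["IN", "US", "IN", "DE", "US", "IN"],
})
y = [0, 0, 1, 1, 1, 0]

preprocessor = ColumnTransformer(
    [
        # handle_unknown="ignore" (an addition for this sketch) encodes
        # categories unseen during fit as all zeros instead of raising.
        ("OneHotEncoder", OneHotEncoder(handle_unknown="ignore"), ["Country"]),
        ("StandardScaler", StandardScaler(), ["Age", "Salary"]),
    ]
)

# The pipeline fits the preprocessor and the model together, so the
# same fitted scaling and encoding are reused at prediction time.
model = Pipeline(
    [
        ("preprocess", preprocessor),
        ("classifier", LogisticRegression()),
    ]
)
model.fit(train, y)

# "FR" never appeared in training; it is encoded as all zeros.
new_rows = pd.DataFrame({"Age": [45], "Salary": [75000], "Country": ["FR"]})
pred = model.predict(new_rows)
print(pred)
```

Keeping preprocessing inside the pipeline also means the scaler's mean and standard deviation are learned only from the training data, which helps avoid leaking information from test rows.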
🚀 Final Takeaways
ColumnTransformer makes it easy to handle mixed data types (categorical + numerical).
OneHotEncoder is useful for encoding categorical variables for models.
StandardScaler ensures numerical features are on the same scale.
Because the preprocessor is a single object, the same fitted transformations can be reused on training data and new data inside an ML pipeline.
Written by
Ashok Vanga
Golang Developer and Blockchain certified professional