When working with real-world datasets, we often have:

Numerical features (e.g., Age, Salary)
Categorical features (e.g., Gender, Country)

To improve model performance, we need to:
✔ Standardize numerical features (scale-sensitive algorithms such as Logistic Regression and SVM tend to benefit from this).
✔ Encode categorical features as numbers (most scikit-learn estimators expect numeric input and can't work with raw text categories).

The given code prepares data by applying Standardization to numerical features and One-Hot Encoding to categorical features.

The goal is to:
1️⃣ Transform numerical features using StandardScaler (scaling data to a mean of 0 and standard deviation of 1).
2️⃣ Convert categorical features using OneHotEncoder (turning each text category into its own 0/1 indicator column).
3️⃣ Combine both transformations using ColumnTransformer for a seamless data preprocessing pipeline.

Breaking Down the Code

from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.compose import ColumnTransformer

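🔹 OneHotEncoder: Converts categorical values into binary indicator columns, one per category. For "Red", "Blue", "Green" you get three 0/1 columns with exactly one 1 per row. Categories are sorted by default, so the column order is Blue, Green, Red, and the result is a sparse matrix unless you request dense output.
🔹 StandardScaler: Scales each numerical feature by subtracting its mean and dividing by its standard deviation.
🔹 ColumnTransformer: Applies a different transformer to each group of columns and concatenates the results into a single feature matrix.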
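To make the two transformers concrete, here is a minimal sketch that runs each one on its own. The toy arrays `colors` and `ages` are invented for illustration and are not part of the original code.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Toy data, invented for illustration only.
colors = np.array([["Red"], ["Blue"], ["Green"], ["Blue"]])
ages = np.array([[20.0], [30.0], [40.0], [50.0]])

encoder = OneHotEncoder()
# fit_transform returns a SciPy sparse matrix by default; toarray() makes it dense.
encoded = encoder.fit_transform(colors).toarray()

# Categories are sorted, so the output columns are Blue, Green, Red.
categories = encoder.categories_[0].tolist()

scaler = StandardScaler()
# Each column is centered on its mean and divided by its standard deviation.
scaled = scaler.fit_transform(ages)

print(categories)
print(encoded)
print(scaled.ravel())
```

Note that "Red" ends up as [0, 0, 1] here rather than [1, 0, 0], because the encoder orders the categories alphabetically.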

Creating Transformers

numeric_transformer = StandardScaler()
oh_transformer = OneHotEncoder()

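numeric_transformer = StandardScaler(): Creates an unfitted scaler. It learns each column's mean and standard deviation when fit is called.
oh_transformer = OneHotEncoder(): Creates an unfitted encoder. With default settings it raises an error at transform time if it meets a category it did not see during fitting (pass handle_unknown="ignore" to change this).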
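Creating a transformer does not compute anything yet. The sketch below, using hypothetical salary values made up for this example, shows roughly how the scaler learns its statistics in fit and reuses them on new rows in transform.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

numeric_transformer = StandardScaler()

# Hypothetical training and new-data salaries, invented for illustration.
train_salary = np.array([[40000.0], [50000.0], [60000.0]])
new_salary = np.array([[50000.0], [70000.0]])

# fit() learns the per-column mean and standard deviation from the training data.
numeric_transformer.fit(train_salary)
print(numeric_transformer.mean_)   # learned mean
print(numeric_transformer.scale_)  # learned standard deviation

# transform() reuses those learned statistics on unseen rows.
new_scaled = numeric_transformer.transform(new_salary)
print(new_scaled.ravel())
```

A new salary equal to the training mean maps to 0, and values above the mean map to positive numbers measured in training standard deviations.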

Applying ColumnTransformer

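# cat_features and num_features are the column selections (for example,
# lists of column names) and must be defined before this point.
preprocessor = ColumnTransformer(
    [
        # Each entry is (name, transformer, columns)
        ("OneHotEncoder", oh_transformer, cat_features),
        ("StandardScaler", numeric_transformer, num_features),
    ]
)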

"OneHotEncoder", oh_transformer, cat_features: Applies OneHotEncoder to categorical columns.
"StandardScaler", numeric_transformer, num_features: Applies StandardScaler to numerical columns.
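ColumnTransformer runs each transformer on its own columns and concatenates the outputs side by side, in the order the transformers are listed. Columns that are not assigned to any transformer are dropped by default (remainder="drop").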

The preprocessor object is now ready to:
✔ Transform categorical features into numerical vectors
✔ Standardize numerical features for better model performance
✔ Be used in a Machine Learning pipeline for consistent preprocessing
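One possible way to use the preprocessor in a pipeline is sketched below. The toy data, the target `y`, and the choice of LogisticRegression as the final step are assumptions for illustration, not something the original code prescribes.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Toy data and target; all names and values are invented for illustration.
df = pd.DataFrame({
    "Age": [25, 32, 47, 51, 38, 29],
    "Salary": [40000, 52000, 88000, 61000, 75000, 43000],
    "Country": ["IN", "US", "IN", "UK", "US", "UK"],
})
y = [0, 0, 1, 1, 1, 0]

preprocessor = ColumnTransformer(
    [
        ("OneHotEncoder", OneHotEncoder(), ["Country"]),
        ("StandardScaler", StandardScaler(), ["Age", "Salary"]),
    ]
)

# The pipeline fits the preprocessor and the model together,
# so predict() applies the same preprocessing that was learned in fit().
model = Pipeline([
    ("preprocessor", preprocessor),
    ("classifier", LogisticRegression()),
])
model.fit(df, y)

predictions = model.predict(df)
print(predictions)
```

Bundling the steps this way means raw DataFrames go in at both training and prediction time, which helps avoid applying different preprocessing to each.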

🚀 Final Takeaways

ColumnTransformer makes it easy to handle mixed data types (categorical + numerical).
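OneHotEncoder turns each category into its own 0/1 column, so models that need numeric input can use categorical variables.
StandardScaler puts numerical features on a comparable scale (mean 0, standard deviation 1).
Because all preprocessing lives in one object, the same fitted transformations can be reused on training data and new data inside an ML pipeline.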


Ashok Vanga
