Encode with Precision, Achieve Machine Learning Vision
Akshay Kumar
3 min read
Encoders are used to convert categorical data into a format that machine learning algorithms can understand. The choice of encoder depends on the nature of the categorical data and the specific requirements of the machine learning model. Here are some common types of encoders and their use cases, each illustrated with a short code sketch after the list:
- One-Hot Encoding: This method is used when the categorical variable is nominal (i.e., the categories do not have a natural order). It creates a binary column for each category, which can lead to a high-dimensional dataset if there are many unique categories.
**Example**: Encoding a "color" feature with categories "red", "blue", and "green" into three binary columns.
- Label Encoding: This method assigns a unique integer to each category. It works much like ordinal encoding, but it is meant for a categorical target variable (y) rather than for features; applied to features, it introduces an arbitrary ordering of the categories that might not be meaningful.
**Example**: Encoding a "size" feature with categories "small", "medium", and "large" into the integers 0, 1, and 2.
- Ordinal Encoding: Similar to label encoding, but specifically designed for ordinal feature variables. It assigns integers to categories according to their order, which is useful when the categories have a natural ranking.
**Example**: Encoding a "priority" feature with categories "low", "medium", and "high" into the integers 0, 1, and 2.
- Binary Encoding: This method first converts the category into numerical labels using label encoding, then converts these integers into binary code. It's useful for nominal variables with many categories, as it reduces the dimensionality of the data.
**Example**: Encoding a "category" feature with 100 unique categories into binary code, reducing the dimensionality from 100 to 7 (since 2^7 = 128, which is more than enough to represent 100 categories).
- Feature Hashing: This method applies a hash function to the features and uses the hash values as indices directly. It's handy for categorical features with lots of unique values. However, it can lead to collisions where different categories map to the same hash value, potentially losing information.
**Example**: Encoding a "user_id" feature with millions of unique IDs into a fixed-size vector using a hash function.
- Target Encoding: This method involves replacing a categorical value with the mean of the target variable. It's helpful for categorical features with many different values, but it can cause overfitting if not used carefully.
**Example**: Encoding a "region" feature in a sales dataset with many unique regions by replacing each region with the average sales value for that region.
Putting several of these encoders together in a single scikit-learn pipeline:

```python
import pandas as pd
import sklearn
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder
from category_encoders import BinaryEncoder, TargetEncoder, HashingEncoder

# Sample data -- the values are not meant to be accurate; the main thing to
# focus on is what kind of categories each column holds
data = pd.DataFrame({
    'user_id': ['user123', 'user456', 'user789', 'user001'],
    'gender': ['M', 'F', 'M', 'F'],
    'region': ['North', 'South', 'East', 'West'],
    'age': [25, 30, 35, 40],
    'color': ['red', 'blue', 'green', 'red'],
    'shape': ['circle', 'square', 'triangle', 'circle'],
})
# Toy binary target (made up) so that TargetEncoder has something to average
y = pd.Series([0, 1, 0, 1], name='target')

# Define the column transformer.
# Note: LabelEncoder cannot be used here -- it is meant for the target y,
# not for feature columns, so OrdinalEncoder handles 'shape' instead.
preprocessor = ColumnTransformer(transformers=[
    ('binary', BinaryEncoder(), ['gender']),
    ('onehot', OneHotEncoder(sparse_output=False, drop='first'), ['region']),
    # We have to manually define the categories in a list; the order decides
    # which category gets the bigger number, e.g.
    # OrdinalEncoder(categories=[['Mild', 'Strong']]) for a 'coffee' column
    # gives the bigger number to 'Strong' and the smaller one to 'Mild'.
    ('ordinal', OrdinalEncoder(categories=[['circle', 'square', 'triangle']]), ['shape']),
    ('target', TargetEncoder(), ['color']),
    ('hashing', HashingEncoder(n_components=8), ['user_id']),
], remainder='passthrough')  # 'age' is already numeric, so pass it through

# This line makes every transform return a DataFrame instead of a NumPy
# array (requires scikit-learn >= 1.2)
sklearn.set_config(transform_output="pandas")

# Split data on a real dataset (skipped here: the toy frame has only 4 rows)
# X_train, X_test, y_train, y_test = train_test_split(data, y, test_size=0.2)

# Define the pipeline
pipeline = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('classifier', LogisticRegression()),
])
# Fit the pipeline to the data
# X_train and y_train should be your training data and labels
# pipeline.fit(X_train, y_train)

# Apply the preprocessing; y is passed so TargetEncoder can compute its means
data_encoded_df = preprocessor.fit_transform(data, y)
print(data_encoded_df)
```