Art of Feature Engineering

Ayush Singh

Feature engineering is the backbone of machine learning. Transforming raw data into a format that models can understand is crucial for building efficient and robust systems. In this blog, we’ll explore key aspects of feature engineering, including handling missing values, addressing imbalanced datasets, and applying encoding techniques to transform data.

Missing Values

Missing values occur when certain information is not stored in the dataset. Handling them effectively is vital to ensure the model’s performance is not compromised. Missing data mechanisms can be categorized into three types:

1. Missing Completely at Random (MCAR)

In MCAR, there is no systematic reason for why data is missing. The missing values are randomly distributed across the dataset.

Example: A respondent accidentally skips a question while filling out a survey form. The omission has nothing to do with any variable in the dataset, so the missing values are scattered randomly.

2. Missing at Random (MAR)

In MAR, the probability of a value being missing depends on the observed data but not on the missing data itself. In other words, missing data is systematically related to the observed data.

Example: In an income survey, participants in certain age groups might feel uncomfortable sharing their income. Whether income is missing then depends on age, which we do observe, so the missingness is systematically related to the observed data.

3. Missing Not at Random (MNAR)

In MNAR, the probability of a value being missing depends on unobserved data. The missing data is not random and is associated with unobserved factors.

Example: While collecting income data, employees who are dissatisfied with their pay might intentionally avoid reporting it. The missingness depends on their satisfaction level, which is never measured.
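
Whatever the mechanism, the usual remedies are deletion or imputation. Here is a minimal sketch using pandas and scikit-learn, with a made-up toy frame for illustration:

import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# TOY DATA WITH MISSING VALUES (hypothetical columns)
df = pd.DataFrame({
    'age': [25, np.nan, 35, 45, np.nan],
    'income': [50000, 60000, np.nan, 80000, 75000]
})

# OPTION 1: DROP ROWS WITH MISSING VALUES (reasonable only under MCAR)
df_dropped = df.dropna()

# OPTION 2: IMPUTE WITH A COLUMN STATISTIC (mean here; median resists outliers)
imputer = SimpleImputer(strategy='mean')
df_imputed = pd.DataFrame(imputer.fit_transform(df), columns=df.columns)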

Handling Imbalanced Datasets

Imbalanced datasets are common in classification problems where one class significantly outnumbers another. For example, consider a dataset with 1,000 data points:

  • 900 instances are “yes”

  • 100 instances are “no”

Training a model on such data would likely result in a biased model that predicts the majority class. To address this, two approaches are commonly used:

1. Upsampling

Increase the number of instances in the minority class by duplicating samples or generating synthetic ones (e.g., using SMOTE).

2. Downsampling

Reduce the number of instances in the majority class by randomly removing samples to achieve a balanced distribution.

## CREATE A DATAFRAME WITH AN IMBALANCED DATASET
import numpy as np
import pandas as pd

n_class_0, n_class_1 = 900, 100   # majority / minority sample counts

class_0 = pd.DataFrame({
    'feature_1': np.random.normal(loc=0, scale=1, size=n_class_0),
    'feature_2': np.random.normal(loc=0, scale=1, size=n_class_0),
    'target': [0] * n_class_0
})

class_1 = pd.DataFrame({
    'feature_1': np.random.normal(loc=2, scale=1, size=n_class_1),
    'feature_2': np.random.normal(loc=2, scale=1, size=n_class_1),
    'target': [1] * n_class_1
})

df = pd.concat([class_0, class_1]).reset_index(drop=True)

## UPSAMPLING: RESAMPLE THE MINORITY CLASS WITH REPLACEMENT
df_minority = df[df['target'] == 1]
df_majority = df[df['target'] == 0]

from sklearn.utils import resample
df_minority_upsampled = resample(df_minority, replace=True,
                                 n_samples=len(df_majority),
                                 random_state=42)

df_upsampled = pd.concat([df_majority, df_minority_upsampled])

""" FOR CASE OF DOWN SAMPLING use df_majority with replace=False and n_samples of df_minority length """

SMOTE (Synthetic Minority Oversampling Technique)

SMOTE is a powerful upsampling technique. Instead of duplicating existing data, SMOTE generates synthetic instances of the minority class by interpolating between existing minority points and their nearest neighbours. This helps the model learn a more balanced representation of both classes.

# DATA PREP: A 90/10 IMBALANCED CLASSIFICATION DATASET
from sklearn.datasets import make_classification
x, y = make_classification(n_samples=1000, n_features=2, n_redundant=0,
                           n_clusters_per_class=1, weights=[0.90],
                           random_state=12)

df_x = pd.DataFrame(x, columns=['f1', 'f2'])
df_y = pd.DataFrame(y, columns=['target'])
final_df = pd.concat([df_x, df_y], axis=1)
final_df.head()

From this (scatter plot of the original, imbalanced classes):

from imblearn.over_sampling import SMOTE

# GENERATE SYNTHETIC MINORITY POINTS UNTIL THE CLASSES ARE BALANCED
oversample = SMOTE()
x, y = oversample.fit_resample(final_df[['f1', 'f2']], final_df['target'])

df_x = pd.DataFrame(x, columns=['f1', 'f2'])
df_y = pd.DataFrame(y, columns=['target'])
oversample_df = pd.concat([df_x, df_y], axis=1)

To this (scatter plot of the balanced classes after SMOTE):
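
A quick count confirms the effect of the resampling:

# CLASS COUNTS BEFORE AND AFTER SMOTE
print(final_df['target'].value_counts())       # roughly 900 vs. 100
print(oversample_df['target'].value_counts())  # both classes at the majority count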

Data Encoding

Machine learning models primarily work with numerical data, so categorical data must be converted into numerical format. Data encoding methods include:

1. Nominal/One-Hot Encoding

One-hot encoding converts categorical data into binary vectors. It is suitable for nominal data where categories have no inherent order or ranking.

Example: Red: [1, 0, 0], Green: [0, 1, 0], Blue: [0, 0, 1]

from sklearn.preprocessing import OneHotEncoder

df = pd.DataFrame({'cars': ['verna', 'city', 'baleno', 'swift', 'ciaz', 'kushaq']})
encoder = OneHotEncoder()
encoded = encoder.fit_transform(df[['cars']]).toarray()
df_encoded = pd.DataFrame(encoded, columns=encoder.get_feature_names_out())
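
As a side note, pandas can produce the same binary columns in one line, which is handy for quick exploration:

# ONE-HOT ENCODING VIA PANDAS (equivalent result; columns named by category)
pd.get_dummies(df['cars'])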

Pros:

  • Treats every category equally; no artificial order or magnitude is imposed.

Cons:

  • The extra binary columns give the model more parameters and more room to overfit.

  • Results in high dimensionality for datasets with many unique categories.

2. Label and Ordinal Encoding

Label Encoding

Assigns a unique integer to each category. In scikit-learn, LabelEncoder is intended for target labels; applied to nominal input features, the integers imply an order that does not exist (see the con below).

Example: Red: 0, Green: 1, Blue: 2

from sklearn.preprocessing import LabelEncoder

lbl_encoder = LabelEncoder()
lbl_encoder.fit_transform(df['cars'])   # array([5, 2, 0, 4, 1, 3]); integers follow alphabetical order

lbl_encoder.transform(['ciaz'])
lbl_encoder.transform(['verna'])
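
Because the integers follow alphabetical order, the learned mapping is easy to inspect and reverse:

lbl_encoder.classes_                 # array(['baleno', 'ciaz', 'city', 'kushaq', 'swift', 'verna'], dtype=object)
lbl_encoder.inverse_transform([5])   # array(['verna'], dtype=object)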

Pros:

  • Simple to implement and memory-efficient.

  • Does not increase dimensionality.

Cons:

  • Imposes an artificial order that can mislead models like Linear Regression.

Ordinal Encoding

Assigns integer values based on the natural order of categories. Best for ordinal data.

Example: Small: 0, Medium: 1, Large: 2, XL: 3

from sklearn.preprocessing import OrdinalEncoder

df = pd.DataFrame({
    'size': ['small', 'medium', 'large', 'medium', 'small', 'large']
})

encoder = OrdinalEncoder(categories=[['small', 'medium', 'large']])
encoder.fit_transform(df[['size']])   # small -> 0.0, medium -> 1.0, large -> 2.0

Pros:

  • Preserves order.

  • Memory efficient and works well for models requiring ordinal relationships.

Cons:

  • Assumes adjacent categories are evenly spaced, which may not reflect reality.

3. Target Guided Ordinal Encoding

This technique assigns ordinal values to categories based on their relationship with the target variable. Categories are ranked according to how they influence or correlate with the target.

Example Process:

  1. Calculate the mean target value for each category.

  2. Rank categories based on the mean.

  3. Map categories to ordinal values based on their rank.

df = pd.DataFrame({
    'city': ['New York', 'London', 'Paris', 'Tokyo', 'New York', 'Paris'],
    'price': [200, 150, 300, 250, 180, 320]
})

# STEP 1: MEAN TARGET (price) PER CATEGORY
mean_price = df.groupby('city')['price'].mean()

# STEPS 2-3: RANK CITIES BY MEAN PRICE, THEN MAP EACH CITY TO ITS RANK
# -> {'London': 0, 'New York': 1, 'Tokyo': 2, 'Paris': 3}
ordinal_map = {city: rank for rank, city in enumerate(mean_price.sort_values().index)}
df['city_encoded'] = df['city'].map(ordinal_map)

# MAPPING THE RAW MEANS INSTEAD (df['city'].map(mean_price.to_dict())) GIVES
# THE CLOSELY RELATED MEAN/TARGET-ENCODING VARIANT

Pros:

  • Captures the relationship between features and the target variable.

Cons:

  • Prone to overfitting if the same data is used for encoding and training.

Conclusion

Feature engineering is an art that bridges raw data and machine learning models. From handling missing values to balancing datasets and encoding categorical variables, the quality of your feature engineering directly impacts model performance. By mastering these techniques, you can unlock the full potential of your data and build robust, efficient machine-learning systems.

Resources and Connect with Me

If you’d like to dive deeper into the code examples mentioned in this blog, you can check out the full code on my GitHub repository: CODE

I’d love to connect with you and hear your thoughts! Feel free to follow me on Hashnode for more data engineering and machine learning blogs.

Socials: LinkedIn | Twitter | GitHub | Hashnode
