Day 5: Feature Engineering and Selection

Saket Khopkar

Feature Engineering is the process of transforming raw data into features that improve a model’s performance.

Imagine a scenario where you are training a model to predict house prices. The dataset contains “Year Built,” but that raw value doesn’t tell us much. A better feature could be “House Age” (formula for the same: Current Year - Year Built). This small change can improve model accuracy!
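The derived feature takes only a couple of lines of pandas; the column names and sample years below are illustrative, not from a real dataset:

```python
import pandas as pd
from datetime import datetime

# Hypothetical housing data with only the raw "YearBuilt" column
df = pd.DataFrame({"YearBuilt": [1995, 2004, 2016]})

# Derive the more informative "HouseAge" feature
current_year = datetime.now().year
df["HouseAge"] = current_year - df["YearBuilt"]
print(df)
```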

Feature Engineering comes into the picture when we need to optimise our data for better accuracy and results. It can transform an average model into a great one. So buckle up as we learn about it in depth.


Now a question might arise: “Is raw data enough?” The answer is definitely no.

Raw data can be very messy, cluttered, or unstructured. It can be difficult to extract information from data in this condition. Never forget that our main goal is to extract meaningful features from the raw data.

Meaningful features in turn produce better predictions, helping models learn better and better. Poor features produce poor models, so keep this in mind.


Types of Feature Engineering

  • Extracting Time-based features

Suppose we have a column named “Date” in raw (YYYY-MM-DD) format. Instead of using the raw date, we can extract more meaningful insights from it for better understanding.

For Example:

import pandas as pd  

df = pd.DataFrame({"PurchaseDate": ["2023-01-10", "2023-06-15", "2023-11-23"]})  # (Date Format used: YYYY-MM-DD)
df["PurchaseDate"] = pd.to_datetime(df["PurchaseDate"])  
df["DayOfWeek"] = df["PurchaseDate"].dt.day_name()    # Day of the week (Monday, Tuesday, etc.)
df["Month"] = df["PurchaseDate"].dt.month     # Month
df["IsWeekend"] = df["DayOfWeek"].isin(["Saturday", "Sunday"]).astype(int)  # Is Weekend? 0 for False, 1 for True

print(df)

This will greatly help with predictions, as we can make better decisions.

  • Encoding Categorical Variables

Let’s look at our base problem: ML models don’t necessarily understand text. So we encode it into something more understandable.

City
New York
San Francisco
Chicago

Can be converted into:

   City_Chicago  City_New York  City_San Francisco
0         False           True               False
1         False          False                True
2          True          False               False

Code for the same:

import pandas as pd

df = pd.DataFrame({"City": ["New York", "San Francisco", "Chicago"]})
df = pd.get_dummies(df, columns=["City"])  # One column per city; pass drop_first=True to drop one redundant column
print(df)
  • Scaling Numerical Features

Imagine you have data pertaining to cars as follows:

Engine Power (HP): 120, 150, 180, 200
Price ($): 20,000, 30,000, 50,000, 100,000

Our problem: the price range is much larger than the horsepower range, which can mislead models. This is why we should consider scaling.

This example shows Min-Max Scaling, which scales values to the range 0 to 1.

import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Creating the dataset
data = {
    "Car": ["Toyota", "Honda", "Ford", "BMW"],
    "Horsepower": [120, 150, 180, 200],
    "Price": [20000, 30000, 50000, 100000]
}

# Convert dictionary to DataFrame
df = pd.DataFrame(data)

# Initialize Min-Max Scaler
scaler = MinMaxScaler()

# Apply Min-Max Scaling to numerical columns
df[["Horsepower_Scaled", "Price_Scaled"]] = scaler.fit_transform(df[["Horsepower", "Price"]])

# Display the DataFrame
print(df)

One more example is Standard Scaling (standardization), which transforms values so they have a mean of 0 and a standard deviation of 1.

Only make subtle changes to the above code, as below:

from sklearn.preprocessing import StandardScaler  
scaler = StandardScaler()  
df[["Price_Standardized"]] = scaler.fit_transform(df[["Price"]])
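As a quick sanity check (this sketch reuses the same Price column on its own), the standardized values should come out with mean 0 and standard deviation 1:

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

df = pd.DataFrame({"Price": [20000, 30000, 50000, 100000]})
scaler = StandardScaler()
df["Price_Standardized"] = scaler.fit_transform(df[["Price"]])

# StandardScaler uses the population standard deviation (ddof=0)
print(df["Price_Standardized"].mean())       # approximately 0
print(df["Price_Standardized"].std(ddof=0))  # approximately 1
```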


Commonly used Method for Feature Engineering

  • Correlation Analysis

The theory says that if two features are highly correlated with one another, they may not add much information. For example: “Height (cm)” and “Height (inches)” are redundant, and keeping both does not help much.

import seaborn as sns  
import matplotlib.pyplot as plt  
import pandas as pd

# Sample Data  
df = pd.DataFrame({"Height_cm": [170, 160, 180], "Height_inches": [67, 63, 71]})  

# Correlation Matrix  
sns.heatmap(df.corr(), annot=True, cmap="coolwarm")  
plt.show()
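One common way to act on the correlation matrix is to drop one feature from any pair whose absolute correlation exceeds a chosen threshold. The sketch below illustrates this; the 0.95 cutoff and the extra Weight_kg column are assumptions for the example, not fixed rules:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "Height_cm": [170, 160, 180, 165, 175],
    "Height_inches": [67, 63, 71, 65, 69],  # redundant: just Height_cm converted
    "Weight_kg": [70, 55, 85, 90, 60],      # illustrative, mostly independent
})

threshold = 0.95  # chosen cutoff; tune for your dataset
corr = df.corr().abs()

# Keep only the upper triangle so each feature pair is examined once
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))

# Drop any column highly correlated with an earlier one
to_drop = [col for col in upper.columns if (upper[col] > threshold).any()]
df_reduced = df.drop(columns=to_drop)

print(to_drop)                      # ['Height_inches']
print(df_reduced.columns.tolist())  # ['Height_cm', 'Weight_kg']
```

Dropping only one feature per correlated pair keeps the information while removing the redundancy.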


Conclusive Notes:

  • Feature Engineering improves model accuracy.
  • Encoding categorical data makes it usable for machine learning.
  • Scaling ensures numerical features are balanced.
  • Feature selection removes unnecessary variables.

It’s time to take a day off. We will dive deeper into the next day’s topic. Ciao!
