Day 5: Feature Engineering and Selection


Feature Engineering is the process of transforming raw data into features that improve a model’s performance.
Imagine a scenario where you are training a model to predict house prices. The dataset contains “Year Built,” but this doesn’t tell us much on its own. A better feature would be “House Age” (computed as Current Year - Year Built). This small change can improve model accuracy!
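As a minimal sketch of that idea (the column names and values here are made up for illustration):
import pandas as pd
# Hypothetical housing data with only the raw "YearBuilt" column
df = pd.DataFrame({"YearBuilt": [1995, 2005, 2018]})
# Derive the more informative feature: House Age = Current Year - Year Built
df["HouseAge"] = pd.Timestamp.now().year - df["YearBuilt"]
print(df)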
Feature Engineering comes into the picture whenever we need to optimise our data for better accuracy and results. It has the capability to transform an average model into a great one. So buckle up as we learn about it in depth.
Now a question might arise: “Is raw data enough?” The answer is definitely no.
Raw data can be messy, cluttered, or unstructured, and its condition can make it difficult to extract information. Never forget that our main goal is to extract meaningful features from the raw data.
Meaningful features in turn produce better predictions, helping models learn more effectively. Poor features produce poor models, so keep this in mind.
Types of Feature Engineering
- Extracting Time-based features
Suppose we have a date column in raw date format (YYYY-MM-DD). Instead of using the raw date directly, we can extract more meaningful pieces from it.
For Example:
import pandas as pd
df = pd.DataFrame({"PurchaseDate": ["2023-01-10", "2023-06-15", "2023-11-23"]}) # (Date Format used: YYYY-MM-DD)
df["PurchaseDate"] = pd.to_datetime(df["PurchaseDate"])
df["DayOfWeek"] = df["PurchaseDate"].dt.day_name() # Day of the week (Mon, Tue, Wed,.etc)
df["Month"] = df["PurchaseDate"].dt.month # Month
df["IsWeekend"] = df["DayOfWeek"].isin(["Saturday", "Sunday"]).astype(int) # Is Weekend? 0 for False, 1 for True
print(df)
These derived features can greatly help with predictions, as the model gets far more informative signals to work with.
- Encoding Categorical Variables
Let’s look at the base problem: ML models do not necessarily understand text, so we encode categorical text into something numerical.
Suppose we have a single text column:
City: New York, San Francisco, Chicago
It can be converted into one-hot (dummy) columns:
               City_Chicago  City_New York  City_San Francisco
New York              False           True               False
San Francisco         False          False                True
Chicago                True          False               False
Code for the same:
import pandas as pd
df = pd.DataFrame({"City": ["New York", "San Francisco", "Chicago"]})
df = pd.get_dummies(df, columns=["City"]) # one column per city; pass drop_first=True to drop one redundant column
print(df)
- Scaling Numerical Features
Imagine you have data about cars as follows:
Engine Power (HP): 120, 150, 180, 200
Price ($): 20,000, 30,000, 50,000, 100,000
Our problem: the price range is much larger than the horsepower range, which can mislead models that are sensitive to feature magnitude. This is why we should consider scaling.
The example below uses Min-Max scaling, which maps each value into the range 0 to 1 via (x - min) / (max - min). For instance, a horsepower of 150 becomes (150 - 120) / (200 - 120) = 0.375.
import pandas as pd
from sklearn.preprocessing import MinMaxScaler
# Creating the dataset
data = {
"Car": ["Toyota", "Honda", "Ford", "BMW"],
"Horsepower": [120, 150, 180, 200],
"Price": [20000, 30000, 50000, 100000]
}
# Convert dictionary to DataFrame
df = pd.DataFrame(data)
# Initialize Min-Max Scaler
scaler = MinMaxScaler()
# Apply Min-Max Scaling to numerical columns
df[["Horsepower_Scaled", "Price_Scaled"]] = scaler.fit_transform(df[["Horsepower", "Price"]])
# Display the DataFrame
print(df)
One more example is Standard Scaling (standardization), which transforms a feature so that its mean becomes 0 and its standard deviation becomes 1, via z = (x - mean) / std.
Only subtle changes to the above code are needed:
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
df[["Price_Standardized"]] = scaler.fit_transform(df[["Price"]])
Commonly Used Method for Feature Selection
- Correlation Analysis
The theory says that if two features are highly correlated with one another, one of them may not add much information. For example, “Height (cm)” and “Height (inches)” are redundant; keeping both does not help the model.
import seaborn as sns
import matplotlib.pyplot as plt
import pandas as pd
# Sample Data
df = pd.DataFrame({"Height_cm": [170, 160, 180], "Height_inches": [67, 63, 71]})
# Correlation Matrix
sns.heatmap(df.corr(), annot=True, cmap="coolwarm")
plt.show()
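The heatmap only visualizes the redundancy; to act on it, we can drop one feature from each highly correlated pair. Here is a minimal sketch of that idea (the 0.9 threshold is an arbitrary choice for illustration, not a fixed rule):
import numpy as np
import pandas as pd
df = pd.DataFrame({"Height_cm": [170, 160, 180], "Height_inches": [67, 63, 71]})
# Absolute correlations, keeping only the upper triangle so each pair is checked once
corr = df.corr().abs()
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
# Mark any column whose correlation with an earlier column exceeds the threshold
to_drop = [col for col in upper.columns if (upper[col] > 0.9).any()]
print("Dropping:", to_drop) # Height_inches is essentially a copy of Height_cm
print(df.drop(columns=to_drop))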
Conclusive Notes:
✅ Feature Engineering improves model accuracy.
✅ Encoding categorical data makes it usable for machine learning.
✅ Scaling ensures numerical features are balanced.
✅ Feature selection removes unnecessary variables.
It’s time to take a day off. We will dive deeper into the next day’s topic. Ciao!