Day 1 of 50: Data Preprocessing


Hello again! I’m following Avik’s Day 1 template for my 50 Days of ML Challenge. This module covers the basics of data importing, preprocessing, and feature scaling. For the dataset used in my code, I reused a dataset that I created for one of my courses. The data was downloaded from the US Census Bureau. Link to dataset: https://github.com/mumtazf/50_days_of_ml/blob/main/worcester.csv. Let’s dive right in!
Step 1: Importing packages
import numpy as np
import pandas as pd
Step 2: Importing data
worcester_data = pd.read_csv("worcester.csv")
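Before going further, it helps to glance at what we just loaded. A quick sanity check might look like this (the exact column names are the ones used later in this post):
print(worcester_data.shape)    # number of rows and columns
print(worcester_data.head())   # first five rows
print(worcester_data.dtypes)   # data type of each column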
Step 3: Handling Missing Data
missing = worcester_data.isnull().sum()
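To see only the affected columns, and how large a share of the rows they impact, we can filter the counts:
missing[missing > 0]                                   # columns with at least one missing value
(missing[missing > 0] / len(worcester_data)).round(3)  # fraction of rows missing per column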
We see that Value:Mean_Income_Household and Value:Count_Person have missing values. These columns hold the mean household income and the population of the county, respectively.
Imputing (filling in) the missing values
from sklearn.impute import SimpleImputer
imputer = SimpleImputer(missing_values=np.nan, strategy='mean')
worcester_data[['Value:Mean_Income_Household']] = imputer.fit_transform(worcester_data[['Value:Mean_Income_Household']])
worcester_data[['Value:Count_Person']] = imputer.fit_transform(worcester_data[['Value:Count_Person']])
worcester_data
from sklearn.experimental import enable_iterative_imputer
from sklearn.impute import IterativeImputer
imputer2 = IterativeImputer(max_iter=10, random_state=0)
worcester_data[['Value:Mean_Income_Household']] = imputer2.fit_transform(worcester_data[['Value:Mean_Income_Household']])
worcester_data[['Value:Count_Person']] = imputer2.fit_transform(worcester_data[['Value:Count_Person']])
worcester_data
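One caveat worth noting: when IterativeImputer is handed a single column at a time, as above, it has no other features to model from, so it behaves essentially like mean imputation. A minimal sketch of fitting it on both numeric columns jointly, so each column's gaps are estimated from the other, would look like this:
# Impute both columns together so the imputer can use the relationship between them
cols = ['Value:Mean_Income_Household', 'Value:Count_Person']
joint_imputer = IterativeImputer(max_iter=10, random_state=0)
worcester_data[cols] = joint_imputer.fit_transform(worcester_data[cols])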
Step 4: Encoding Categorical Data
I'm doing this for illustration only; it isn't practical or useful in this scenario given the nature of the data and the feature used for the example.
from sklearn.preprocessing import LabelEncoder
labelenc = LabelEncoder()
worcester_data['military'] = labelenc.fit_transform(worcester_data['military'])
worcester_data
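If the categories had no natural order, one-hot encoding is a common alternative to integer labels, since it avoids implying an ordering between categories. Applied to the original (un-encoded) column, it would look roughly like this:
# One dummy column per category value of 'military'
worcester_encoded = pd.get_dummies(worcester_data, columns=['military'])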
Step 5: Split the data into test and train
from sklearn.model_selection import train_test_split
y = worcester_data['Value:Mean_Income_Household']
X = worcester_data.drop(columns=['Value:Mean_Income_Household'])
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, train_size = 0.8, random_state = 10)
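A quick check that the 80/20 split came out as expected:
# Row counts should be roughly 80% / 20% of the original data
print(X_train.shape, X_test.shape)
print(y_train.shape, y_test.shape)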
Step 6: Feature Scaling
Z-score normalization (standardization) is done to prevent model errors caused by the scale of the input data. Many ML models are sensitive to the scale of their features: if one county has 100 people and another has 240,000, the sheer magnitude of the population column can dominate other features and hurt model accuracy.
z = (data_point - mean_of_column) / standard_deviation_of_column
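As a sanity check, here is that formula written out by hand on the population column, using training-set statistics only (ddof=0 matches the population standard deviation that StandardScaler computes below):
# Manual z-score of the training data's population column
col = X_train['Value:Count_Person']
z_manual = (col - col.mean()) / col.std(ddof=0)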
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
X_train['Value:Count_Person'] = scaler.fit_transform(X_train[['Value:Count_Person']])
# Use transform (not fit_transform) on the test set so it is scaled with the training-set statistics
X_test['Value:Count_Person'] = scaler.transform(X_test[['Value:Count_Person']])
X_test
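If you later need the original counts back (for plotting, say), the fitted scaler can invert the transformation:
# Recover the unscaled population values from the standardized test set
original_counts = scaler.inverse_transform(X_test[['Value:Count_Person']])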
Find the code here along with the dataset: https://github.com/mumtazf/50_days_of_ml/blob/main/day1.ipynb
