Day 1 of 50: Data Preprocessing


Hello again! I’m following Avik’s Day 1 template for my 50 Days of ML Challenge. This module covers the basics of data importing, preprocessing, and feature scaling. For the dataset used in my code, I reused a dataset that I created for one of my courses. The data was downloaded from the US Census Bureau. Link to dataset: https://github.com/mumtazf/50_days_of_ml/blob/main/worcester.csv. Let’s dive right in!
Step 1: Importing packages
import numpy as np
import pandas as pd
Step 2: Importing data
worcester_data = pd.read_csv("worcester.csv")
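Before going further, it helps to glance at what we just loaded. A quick sanity check might look like this (the exact column names are the ones used later in this post):
print(worcester_data.shape)    # number of rows and columns
print(worcester_data.head())   # first five rows
print(worcester_data.dtypes)   # data type of each column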
Step 3: Handling Missing Data
missing = worcester_data.isnull().sum()
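To see only the affected columns, and how large a share of the rows they impact, we can filter the counts:
missing[missing > 0]                                   # columns with at least one missing value
(missing[missing > 0] / len(worcester_data)).round(3)  # fraction of rows missing per column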
We see that Value:Mean_Income_Household and Value:Count_Person have missing values. These columns hold the mean household income and the population of the county, respectively.
Imputing (filling in) the missing values
from sklearn.impute import SimpleImputer
imputer = SimpleImputer(missing_values=np.nan, strategy='mean')
worcester_data[['Value:Mean_Income_Household']] = imputer.fit_transform(worcester_data[['Value:Mean_Income_Household']])
worcester_data[['Value:Count_Person']] = imputer.fit_transform(worcester_data[['Value:Count_Person']])
worcester_data
from sklearn.experimental import enable_iterative_imputer
from sklearn.impute import IterativeImputer
imputer2 = IterativeImputer(max_iter=10, random_state=0)
worcester_data[['Value:Mean_Income_Household']] = imputer2.fit_transform(worcester_data[['Value:Mean_Income_Household']])
worcester_data[['Value:Count_Person']] = imputer2.fit_transform(worcester_data[['Value:Count_Person']])
worcester_data
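One caveat worth noting: when IterativeImputer is handed a single column at a time, as above, it has no other features to model from, so it behaves essentially like mean imputation. A minimal sketch of fitting it on both numeric columns jointly, so each column's gaps are estimated from the other, would look like this:
# Impute both columns together so the imputer can use the relationship between them
cols = ['Value:Mean_Income_Household', 'Value:Count_Person']
joint_imputer = IterativeImputer(max_iter=10, random_state=0)
worcester_data[cols] = joint_imputer.fit_transform(worcester_data[cols])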
Step 4: Encoding Categorical Data
I'm doing this for illustration only; it isn't practical or useful in this scenario given the nature of the data and the feature used for the example.
from sklearn.preprocessing import LabelEncoder
labelenc = LabelEncoder()
worcester_data['military'] = labelenc.fit_transform(worcester_data['military'])
worcester_data
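If the categories had no natural order, one-hot encoding is a common alternative to integer labels, since it avoids implying an ordering between categories. Applied to the original (un-encoded) column, it would look roughly like this:
# One dummy column per category value of 'military'
worcester_encoded = pd.get_dummies(worcester_data, columns=['military'])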
Step 5: Split the data into test and train
from sklearn.model_selection import train_test_split
y = worcester_data['Value:Mean_Income_Household']
X = worcester_data.drop(columns=['Value:Mean_Income_Household'])
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, train_size = 0.8, random_state = 10)
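A quick check that the 80/20 split came out as expected:
# Row counts should be roughly 80% / 20% of the original data
print(X_train.shape, X_test.shape)
print(y_train.shape, y_test.shape)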
Step 6: Feature Scaling
Z-score normalization (standardization) is done to prevent model errors caused by the scale of the input data. Many ML models are sensitive to the scale of their features: if one county has 100 people and another has 240,000, the sheer magnitude of the population column can dominate other features and hurt model accuracy.
z = (data_point - mean_of_column) / standard_deviation_of_column
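As a sanity check, here is that formula written out by hand on the population column, using training-set statistics only (ddof=0 matches the population standard deviation that StandardScaler computes below):
# Manual z-score of the training data's population column
col = X_train['Value:Count_Person']
z_manual = (col - col.mean()) / col.std(ddof=0)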
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
X_train['Value:Count_Person'] = scaler.fit_transform(X_train[['Value:Count_Person']])
# Use transform (not fit_transform) on the test set so it is scaled with the training-set statistics
X_test['Value:Count_Person'] = scaler.transform(X_test[['Value:Count_Person']])
X_test
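If you later need the original counts back (for plotting, say), the fitted scaler can invert the transformation:
# Recover the unscaled population values from the standardized test set
original_counts = scaler.inverse_transform(X_test[['Value:Count_Person']])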
Find the code here along with the dataset: https://github.com/mumtazf/50_days_of_ml/blob/main/day1.ipynb
