How to Handle Missing Values in Your AI Project (When Less Than 5% of Data is Missing)

Yuvraj Singh
8 min read

I hope you are doing great. Today we will discuss an important topic in machine learning: handling missing values. After going through this blog you will know exactly which technique to use and when to use it. So, without any further delay, let's get started.

Why even handle missing values?

Today, we will examine how to deal with features that have less than 5% missing values in our dataset, but before doing so, it is important to understand why we are even attempting to handle the missing values. For example, would it be harmful to leave them in place, and if so, what kind of harm might they cause?

The answer is that leaving missing values in a dataset can have several consequences, some of which are:

  1. Missing values have a substantial impact on the statistical analysis of the data, which can lead to wrong conclusions.
  2. Many machine learning algorithms cannot handle missing data, so if we try to train them on data containing missing values, we will get an error.
๐Ÿ’ก
Note that some machine learning algorithms are robust to the presence of missing values, such as Naive Bayes, tree-based algorithms and KNN.

Techniques to handle missing values

Which technique to use to handle missing values depends largely on the percentage of missing values in a feature or in the dataset. For example, by running df.isnull().mean()*100 we get the percentage of missing values in every feature, and for all the features that have 5% or fewer missing values we can use either CCA or basic imputation techniques (recommended).
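As a quick illustration (assuming df is a pandas DataFrame you have already loaded), that check could look like this:

# Percentage of missing values in every column
missing_pct = df.isnull().mean() * 100
print(missing_pct)

# Columns with some, but at most 5%, missing values are candidates for CCA or basic imputation
low_missing_cols = missing_pct[(missing_pct > 0) & (missing_pct <= 5)].index.tolist()
print(low_missing_cols)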

๐Ÿ’ก
For all the features that have more than 5% or 10% missing values we should never use the CCA technique; instead we must rely on more advanced imputation techniques, which we will discuss in the upcoming blog.

Technique 1 : Complete case analysis

Complete Case Analysis, commonly known as CCA, is a method for handling missing values in which we simply remove from our dataset every row that has a missing value in any column. There are prerequisites that must be met before using this approach, including the following 👇

  1. The data must be missing completely at random (MCAR); if the data is not missing at random, we won't apply the CCA technique. For example, if the column values are missing only in the first or last n rows, we will not use CCA. Additionally, we can plot the histogram (density function) of a feature for the rows where another feature is missing and for the rows where it is present; if the two distributions look similar, the data is likely MCAR (a quick sketch of this check appears after this list).

  2. We can use CCA if a column has 5% or fewer missing values, but we won't use it if more than 5% of the data is missing.
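A simple, informal way to eyeball the MCAR condition mentioned above is to compare the distribution of one feature for the rows where another feature is missing against the rows where it is present. The column names below are hypothetical and assume a reasonably sized DataFrame df:

import matplotlib.pyplot as plt

# Does feature 'A' look the same whether or not feature 'B' is missing?
mask = df['B'].isnull()
df.loc[mask, 'A'].plot(kind='kde', label="A where B is missing")
df.loc[~mask, 'A'].plot(kind='kde', label="A where B is present")
plt.legend()
plt.show()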

Code implementation

Let's now understand how to deal with missing values by simply removing them from our dataset.

import pandas as pd

# Create a sample dataset with missing values
data = {'A': [1, 2, None, 4, 5],
        'B': [6, None, 8, 9, 10],
        'C': [11, 12, 13, None, 15]}
df = pd.DataFrame(data)

# Print the original dataset
print("Original Dataset:")
print(df)

# Remove rows with missing values
df_removed = df.dropna()

# Print the dataset after removing missing values
print("\nDataset after removing missing values:")
print(df_removed)
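Before settling on CCA, it is also worth checking how much data the drop would actually cost; a quick sanity check on the same df could be:

# Fraction of rows that survive complete case analysis
retained = len(df.dropna()) / len(df) * 100
print(f"Rows retained after dropping missing values: {retained:.1f}%")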

Drawbacks of CCA

CCA has a few downsides that make it less popular than other approaches. Two of these drawbacks are as follows:

  1. Our dataset might lose a significant amount of data if CCA is used.

  2. There must not be any missing values at the time we deploy our model. During the training pipeline we deal with missing values by removing them, but at the prediction stage, if an incoming query point contains missing values, the model will simply drop it and we will get either no response or an error.

Technique 2 : Imputing missing values

As we previously stated, there are two ways to deal with the missing values in our dataset. Now, we'll look at imputation approaches, which simply means that rather than eliminating the missing values from our dataset, we'll fill them with some value.

But before getting too far into the imputation procedures, we must first determine whether the missing data are categorical or numerical, and then we must decide whether to use univariate or multivariate imputation.

Univariate imputation: to impute a missing value in some feature, we use the values of that feature only. Examples: mean imputation, mode imputation, median imputation and arbitrary value imputation.

Multivariate imputation: to impute a missing value in some feature, we use the values of the other features. Examples: MICE algorithm, KNN imputer.
๐Ÿ’ก
In this blog post we will discuss the univariate imputation techniques, because for 5% or fewer missing values univariate techniques work very well and using multivariate techniques would be overkill.

Univariate imputation techniques

Now we will discuss each univariate imputation technique along with Python code. Univariate techniques include mean, median, mode and arbitrary value imputation. First, let's understand what mean, median and mode actually are, in case you don't know.

Mean: a measure of central tendency used to determine the average value of a data distribution.

Median: a measure of central tendency used to determine the value that, when the data is ordered in ascending or descending order, divides the distribution into two equal portions.

Mode: a measure of central tendency used to determine the most frequently occurring value.
๐Ÿ’ก
Mean and median values are used to fill in numerical missing values, whereas the mode is used to fill in categorical missing values.

Now let's talk about how to put this into practice. When filling in missing values with the mean, median or mode, there are 2 things we must keep in mind:

  1. We must first divide the dataset into train, test and validation set.

  2. We must only use the training data to find the mean, median or mode value.

Code implementation

import pandas as pd
import numpy as np

# Create a sample dataset with missing values
data = {'A': [1, 2, np.nan, 4, 5],
        'B': [6, np.nan, 8, 9, 10],
        'C': ['Dog', 'Cat', 'Dog', np.nan, 'Dog']}
df = pd.DataFrame(data)

# Print the original dataset
print("Original Dataset:")
print(df)

# Mean imputation for numerical column A
df['A'] = df['A'].fillna(df['A'].mean())

# Median imputation for numerical column B
df['B'] = df['B'].fillna(df['B'].median())

# Mode imputation for categorical column C
df['C'] = df['C'].fillna(df['C'].mode()[0])

# Print the dataset after imputation
print("\nDataset after imputation:")
print(df)
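The snippet above imputes on the full DataFrame only to keep the example short. In a real project, as noted earlier, the mean, median or mode should be learned from the training split alone. A minimal sketch of that workflow using scikit-learn's SimpleImputer (reusing the toy column names 'A', 'B' and 'C', and assuming a fresh df that still contains its missing values) might look like this:

from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split

# Split first, so the test set never influences the imputation statistics
train_df, test_df = train_test_split(df, test_size=0.2, random_state=42)
train_df, test_df = train_df.copy(), test_df.copy()

# Numerical columns: the imputer learns the mean from the training split only
num_imputer = SimpleImputer(strategy='mean')
train_df[['A', 'B']] = num_imputer.fit_transform(train_df[['A', 'B']])
test_df[['A', 'B']] = num_imputer.transform(test_df[['A', 'B']])

# Categorical column: 'most_frequent' fills with the mode learned from the training split
cat_imputer = SimpleImputer(strategy='most_frequent')
train_df[['C']] = cat_imputer.fit_transform(train_df[['C']])
test_df[['C']] = cat_imputer.transform(test_df[['C']])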

Drawbacks of using mean/median/mode imputation

Mean/median imputation has three main flaws that keep it from being a fully reliable univariate imputation method for numerical data.

  1. Mean and median imputation can also distort the statistical properties of the data, such as the variance and covariance.

  2. Mean and median imputation ignore the relationships between features: every missing entry in a column gets the same value regardless of what the other columns say, which becomes increasingly problematic as the number of missing values grows.

  3. Mean imputation may not be appropriate if there are outliers, as the mean is sensitive to outliers and may not accurately represent the central tendency of the data (a tiny illustration follows below).
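Here is a tiny illustration of that last point: a single extreme value drags the mean far away from where most of the data sits, while the median barely moves.

import numpy as np

values = np.array([10, 12, 11, 13, 12, 500])   # 500 is an outlier
print(np.mean(values))     # 93.0 -- pulled up by the outlier
print(np.median(values))   # 12.0 -- still represents the bulk of the data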

Filling missing values using an arbitrary value

Arbitrary value imputation is a technique in which missing values are replaced with a pre-defined arbitrary value such as -999, -1, or any other value that can be easily identified as imputed. We use this technique when the data is not MCAR, in other words when the data is missing not at random.

๐Ÿ’ก
Arbitrary value means a specific value that is often outside the range of the existing data.

However, the major drawback of this technique is that it can introduce bias into the dataset, as it alters the original distribution and the relationships within the data. This bias can affect subsequent analyses and modeling results.

Code implementation

import pandas as pd
import numpy as np

# Create a sample dataset with missing values
data = {'A': [1, 2, np.nan, 4, 5],
        'B': [6, np.nan, 8, 9, 10],
        'C': [11, 12, np.nan, np.nan, 15]}
df = pd.DataFrame(data)

# Perform arbitrary value imputation
arbitrary_value = -999
df_imputed = df.fillna(arbitrary_value)
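If you prefer to do the same thing inside a scikit-learn pipeline, SimpleImputer also supports a constant fill strategy; an equivalent of the snippet above would be:

from sklearn.impute import SimpleImputer

imputer = SimpleImputer(strategy='constant', fill_value=-999)
df_imputed = pd.DataFrame(imputer.fit_transform(df), columns=df.columns)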

Which technique should we finally apply?

There is no fixed answer to this question, but I can tell you what I normally do. I simply create 2 copies of the dataframe: one in which I remove the missing values and another in which I impute them. After doing so, for all the numerical columns I compare the change in distribution and choose the technique under which the distribution does not change much.

One additional thing we must keep in mind is that if the missing values are categorical, the ratio of each category before and after the removal of missing values should stay roughly the same. A rough sketch of both checks is shown below.
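In the sketch below, 'A' stands in for any numerical column and 'C' for any categorical column in your own DataFrame df:

# Numerical column: compare summary statistics before and after each strategy
print(df['A'].describe())                             # original
print(df.dropna()['A'].describe())                    # after removing missing values
print(df['A'].fillna(df['A'].mean()).describe())      # after mean imputation

# Categorical column: the ratio of each category should stay roughly the same
print(df['C'].value_counts(normalize=True))           # original
print(df.dropna()['C'].value_counts(normalize=True))  # after removing missing values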

Short note

I hope you now have a good understanding of what missing values are, how to handle them using Python code, and when to use which technique. If you liked this blog or have any suggestions, kindly like it or leave a comment below; it would mean a lot to me.

๐Ÿ’ก
Also, I would love to connect with you, so here are my Twitter and LinkedIn.

Written by

Yuvraj Singh

With hands-on experience from my internships at Samsung R&D and Wictronix, where I worked on innovative algorithms and AI solutions, as well as my role as a Microsoft Learn Student Ambassador teaching over 250 students globally, I bring a wealth of practical knowledge to my Hashnode blog. As a three-time award-winning blogger with over 2400 unique readers, my content spans data science, machine learning, and AI, offering detailed tutorials, practical insights, and the latest research. My goal is to share valuable knowledge, drive innovation, and enhance the understanding of complex technical concepts within the data science community.