Day 3 of 50 Days of ML: Data Cleaning Overview


Often, the datasets we use to train ML models are messy: full of NULLs, or with an absurd range of values in a particular column. The dataset needs to be preprocessed and cleaned to prevent this noise from affecting the accuracy of the model we want to train.
There are various ways to approach cleaning and preprocessing the dataset. This list is by no means exhaustive, but it tries to cover some of the most frequently used concepts in feature engineering.
Before I dive in, here’s a quick note on the vocabulary used in this article for the sake of clarity. I refer to datasets mostly in a tabular form (think of a SQL table). Columns refer to the features within the dataset, and rows are the individual data points.
Checking for missing values within columns
It’s important to identify potential issues with missing data. For scenarios like a column that is mostly “opt-in”, say a non-required question on a form, you should think through what the lack of this data means.
Counting the null values in each column and identifying which columns have the most nulls gives you a clearer understanding of which features could undermine the model's performance.
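As a rough sketch in pandas (assuming the dataset has already been loaded into a DataFrame called df; the columns here are made up for illustration):

```python
import pandas as pd

# Hypothetical dataset for illustration
df = pd.DataFrame({
    "age": [25, None, 37, 42, None],
    "income": [52000, 61000, None, 58000, 49000],
    "optional_comment": [None, "great", None, None, "ok"],
})

# Count nulls in each column, worst offenders first
null_counts = df.isna().sum().sort_values(ascending=False)
print(null_counts)

# The same information as a percentage of all rows
print((df.isna().mean() * 100).round(1))
```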
Removing duplicates and understanding outliers
Duplicates can often lead to inaccuracies, since the model may learn that a repeated value matters more than it actually does. Outliers can skew the data distribution and have unintended effects.
Remember that not all outliers are noise. Sometimes outliers tell you important details about the data you’re looking at. Try to understand outliers by setting a specific threshold for classifying a data point as an outlier, then checking whether those points follow a pattern.
When it comes to duplicates, it is conventional practice to remove them. However, if you want to be certain about their impact, try training a model with duplicates and one without duplicates. Compare the two to understand the impact of duplicates on your specific ML model.
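Here is a minimal sketch of both steps with pandas, using a made-up DataFrame and the common 1.5 × IQR rule as the outlier threshold (the threshold itself is an assumption you should adjust for your data):

```python
import pandas as pd

# Hypothetical data with one duplicate row and one extreme income
df = pd.DataFrame({
    "income": [52000, 52000, 61000, 58000, 49000, 250000],
    "height_cm": [170, 170, 165, 180, 175, 172],
})

# Remove exact duplicate rows, keeping the first occurrence
deduped = df.drop_duplicates()

# Flag outliers in a numeric column with the 1.5 * IQR rule
q1, q3 = deduped["income"].quantile([0.25, 0.75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = deduped[(deduped["income"] < lower) | (deduped["income"] > upper)]

# Inspect the flagged rows before dropping them --
# they may reveal a real pattern rather than noise
print(outliers)
```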
Imputing Missing Data
Missing data impacts model performance, and there are many workarounds for handling missing values. The most popular one is imputing values based on the data in the same column. Some common approaches (a short code sketch follows this list):
Replacing the missing value with the mean, median, or mode of the column.
K-nearest neighbors (KNN) imputation finds the k rows most similar to the row with the missing value and then imputes the value based on those rows.
Multiple Imputation by Chained Equations (MICE) uses a Bayesian approach that creates multiple versions of the dataset with different imputed values and then combines them. This explanation by Isabella Ghement is a great resource to learn more about MICE: https://stats.stackexchange.com/a/421586
Regression models can be used to predict missing values based on other features.
Categorical variables can be imputed using techniques like mode imputation or random forest imputation.
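As a small sketch of the first two ideas with scikit-learn (the columns are hypothetical; scikit-learn also ships an experimental IterativeImputer inspired by MICE, though it produces a single imputed dataset rather than several):

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer, KNNImputer

# Hypothetical dataset with missing values in both columns
df = pd.DataFrame({
    "age": [25, np.nan, 37, 42, np.nan, 30],
    "income": [52000, 61000, np.nan, 58000, 49000, 55000],
})

# Mean imputation: replace each missing value with the column mean
mean_imputer = SimpleImputer(strategy="mean")
df_mean = pd.DataFrame(mean_imputer.fit_transform(df), columns=df.columns)

# KNN imputation: fill each missing value from the k most similar rows
knn_imputer = KNNImputer(n_neighbors=2)
df_knn = pd.DataFrame(knn_imputer.fit_transform(df), columns=df.columns)

print(df_mean)
print(df_knn)
```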
Data Distribution
The distribution of data and the range of values within columns also affect model performance. For example, consider a row where income = $100,000 and height = 170 cm. While these two columns measure different things, their numerical scales and variance affect how the model learns.
To “treat” such an effect, data scientists often use methods like standardization or log scaling.
Standardization (z-score normalization): the process of rescaling the data so that it has a mean of 0 and a standard deviation of 1.
$$z = \frac{x - \mu}{\sigma}$$
where $\mu$ is the mean of the column and $\sigma$ is its standard deviation.
Log scaling is done to reduce the skewness of the data.
$$z = \log(x)$$
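A short sketch of both transforms with scikit-learn and NumPy (the columns are made up; np.log1p, i.e. log(1 + x), is used instead of a plain log so that zeros don't blow up):

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical columns on very different scales
df = pd.DataFrame({
    "income": [45000, 52000, 61000, 250000, 58000],
    "height_cm": [160, 170, 165, 180, 175],
})

# Standardization: each column ends up with mean ~0 and std ~1
scaler = StandardScaler()
standardized = pd.DataFrame(scaler.fit_transform(df), columns=df.columns)

# Log scaling: compress large values to reduce right skew
log_scaled = np.log1p(df["income"])

print(standardized)
print(log_scaled)
```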
I found these techniques helpful and super fun to learn. Hope you find them useful and are able to apply them to your next ML project! :)