SQL Data Cleaning Project

Introduction
Data cleansing is the process of correcting inaccurate, missing, duplicate, or otherwise erroneous data in a dataset. It is also known as data cleaning or data scrubbing. The process involves identifying data mistakes and fixing them by adding, deleting, or altering values. It is a key step in data preparation, since it maintains the reliability and accuracy of the data used for analysis or machine learning algorithms. In this project, the objective is to clean and prepare raw data for analysis.
Data cleaning is crucial to ensuring that data remains reliable as it becomes increasingly integrated into business operations and decision-making. Apart from organising raw data into a structured format, data cleaning has many benefits: it makes data analytics-ready, enhances operational efficiency, promotes better business decision-making, and provides a cost-effective solution.
Data Cleaning Procedure
Firstly, the dataset was imported into MySQL, and before any transformation is applied, the raw dataset is copied to a staging table where the transformations are carried out, so that the raw data remains intact in case of errors. The dataset is a list of companies from around the world, the industries they belong to, and the layoffs they carried out over a three-year period, 2020 to 2023. Cleaning this dataset will enable us to get the right insights into which industries and companies were most affected by layoffs. We will also get insights into the years in which the most layoffs were carried out.
The dataset consists of companies, the industries they belong to, locations, countries, total laid off, percentage laid off, layoff date, company funding stage, and funds raised in millions. The data cleaning objectives for this dataset are to:
1. Remove duplicate entries.
2. Standardise the data.
3. Deal with null or absent data values.
4. Remove any columns or rows containing unnecessary or blank data.
Creating the Database, Importing the Dataset, and Copying the Dataset to a Staging Table
First, the database that will store the imported data table is created.
CREATE DATABASE World_Layoffs;
Then the raw dataset is imported into the World_Layoffs database using the MySQL data import wizard.
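As an aside, a scripted alternative to the import wizard is a LOAD DATA statement. The sketch below is only illustrative: it assumes the Layoffs table already exists with columns in the same order as the file, that the file is a comma-separated file named layoffs.csv with a header row (the file name and path are hypothetical), and that local_infile is enabled on the server.
# Hypothetical scripted import of the raw CSV into the Layoffs table.
LOAD DATA LOCAL INFILE 'layoffs.csv'
INTO TABLE Layoffs
FIELDS TERMINATED BY ',' OPTIONALLY ENCLOSED BY '"'
LINES TERMINATED BY '\n'
IGNORE 1 ROWS;
# IGNORE 1 ROWS skips the header line of the CSV file.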
Then the staging table is created, and data is copied from the dataset to the staging table for transformation.
CREATE TABLE Layoffs_Staging
LIKE Layoffs;
# Creates a table like the layoffs table with the same number of columns and same names.
# Next we insert all the values from the layoffs table into the staging table.
INSERT INTO layoffs_staging
SELECT *
FROM layoffs;
# We confirm that the data was copied successfully.
SELECT *
FROM layoffs_staging;
Removing Duplicates
To identify the duplicate entries, each row in the dataset is assigned a row number using ROW_NUMBER() with a window partitioned by all the columns in the dataset. That way, any duplicate entry gets a row number greater than 1.
SELECT *, ROW_NUMBER() OVER(PARTITION BY company, location, industry, total_laid_off, percentage_laid_off, `date`, stage, country, funds_raised_millions) AS Row_Num
FROM layoffs_staging
;
# The output of the above query is set up as a CTE so that we can filter it where Row_Num is greater than 1.
WITH Duplicates AS (
SELECT *, ROW_NUMBER() OVER(PARTITION BY company, location, industry, total_laid_off, percentage_laid_off, `date`, stage, country, funds_raised_millions) AS Row_Num
FROM layoffs_staging)
SELECT *
FROM Duplicates
WHERE Row_Num > 1
;
# The duplicate entries have been successfully identified, indicated by a Row_Num greater than 1.
Since CTEs are not real tables and cannot be updated or deleted from in MySQL, a new table with identical columns is created to store the dataset along with the new Row_Num column that identifies the duplicates.
CREATE TABLE Layoffs_Staging2
LIKE Layoffs_Staging
;
ALTER TABLE Layoffs_Staging2
ADD COLUMN Row_Num INT
;
INSERT INTO layoffs_staging2
SELECT *, ROW_NUMBER() OVER(PARTITION BY company, location, industry, total_laid_off, percentage_laid_off, `date`, stage, country, funds_raised_millions) AS Row_Num
FROM layoffs_staging
;
DELETE FROM layoffs_staging2
WHERE Row_Num > 1
;
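As an optional sanity check, a query like the following should now return no rows if the duplicates were removed successfully:
# Confirm that no duplicate entries (Row_Num > 1) remain.
SELECT *
FROM layoffs_staging2
WHERE Row_Num > 1
;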
Standardising the Dataset
UPDATE layoffs_staging2
SET company = TRIM(company)
;
# Removing white spaces from the company column.
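A quick way to preview which rows are affected, before applying the update, is to compare each company value with its trimmed version; this is just an optional check:
# Rows where the company value differs from its trimmed version.
SELECT company, TRIM(company)
FROM layoffs_staging2
WHERE company != TRIM(company)
;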
SELECT DISTINCT industry
FROM layoffs_staging2
ORDER BY 1
;
# We then check the industry column and see that there are three distinct variations of the crypto industry. This needs to be fixed for accurate results during analysis. The same is done for the other columns.
UPDATE layoffs_staging2
SET industry = 'Crypto Currency'
WHERE industry LIKE 'CRYPTO%'
;
SELECT DISTINCT country
FROM layoffs_staging2
ORDER BY 1;
UPDATE layoffs_staging2
SET country = 'United States'
WHERE country LIKE 'United Stat%'
;
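If the inconsistency in the country column is a trailing period (for example 'United States.'), an alternative that avoids hard-coding the replacement value is to strip the trailing character. This is only a sketch and assumes the trailing dot is the only variation:
# Alternative: strip a trailing '.' from every country value.
UPDATE layoffs_staging2
SET country = TRIM(TRAILING '.' FROM country)
;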
SELECT DISTINCT location
FROM layoffs_staging2
ORDER BY 1
;
UPDATE layoffs_staging2
SET location = 'Malmo'
WHERE location LIKE 'Malm%'
;
UPDATE layoffs_staging2
SET location = 'Düsseldorf'
WHERE location LIKE '%sseldorf'
;
# The date column is stored as text in the American date style (MM/DD/YYYY), so it needs to be converted to the standard MySQL date format.
UPDATE layoffs_staging2
SET `date` = STR_TO_DATE(`date`, '%m/%d/%Y')
;
ALTER TABLE layoffs_staging2
MODIFY COLUMN `date` DATE
;
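As an optional check that the conversion worked, the date range can be inspected; it should fall within the 2020 to 2023 window described earlier:
# Check the date range now that the column holds proper DATE values.
SELECT MIN(`date`), MAX(`date`)
FROM layoffs_staging2
;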
Dealing with Null and Absent Values
We have some null and empty values in the industry column, which we can populate based on other rows for the same company where the industry is filled in.
SELECT *
FROM layoffs_staging2
WHERE industry IS NULL
OR industry = ''
;
# We set all the blank values to null to simplify our SQL script.
UPDATE layoffs_staging2
SET industry = NULL
WHERE industry = ''
;
# Now let's identify the rows with null industry values that have matching rows for the same company where the industry is populated.
SELECT t1.company, t1.industry, t1.location, t2.location, t2.company, t2.industry
FROM layoffs_staging2 t1
JOIN layoffs_staging2 t2
ON t1.company = t2.company
WHERE t2.industry IS NULL
AND t1.industry IS NOT NULL;
# Now we update the industry in the affected rows
UPDATE layoffs_staging2 t2
JOIN layoffs_staging2 t1
ON t1.company = t2.company
SET t2.industry = t1.industry
WHERE t2.industry IS NULL
AND t1.industry IS NOT NULL;
# We have now populated the missing industry values using rows from the same companies where the industry was present.
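Note that if a company appears only once in the table, there is no matching row to copy the industry from, so it is worth re-running the null check to see whether any rows remain unpopulated:
# Rows still missing an industry value after the self-join update.
SELECT *
FROM layoffs_staging2
WHERE industry IS NULL
;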
Removing Unnecessary Data
Our key metric for this dataset is total laid off. Since there is no data on total company size, there is no way to derive the total laid off from the percentage laid off when the total laid off value is missing. So entries without a total laid off value are not useful to us in this dataset.
SELECT *
FROM layoffs_staging2
WHERE total_laid_off IS NULL
;
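Before deleting, it can also be useful to see how many of these rows lack the percentage laid off as well, since such rows carry no layoff information at all. This is an optional check in the spirit of the reasoning above:
# Rows with neither a total nor a percentage laid off recorded.
SELECT COUNT(*)
FROM layoffs_staging2
WHERE total_laid_off IS NULL
AND percentage_laid_off IS NULL
;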
DELETE
FROM layoffs_staging2
WHERE total_laid_off IS NULL
;
# Row_Num has served its purpose in identifying duplicates; it is not needed anymore in this dataset.
ALTER TABLE layoffs_staging2
DROP COLUMN Row_Num;
# The final dataset.
SELECT *
FROM layoffs_staging2
ORDER BY total_laid_off DESC
;
Outcome
The SQL data cleaning project resulted in a thoroughly cleaned and structured dataset that is now analysis-ready. Duplicate entries were successfully identified and removed using SQL window functions. Data standardisation ensured consistent naming conventions for company names, industries, countries, and locations, thereby eliminating analytical discrepancies. Text-based date formats were converted to the standard SQL date format, and missing values were populated using relational logic across matching records. Unnecessary and unusable rows, especially those lacking key metrics like total layoffs, were also removed to preserve the integrity of future analyses. This clean dataset provides a solid foundation for generating accurate insights into industry-specific layoffs, time-based patterns, and geographical impacts between 2020 and 2023, ultimately making it highly useful for data analysis, business intelligence reporting, and policy-making.
Conclusions
The problem addressed in this project was the presence of dirty, inconsistent, and incomplete data within a dataset of global company layoffs. Such data, if left uncleaned, would hinder meaningful analysis and decision-making. The benefits of solving this problem are far-reaching: clean data leads to better analytics, clearer insights, more efficient operations, and improved business decisions.
To solve the problem, the dataset was imported into MySQL, duplicated into a staging table to preserve the original data, and cleaned using a series of SQL queries. These queries handled duplication removal using row numbering, standardised categorical data, transformed date formats, replaced or removed null values, and filtered out irrelevant records.
Throughout the project, key knowledge areas were applied: data cleaning principles, SQL scripting, relational database theory, and best practices in data integrity. Skills demonstrated include writing efficient SQL queries, using Common Table Expressions (CTEs), applying window functions, performing data type transformations, and ensuring referential consistency. This reflects a solid understanding of both SQL and professional data cleaning methodologies.
The final outcome is a high-quality, analysis-ready dataset, fit for generating insights into global employment trends. For businesses with similar problems, my recommendation is to always begin by preserving the raw dataset, use staging tables to safely test changes, and apply step-by-step cleaning logic. It's also vital to understand the business context of the data, as this guides what to retain or discard. Investing time in thorough data cleaning saves effort during analysis and enhances the accuracy of all data-driven decisions.