SQL Data Cleaning Project

Introduction

Data cleansing is the process of correcting inaccurate, missing, duplicate, or otherwise erroneous data in a dataset. It is also known as data cleaning or data scrubbing. The process involves identifying data mistakes and fixing them by adding, deleting, or altering values. It is a key step in data preparation, since it maintains the reliability and accuracy of the data used for analysis or machine learning algorithms. In this project, the objective is to clean and prepare raw data for analysis.

Data cleaning is crucial in ensuring that data is reliable as it gets increasingly integrated into business operations and decision-making. Apart from organising raw data into a structured format, data cleaning has many benefits: it makes data analytics-ready, enhances operational efficiency, promotes better business decision-making, and provides a cost-effective solution.

Data Cleaning Procedure

Firstly, the dataset was imported into MySQL. Before any transformation is applied, the raw dataset is copied to a staging table where the transformations are carried out, so that the raw dataset remains intact in case of errors. The dataset is a list of companies from around the world, the industries they belong to, and the layoffs they carried out over a three-year period, 2020 to 2023. Cleaning this dataset will enable us to get the right insights into which industries and companies were most affected by layoffs. It will also show the years in which the most layoffs were carried out.

The dataset consists of companies, the industries they belong to, locations, countries, total laid off, percentage laid off, layoff date, company funding stage, and funds raised in millions; a sketch of the assumed table structure appears after the list below. The data cleaning objectives for this dataset are to:

1. Remove duplicate entries.

2. Standardise the data.

3. Deal with null or absent data values.

4. Remove any unnecessary or blank columns and rows.
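
For reference, the sketch below shows the assumed shape of the raw Layoffs table after import. The column names are the ones used in the queries that follow; the data types are assumptions, since the import wizard infers them from the file.

# Sketch only; the table itself is created by the import wizard, and the types shown are assumed.
CREATE TABLE Layoffs (
company TEXT,
location TEXT,
industry TEXT,
total_laid_off INT,
percentage_laid_off TEXT,
`date` TEXT,
stage TEXT,
country TEXT,
funds_raised_millions INT
);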

Creating the Database, Importing the Dataset, and Copying the Dataset to a Staging Table

Firstly, the database to store the imported data table is created.

CREATE DATABASE World_Layoffs;

Then the raw dataset is imported into the World_Layoffs database using the MySQL data import wizard.

Then the staging table is created, and data is copied from the dataset to the staging table for transformation.

CREATE TABLE Layoffs_Staging

LIKE Layoffs;

# Creates a table like the layoffs table, with the same columns and column names.

# Next we insert all the values from the layoffs table into the staging table.

INSERT INTO layoffs_staging

SELECT *

FROM layoffs;

# We confirm that the data was copied successfully.

SELECT *

FROM layoffs_staging;

Removing Duplicates

To identify the duplicate entries, each row is assigned a row number using ROW_NUMBER() partitioned by every column in the dataset. That way, any duplicate entry gets a row number greater than 1.

SELECT *, ROW_NUMBER() OVER(PARTITION BY company, location, industry, total_laid_off, percentage_laid_off, `date`, stage, country, funds_raised_millions) AS Row_Num

FROM layoffs_staging

;

# The output of the above query is wrapped in a CTE so we can filter it where Row_Num is greater than 1.

WITH Duplicates AS (

SELECT *, ROW_NUMBER() OVER(PARTITION BY company, location, industry, total_laid_off, percentage_laid_off, `date`, stage, country, funds_raised_millions) AS Row_Num

FROM layoffs_staging)

SELECT *

FROM Duplicates

WHERE Row_Num > 1

;

# The duplicate entries have been successfully identified, indicated by a Row_Num greater than 1.
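
It can also help to sanity-check one of the flagged companies by pulling all of its rows; the company name below is only a placeholder for any company returned by the query above.

SELECT *
FROM layoffs_staging
WHERE company = 'Example Company'; # placeholder company name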

Since CTEs are not real tables and cannot be updated or deleted from in MySQL, a new table with identical columns is created to store the dataset, along with the new Row_Num column that identifies the duplicates.

CREATE TABLE Layoffs_Staging2

LIKE Layoffs_Staging

;

ALTER TABLE Layoffs_Staging2

ADD COLUMN Row_Num INT

;

INSERT INTO layoffs_staging2

SELECT *, ROW_NUMBER() OVER(PARTITION BY company, location, industry, total_laid_off, percentage_laid_off, `date`, stage, country, funds_raised_millions) AS Row_Num

FROM layoffs_staging

;

DELETE FROM layoffs_staging2

WHERE Row_Num > 1

;
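
Re-running the duplicate filter against the new table confirms that no rows with Row_Num greater than 1 remain after the delete.

SELECT *
FROM layoffs_staging2
WHERE Row_Num > 1;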

Standardising the Dataset

UPDATE layoffs_staging2

SET company = TRIM(company)

;

# Removing white spaces from the company column
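
To see which rows are actually affected (ideally before running the update above), a simple comparison like this sketch works:

SELECT company, TRIM(company)
FROM layoffs_staging2
WHERE company <> TRIM(company);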

SELECT DISTINCT industry

FROM layoffs_staging2

ORDER BY 1

;

# We then check the industry column and see that there are three distinct variations of the crypto industry. This needs to be fixed for accurate results during analysis. The same check is done for the other columns.

UPDATE layoffs_staging2

SET Industry = 'Crypto Currency'

WHERE industry LIKE 'CRYPTO%'

;

SELECT DISTINCT country

FROM layoffs_staging2

ORDER BY 1;

UPDATE layoffs_staging2

SET country = 'United States'

WHERE country LIKE 'United Stat%'

;

SELECT DISTINCT location

FROM layoffs_staging2

ORDER BY 1

;

UPDATE layoffs_staging2

SET location = 'Malmo'

WHERE location LIKE 'Malm%'

;

UPDATE layoffs_staging2

SET location = 'Düsseldorf'

WHERE location LIKE '%sseldorf'

;

# The date column is stored as text in the American date style, so it needs to be converted to the MySQL standard date format.
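
To preview the conversion before applying it, a sketch like the following shows the original text alongside the parsed date:

SELECT `date`, STR_TO_DATE(`date`, '%m/%d/%Y')
FROM layoffs_staging2;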

UPDATE layoffs_staging2

SET `date` = STR_TO_DATE(`date`, '%m/%d/%Y')

;

ALTER TABLE layoffs_staging2

MODIFY COLUMN `date` DATE

;
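
To confirm that the column type change took effect, the table definition can be inspected:

DESCRIBE layoffs_staging2;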

Dealing with Null and Absent Values

We have some null and empty values in the industry column, which we can populate from rows of the same companies where the industry is already filled in.

SELECT *

FROM layoffs_staging2

WHERE industry IS NULL

OR industry = ''

;

# We set all the blank values to null to simplify our SQL script.

UPDATE layoffs_staging2

SET industry = NULL

WHERE industry = ''

;

# Now let's identify the rows with null industry values whose companies appear elsewhere with the industry populated

SELECT t1.company, t1.industry, t1.location, t2.location, t2.company, t2.industry

FROM layoffs_staging2 t1

JOIN layoffs_staging2 t2

ON t1.company = t2.company

WHERE t2.industry IS NULL

AND t1.industry IS NOT NULL;

# Now we update the industry in the affected rows

UPDATE layoffs_staging2 t2

JOIN layoffs_staging2 t1

ON t1.company = t2.company

SET t2.industry = t1.industry

WHERE t2.industry IS NULL

AND t1.industry IS NOT NULL;

# We have now populated the missing industry fields with values taken from other rows of the same companies
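
Re-running the earlier null check shows whether any gaps remain; a company with no populated row anywhere in the table will still be null.

SELECT *
FROM layoffs_staging2
WHERE industry IS NULL;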

Removing Unnecessary Data

Our key metric for this dataset is total laid off. Since there is no data on total company size, the total laid off cannot be derived from the percentage laid off when it is missing, so entries without a total laid off value are not useful for this dataset.

SELECT *

FROM layoffs_staging2

WHERE total_laid_off IS NULL

;

DELETE

FROM layoffs_staging2

WHERE total_laid_off IS NULL

;
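
As a final check for this step, a count of the remaining nulls should come back as zero:

SELECT COUNT(*)
FROM layoffs_staging2
WHERE total_laid_off IS NULL;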

# Row_Num has served its purpose in identifying duplicates; it is not needed anymore in this dataset.

ALTER TABLE layoffs_staging2

DROP COLUMN Row_Num;

# The final dataset.

SELECT *

FROM layoffs_staging2

ORDER BY total_laid_off DESC

;
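
As an illustration of what the cleaned table now supports, a simple aggregation (shown only as a sketch) summarises total layoffs per year:

SELECT YEAR(`date`) AS layoff_year, SUM(total_laid_off) AS total_layoffs
FROM layoffs_staging2
GROUP BY YEAR(`date`)
ORDER BY layoff_year;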

Outcome

The outcome is a clean dataset ready for analysis. The SQL data cleaning project resulted in a thoroughly cleaned and structured dataset that is now analysis-ready. Duplicate entries were successfully identified and removed using advanced SQL window functions. Data standardisation ensured consistent naming conventions for company names, industries, countries, and locations, thereby eliminating analytical discrepancies. Text-based date formats were converted to standard SQL date formats, and missing values were intelligently populated using relational logic across matching records. Unnecessary and unusable rows—especially those lacking key metrics like total layoffs—were also removed to preserve the integrity of future analyses. This clean dataset provides a solid foundation for generating accurate insights into industry-specific layoffs, time-based patterns, and geographical impacts between 2020 and 2023, ultimately making it highly useful for data analysis, business intelligence reporting, and policy-making.

Conclusions

The problem addressed in this project was the presence of dirty, inconsistent, and incomplete data within a dataset of global company layoffs. Such data, if left uncleaned, would hinder meaningful analysis and decision-making. The benefits of solving this problem are far-reaching: clean data leads to better analytics, clearer insights, more efficient operations, and improved business decisions.

To solve the problem, the dataset was imported into MySQL, duplicated into a staging table to preserve the original data, and cleaned using a series of SQL queries. These queries handled duplication removal using row numbering, standardised categorical data, transformed date formats, replaced or removed null values, and filtered out irrelevant records.

Throughout the project, key knowledge areas were applied: data cleaning principles, SQL scripting, relational database theory, and best practices in data integrity. Skills demonstrated include writing efficient SQL queries, using Common Table Expressions (CTEs), applying window functions, performing data type transformations, and ensuring referential consistency. This reflects a solid understanding of both SQL and professional data cleaning methodologies.

The final outcome is a high-quality, analysis-ready dataset, fit for generating insights into global employment trends. For businesses with similar problems, my recommendation is to always begin by preserving the raw dataset, use staging tables to safely test changes, and apply step-by-step cleaning logic. It's also vital to understand the business context of the data, as this guides what to retain or discard. Investing time in thorough data cleaning saves effort during analysis and enhances the accuracy of all data-driven decisions.
