A Supply Chain Analytics Journey: Data Ingestion and Preparation with Python

https://github.com/ahamadc/Supply-Chain-Optimisation
This project is an end-to-end solution that leverages MySQL, Python and Power BI to tackle a real-world supply chain problem. Using a dataset from Kaggle that includes various tables of historical orders, I aim to assign optimal routes for orders while minimising freight and warehouse costs under various constraints. The project will showcase the full pipeline, from raw data to insights.
The ingestion and preparation phase involves cleaning, transforming and validating the raw data using Python before it can be fed into MySQL for further analysis. Like dicing an onion before using it in a recipe.
Project Workflow
Data exploration and cleaning using Python (Pandas)
Data upload and schema creation in MySQL
Querying and joining tables using SQL
Cost and route optimization logic (Python or SQL)
Final visualizations and insights using Power BI
Understanding the Dataset
Before diving into cleaning, it’s important to understand the raw data. I explored each table by loading it into Pandas to look at the structure, key columns and any initial data quality issues.
The tables were loaded as dataframes, assigned to variables and put into a list.
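The loading step itself isn’t shown in the snippet below, so here is a minimal sketch of how the dataframes could be created, assuming the raw tables are CSV exports; the file names are illustrative assumptions, not the actual repository layout.
import pandas as pd

# Assumed file names for illustration; adjust to the actual export paths
order_list = pd.read_csv('OrderList.csv')
freight_rates = pd.read_csv('FreightRates.csv')
whcosts = pd.read_csv('WhCosts.csv')
whcapacities = pd.read_csv('WhCapacities.csv')
products_per_plant = pd.read_csv('ProductsPerPlant.csv')
vmi_customers = pd.read_csv('VmiCustomers.csv')
plant_ports = pd.read_csv('PlantPorts.csv')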
tables = [order_list, freight_rates, whcosts, whcapacities, products_per_plant, vmi_customers, plant_ports]
for i in tables:
    # Check the first few rows
    print(i.head())
    # Get a concise summary of the DataFrame
    print(i.info())
    # Get basic statistics
    print(i.describe())
The output showed that all the tables were loaded as dataframes and contain the correct columns. Data types are generally appropriate for the context, but a few changes are needed. The number of rows per table varies from tens to thousands.
Order Date column in Order List table is not set to date
The Order Date column in the Order List table needs to be converted to a datetime data type.
order_list = tables[0]
#change data type of Order Date column to date
order_list['Order Date'] = pd.to_datetime(order_list['Order Date'], format='%d/%m/%Y')
#check the data types of columns to see if change was made
print(order_list.info())
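If any rows contain dates that don’t match this format, pd.to_datetime will raise an error. A slightly more defensive variant (a sketch, not what the code above does) coerces unparseable values so they can be reviewed instead of crashing the run.
# Coerce unparseable strings to NaT so they can be inspected rather than raising an error
order_list['Order Date'] = pd.to_datetime(order_list['Order Date'], format='%d/%m/%Y', errors='coerce')

# Review any rows whose dates failed to parse
print(order_list[order_list['Order Date'].isna()])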
Some numeric columns in Freight Rates table contain non-numeric characters
First, I iterated through each character of every value in those columns to identify which non-numeric characters are present and need to be handled.
unique_non_numeric = set()
for item in freight_rates['rate']:
    s = str(item)
    for char in s:
        if not char.isdigit():
            unique_non_numeric.add(char)
print("Unique non-numeric characters:", unique_non_numeric)
The result showed that there were $ signs, commas and spaces in the values that prevented them from being recognised as float values. This was then fixed:
freight_rates['max_wgh_qty'] = freight_rates['max_wgh_qty'].astype(str).replace({',': '', ' ': ''}, regex = True).astype(float)
freight_rates['minimum cost'] = freight_rates['minimum cost'].astype(str).replace({',': '', ' ': '', r'\$': ''}, regex = True).astype(float)
freight_rates['rate'] = freight_rates['rate'].astype(str).replace({',': '', ' ': '', r'\$': ''}, regex = True).astype(float)
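A quick sanity check after the conversion helps confirm that the three columns are now numeric and that the string cleaning didn’t silently introduce missing values; a minimal sketch:
# Confirm the cleaned columns are now floats
print(freight_rates[['max_wgh_qty', 'minimum cost', 'rate']].dtypes)

# Confirm no values were lost during the string cleaning
print(freight_rates[['max_wgh_qty', 'minimum cost', 'rate']].isna().sum())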
Check null values in WhCosts table
whcosts = tables[2]
print(whcosts[whcosts['WH'].isnull()])
The row with a null value in the WH column is empty across all its columns, so it can be removed.
whcosts = whcosts.dropna(subset=['WH'])
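Since dropna returns a new dataframe, the tables list still points at the old whcosts. A small follow-up check (an illustrative sketch, not part of the original walkthrough) updates the list and counts remaining nulls in every table:
# Update the list so it references the cleaned dataframe
tables[2] = whcosts

# Count remaining nulls in every table
for t in tables:
    print(t.isnull().sum())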
Plant Ports table has unnamed columns that aren’t meant to be there
First, I need to check which columns are in the table.
print(plant_ports.columns)
The columns we need are Plant Code and Port.
plant_ports = plant_ports[['Plant Code', 'Port']]
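Selecting the two named columns works here because the layout is known. An alternative sketch, assuming the stray columns appear with 'Unnamed: N' headers (which is how Pandas labels blank header cells), drops them by pattern instead:
# Drop any columns whose header starts with 'Unnamed'
plant_ports = plant_ports.loc[:, ~plant_ports.columns.str.startswith('Unnamed')]
print(plant_ports.columns)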
Confirming naming standards with Plant Code
print(order_list['Plant Code'].unique())
print(products_per_plant['Plant Code'].unique())
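Printing the unique values lets me eyeball the two lists, but a set comparison makes any mismatch explicit; a small sketch of that check:
order_plants = set(order_list['Plant Code'].unique())
product_plants = set(products_per_plant['Plant Code'].unique())

# Plant codes that appear in one table but not the other
print("In orders but not in products:", order_plants - product_plants)
print("In products but not in orders:", product_plants - order_plants)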
The final step is to save the cleaned dataframes to new CSV files.
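A minimal sketch of that export step, assuming the output file names below (they are illustrative, not the repository's actual names):
cleaned = {
    'order_list_clean.csv': order_list,
    'freight_rates_clean.csv': freight_rates,
    'whcosts_clean.csv': whcosts,
    'whcapacities_clean.csv': whcapacities,
    'products_per_plant_clean.csv': products_per_plant,
    'vmi_customers_clean.csv': vmi_customers,
    'plant_ports_clean.csv': plant_ports,
}

# Write each cleaned dataframe to its own CSV without the index column
for filename, df in cleaned.items():
    df.to_csv(filename, index=False)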
Next Steps:
Load cleaned data into MySQL
Build SQL queries for joining and calculating optimal assignments
Visualise insights in Power BI