Sales Data Analysis with Pandas: A Complete End-to-End EDA Case Study By SanjayRam


Project Overview
In this blog, we walk through a real-world, FAANG-level exploratory data analysis (EDA) project using 12 months of e-commerce sales data from 2019. The dataset includes transaction details such as product, price, date, quantity, and location. We'll use the Python Pandas library to perform high-quality data wrangling, analysis, and visualization to uncover actionable business insights.
Tools & Technologies Used
Python 3.10+
Pandas for data manipulation
Matplotlib and Seaborn for visualization
Jupyter Notebook for interactive coding
Step 1: Data Loading & Merging
We first read all 12 CSV files using a list comprehension and pd.read_csv()
, then merge them using pd.concat()
.
files = ["January.csv", "February.csv", ..., "December.csv"]
all_data = pd.concat([pd.read_csv(f"data/raw/{file}") for file in files])
Step 2: Data Cleaning
Dropped all NaN rows using
dropna()
Removed header rows mixed in data
Converted relevant columns to correct types (
int
,float
,datetime
)
all_data = all_data.dropna(how='all')
all_data = all_data[all_data['Order Date'].str[0:2] != 'Or']
all_data['Quantity Ordered'] = pd.to_numeric(all_data['Quantity Ordered'])
all_data['Price Each'] = pd.to_numeric(all_data['Price Each'])
all_data['Order Date'] = pd.to_datetime(all_data['Order Date'])
Step 3: Feature Engineering
Month: Extracted from
Order Date
Sales:
Quantity Ordered * Price Each
City: Parsed from
Purchase Address
Hour: Extracted from datetime
all_data['Month'] = all_data['Order Date'].dt.month
all_data['Sales'] = all_data['Quantity Ordered'] * all_data['Price Each']
def get_city(address):
return address.split(',')[1]
def get_state(address):
return address.split(',')[2].split(' ')[1]
all_data['City'] = all_data['Purchase Address'].apply(lambda x: get_city(x) + ' ' + get_state(x))
all_data['Hour'] = all_data['Order Date'].dt.hour
Step 4: Exploratory Data Analysis
Q1: What was the best month for sales?
results = all_data.groupby('Month').sum()['Sales']
results.plot(kind='bar')
Insight: December had the highest sales revenue, suggesting strong holiday demand.
Q2: What city sold the most product?
results = all_data.groupby('City').sum()['Sales']
results.plot(kind='bar')
Insight: San Francisco topped all cities in sales.
Q3: What time should we display ads?
hours = all_data['Hour'].value_counts().sort_index()
hours.plot(kind='bar')
Insight: Peak hours for purchases were around 11 AM and 7 PM.
Q4: What products are most often sold together?
from itertools import combinations
from collections import Counter
df_dup = all_data[all_data['Order ID'].duplicated(keep=False)]
df_dup['Grouped'] = df_dup.groupby('Order ID')['Product'].transform(lambda x: ','.join(x))
df_dup = df_dup[['Order ID', 'Grouped']].drop_duplicates()
count = Counter()
for row in df_dup['Grouped']:
products = row.split(',')
count.update(Counter(combinations(products, 2)))
print(count.most_common(10))
Insight: iPhone and Lightning Charging Cable are frequently bought together.
Q5: What product sold the most? Why?
product_group = all_data.groupby('Product').sum()['Quantity Ordered']
prices = all_data.groupby('Product').mean()['Price Each']
product_group.plot(kind='bar')
Insight: USB-C Charging Cable sold the most, likely due to low cost and high utility.
Step 5: Data Visualization
Used
matplotlib
to create bar plots for:Monthly sales
Sales by city
Purchases by hour
Visualized product popularity and combination frequencies
Business Recommendations
Run promotional campaigns in December to maximize sales
Focus advertising in high-performing cities like San Francisco and Los Angeles
Place ads at 11 AM and 7 PM when customers are most likely to buy
Offer bundles for popular product combinations
Promote affordable accessories to drive volume
Project Files
Jupyter Notebook:
Sales_Analysis.ipynb
Cleaned Data CSV:
final_all_data.csv
Final Thoughts
This project showcases usage of Pandas, along with business intuition and storytelling. It translates raw transactional data into actionable strategies with clean visuals and code clarity.
Want to build a portfolio project that impresses? Start with real data, ask impactful questions, and tell a compelling story.
About the Author
Sanjay is an aspiring data scientist. Passionate about Python, analytics, and solving real business problems with data.
Connect: [LinkedIn]
Git Repository: [Git]
Subscribe to my newsletter
Read articles from Sanjayram directly inside your inbox. Subscribe to the newsletter, and don't miss out.
Written by
