Sales Data Analysis with Pandas: A Complete End-to-End EDA Case Study By SanjayRam

SanjayramSanjayram
3 min read

Project Overview

In this blog, we walk through a real-world, FAANG-level exploratory data analysis (EDA) project using 12 months of e-commerce sales data from 2019. The dataset includes transaction details such as product, price, date, quantity, and location. We'll use the Python Pandas library to perform high-quality data wrangling, analysis, and visualization to uncover actionable business insights.


Tools & Technologies Used

  • Python 3.10+

  • Pandas for data manipulation

  • Matplotlib and Seaborn for visualization

  • Jupyter Notebook for interactive coding


Step 1: Data Loading & Merging

We first read all 12 CSV files using a list comprehension and pd.read_csv(), then merge them using pd.concat().

files = ["January.csv", "February.csv", ..., "December.csv"]
all_data = pd.concat([pd.read_csv(f"data/raw/{file}") for file in files])

Step 2: Data Cleaning

  • Dropped all NaN rows using dropna()

  • Removed header rows mixed in data

  • Converted relevant columns to correct types (int, float, datetime)

all_data = all_data.dropna(how='all')
all_data = all_data[all_data['Order Date'].str[0:2] != 'Or']
all_data['Quantity Ordered'] = pd.to_numeric(all_data['Quantity Ordered'])
all_data['Price Each'] = pd.to_numeric(all_data['Price Each'])
all_data['Order Date'] = pd.to_datetime(all_data['Order Date'])

Step 3: Feature Engineering

  • Month: Extracted from Order Date

  • Sales: Quantity Ordered * Price Each

  • City: Parsed from Purchase Address

  • Hour: Extracted from datetime

all_data['Month'] = all_data['Order Date'].dt.month
all_data['Sales'] = all_data['Quantity Ordered'] * all_data['Price Each']

def get_city(address):
    return address.split(',')[1]

def get_state(address):
    return address.split(',')[2].split(' ')[1]

all_data['City'] = all_data['Purchase Address'].apply(lambda x: get_city(x) + ' ' + get_state(x))
all_data['Hour'] = all_data['Order Date'].dt.hour

Step 4: Exploratory Data Analysis

Q1: What was the best month for sales?

results = all_data.groupby('Month').sum()['Sales']
results.plot(kind='bar')

Insight: December had the highest sales revenue, suggesting strong holiday demand.


Q2: What city sold the most product?

results = all_data.groupby('City').sum()['Sales']
results.plot(kind='bar')

Insight: San Francisco topped all cities in sales.


Q3: What time should we display ads?

hours = all_data['Hour'].value_counts().sort_index()
hours.plot(kind='bar')

Insight: Peak hours for purchases were around 11 AM and 7 PM.


Q4: What products are most often sold together?

from itertools import combinations
from collections import Counter

df_dup = all_data[all_data['Order ID'].duplicated(keep=False)]
df_dup['Grouped'] = df_dup.groupby('Order ID')['Product'].transform(lambda x: ','.join(x))
df_dup = df_dup[['Order ID', 'Grouped']].drop_duplicates()

count = Counter()
for row in df_dup['Grouped']:
    products = row.split(',')
    count.update(Counter(combinations(products, 2)))

print(count.most_common(10))

Insight: iPhone and Lightning Charging Cable are frequently bought together.


Q5: What product sold the most? Why?

product_group = all_data.groupby('Product').sum()['Quantity Ordered']
prices = all_data.groupby('Product').mean()['Price Each']
product_group.plot(kind='bar')

Insight: USB-C Charging Cable sold the most, likely due to low cost and high utility.


Step 5: Data Visualization

  • Used matplotlib to create bar plots for:

    • Monthly sales

    • Sales by city

    • Purchases by hour

  • Visualized product popularity and combination frequencies


Business Recommendations

  1. Run promotional campaigns in December to maximize sales

  2. Focus advertising in high-performing cities like San Francisco and Los Angeles

  3. Place ads at 11 AM and 7 PM when customers are most likely to buy

  4. Offer bundles for popular product combinations

  5. Promote affordable accessories to drive volume


Project Files

  • Jupyter Notebook: Sales_Analysis.ipynb

  • Cleaned Data CSV: final_all_data.csv


Final Thoughts

This project showcases usage of Pandas, along with business intuition and storytelling. It translates raw transactional data into actionable strategies with clean visuals and code clarity.

Want to build a portfolio project that impresses? Start with real data, ask impactful questions, and tell a compelling story.


About the Author

Sanjay is an aspiring data scientist. Passionate about Python, analytics, and solving real business problems with data.

Connect: [LinkedIn]

Git Repository: [Git]

0
Subscribe to my newsletter

Read articles from Sanjayram directly inside your inbox. Subscribe to the newsletter, and don't miss out.

Written by

Sanjayram
Sanjayram