Project Overview

In this blog, we walk through a real-world, FAANG-level exploratory data analysis (EDA) project using 12 months of e-commerce sales data from 2019. The dataset includes transaction details such as product, price, date, quantity, and location. We'll use the Python Pandas library to perform high-quality data wrangling, analysis, and visualization to uncover actionable business insights.

Tools & Technologies Used

Python 3.10+
Pandas for data manipulation
Matplotlib and Seaborn for visualization
Jupyter Notebook for interactive coding

Step 1: Data Loading & Merging

We first read all 12 CSV files using a list comprehension and pd.read_csv(), then merge them using pd.concat().

files = ["January.csv", "February.csv", ..., "December.csv"]
all_data = pd.concat([pd.read_csv(f"data/raw/{file}") for file in files])

Step 2: Data Cleaning

Dropped all NaN rows using dropna()
Removed header rows mixed in data
Converted relevant columns to correct types (int, float, datetime)

all_data = all_data.dropna(how='all')
all_data = all_data[all_data['Order Date'].str[0:2] != 'Or']
all_data['Quantity Ordered'] = pd.to_numeric(all_data['Quantity Ordered'])
all_data['Price Each'] = pd.to_numeric(all_data['Price Each'])
all_data['Order Date'] = pd.to_datetime(all_data['Order Date'])

Step 3: Feature Engineering

Month: Extracted from Order Date
Sales: Quantity Ordered * Price Each
City: Parsed from Purchase Address
Hour: Extracted from datetime

all_data['Month'] = all_data['Order Date'].dt.month
all_data['Sales'] = all_data['Quantity Ordered'] * all_data['Price Each']

def get_city(address):
    return address.split(',')[1]

def get_state(address):
    return address.split(',')[2].split(' ')[1]

all_data['City'] = all_data['Purchase Address'].apply(lambda x: get_city(x) + ' ' + get_state(x))
all_data['Hour'] = all_data['Order Date'].dt.hour

Step 4: Exploratory Data Analysis

Q1: What was the best month for sales?

results = all_data.groupby('Month').sum()['Sales']
results.plot(kind='bar')

Insight: December had the highest sales revenue, suggesting strong holiday demand.

Q2: What city sold the most product?

results = all_data.groupby('City').sum()['Sales']
results.plot(kind='bar')

Insight: San Francisco topped all cities in sales.

Q3: What time should we display ads?

hours = all_data['Hour'].value_counts().sort_index()
hours.plot(kind='bar')

Insight: Peak hours for purchases were around 11 AM and 7 PM.

Q4: What products are most often sold together?

from itertools import combinations
from collections import Counter

df_dup = all_data[all_data['Order ID'].duplicated(keep=False)]
df_dup['Grouped'] = df_dup.groupby('Order ID')['Product'].transform(lambda x: ','.join(x))
df_dup = df_dup[['Order ID', 'Grouped']].drop_duplicates()

count = Counter()
for row in df_dup['Grouped']:
    products = row.split(',')
    count.update(Counter(combinations(products, 2)))

print(count.most_common(10))

Insight: iPhone and Lightning Charging Cable are frequently bought together.

Q5: What product sold the most? Why?

product_group = all_data.groupby('Product').sum()['Quantity Ordered']
prices = all_data.groupby('Product').mean()['Price Each']
product_group.plot(kind='bar')

Insight: USB-C Charging Cable sold the most, likely due to low cost and high utility.

Step 5: Data Visualization

Used matplotlib to create bar plots for:
- Monthly sales
- Sales by city
- Purchases by hour
Visualized product popularity and combination frequencies

Business Recommendations

Run promotional campaigns in December to maximize sales
Focus advertising in high-performing cities like San Francisco and Los Angeles
Place ads at 11 AM and 7 PM when customers are most likely to buy
Offer bundles for popular product combinations
Promote affordable accessories to drive volume

Project Files

Jupyter Notebook: Sales_Analysis.ipynb
Cleaned Data CSV: final_all_data.csv

Final Thoughts

This project showcases usage of Pandas, along with business intuition and storytelling. It translates raw transactional data into actionable strategies with clean visuals and code clarity.

Want to build a portfolio project that impresses? Start with real data, ask impactful questions, and tell a compelling story.

About the Author

Sanjay is an aspiring data scientist. Passionate about Python, analytics, and solving real business problems with data.

Connect: [LinkedIn]

Git Repository: [Git]

Sales Data Analysis with Pandas: A Complete End-to-End EDA Case Study By SanjayRam