Diving into Olist Customer Insights: A Data-Driven Exploration

Introduction

This blog post walks through a fascinating project that analyzes the Brazilian E-commerce Public Dataset by Olist to uncover valuable insights into customer behavior and segment the customer base. The project, found in the "olist-customer-insights" GitHub repository by prtk2403, aims to provide actionable recommendations for business improvement based on data analysis.

Project Link - Link

Project Made by - Pratheek Nistala

Project Goal and Data Overview

The core objective is to analyze the Olist dataset to understand various aspects of customer behavior, including demographics, purchasing patterns, popular products, satisfaction levels, and ultimately, to identify distinct customer segments. These insights are intended to drive improvements in areas like marketing strategies, customer retention, and operational efficiency.

The project uses the publicly available Brazilian E-Commerce Public Dataset by Olist. This is a rich collection of data distributed across nine different CSV files:

  • olist_orders_dataset.csv: Contains order details like ID, customer ID, status, and timestamps.

  • olist_order_items_dataset.csv: Links orders to products and sellers, including price and freight value.

  • olist_customers_dataset.csv: Provides customer information (ID, unique ID, zip code, city, state).

  • olist_order_payments_dataset.csv: Details payment types, installments, and payment values.

  • olist_order_reviews_dataset.csv: Includes customer review scores and comments.

  • olist_products_dataset.csv: Basic product details (ID, category, dimensions).

  • olist_sellers_dataset.csv: Seller details (ID, zip code, city, state).

  • olist_geolocation_dataset.csv: Provides latitude and longitude for zip codes.

  • product_category_name_translation.csv: Translates Portuguese product category names to English.

Technologies Utilized

The project is built using standard data science tools in Python:

  • Python: The primary programming language.

  • Pandas: Essential for data manipulation and creating DataFrames.

  • NumPy: Used for numerical operations.

  • Matplotlib & Seaborn: Libraries for creating static data visualizations. Seaborn enhances Matplotlib with more appealing statistical graphics.

  • Plotly (Express & Graph Objects): Used for creating interactive data visualizations. Plotly allows for features like tooltips and zooming. make_subplots is used for figures with multiple plots.

  • Jupyter Notebook: The interactive environment used for developing and presenting the analysis.

Breaking Down the Analysis Notebook

Let's follow the structure of the main notebook (olist-customer-insights.ipynb):

1. Introduction and Setup (Importing Libraries)

The first step involves importing all necessary libraries:

import numpy as np # For numerical operations
import pandas as pd # For data manipulation and analysis
import matplotlib.pyplot as plt # For static plotting
import seaborn as sns # For static statistical graphics
import plotly.express as px # For interactive plotting
import plotly.graph_objects as go # For interactive plotting
from plotly.subplots import make_subplots # For multiple plots in one figure
import warnings
warnings.filterwarnings('ignore') # Suppresses warning messages

Explanation: This block ensures all the tools needed for data processing, analysis, and visualization are available. Suppressing warnings helps keep the output clean.

2. Data Loading

This section reads the nine CSV files into separate Pandas DataFrames:

customers = pd.read_csv('../input/olist-brazilian-ecommerce/olist_customers_dataset.csv')
# ... (similar pd.read_csv() calls for all other 8 datasets)

Explanation: pd.read_csv() is the function used to load data from CSV files into DataFrames. The file paths (../input/...) suggest the notebook was run in an environment like Kaggle.
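
For brevity, only one read is shown above. A compact alternative (a sketch, assuming the same Kaggle input directory and the file names listed earlier) loads all nine files in one pass:

base_path = '../input/olist-brazilian-ecommerce/'
file_names = {
    'customers': 'olist_customers_dataset.csv',
    'orders': 'olist_orders_dataset.csv',
    'order_items': 'olist_order_items_dataset.csv',
    'payments': 'olist_order_payments_dataset.csv',
    'reviews': 'olist_order_reviews_dataset.csv',
    'products': 'olist_products_dataset.csv',
    'sellers': 'olist_sellers_dataset.csv',
    'geolocation': 'olist_geolocation_dataset.csv',
    'prod_cat_translation': 'product_category_name_translation.csv',
}
# Read each CSV into its own DataFrame, keyed by a short name
dataframes = {name: pd.read_csv(base_path + fname) for name, fname in file_names.items()}
orders, customers = dataframes['orders'], dataframes['customers']  # and likewise for the rest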

3. Data Preprocessing & Cleaning

This is a crucial phase to prepare the raw data for analysis.

  • a. Checking Data Shapes and Info: The notebook initially checks the dimensions (.shape) and gets a summary of data types and non-null values (.info()) for each DataFrame. This gives a first look at the data size and potential issues.

  • b. Checking for Missing Values: The .isnull().sum() method is used to count the number of missing values (NaN) in each column across all DataFrames. The analysis observes missing values in:

    • Product details (product_category_name, dimensions, weight, photos).

    • Order timestamps (order_approved_at, order_delivered_carrier_date, order_delivered_customer_date). These are often expected for pending or cancelled orders.

    • Review comments (review_comment_title, review_comment_message). Not all customers leave comments.
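
    These checks boil down to a few one-liners (a minimal sketch, shown here for the orders DataFrame only):

      print(orders.shape)   # (rows, columns)
      orders.info()         # column names, dtypes, and non-null counts
      # Missing (NaN) values per column, largest first
      print(orders.isnull().sum().sort_values(ascending=False))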

  • c. Handling Missing Values: The notebook doesn't apply a single, blanket strategy to all missing values; instead, they are handled case by case or implicitly by Pandas during later operations. In a real project, the right treatment depends on the feature in question and the type of analysis.

  • d. Data Type Conversion (Dates): Timestamp columns, initially loaded as strings, are converted to datetime objects using pd.to_datetime().

      orders['order_purchase_timestamp'] = pd.to_datetime(orders['order_purchase_timestamp'])
      # ... and for other timestamp columns
    

    Explanation: This conversion is vital for performing time-based calculations and extracting temporal features.
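
    Converting each column individually gets repetitive; a short loop (a sketch, assuming the standard Olist timestamp column names) handles them all:

      # Convert every order timestamp column from string to datetime
      date_cols = ['order_purchase_timestamp', 'order_approved_at',
                   'order_delivered_carrier_date', 'order_delivered_customer_date',
                   'order_estimated_delivery_date']
      for col in date_cols:
          orders[col] = pd.to_datetime(orders[col])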

  • e. Merging DataFrames: This is a critical step where the individual datasets are combined into a single master DataFrame (df).

      df = orders.merge(order_items, on='order_id', how='left') # Merge orders and order_items on order_id
      df = df.merge(payments, on='order_id', how='left') # Merge with payments on order_id
      df = df.merge(reviews, on='order_id', how='left') # Merge with reviews on order_id
      df = df.merge(products, on='product_id', how='left') # Merge with products on product_id
      df = df.merge(customers, on='customer_id', how='left') # Merge with customers on customer_id
      df = df.merge(sellers, on='seller_id', how='left') # Merge with sellers on seller_id
      df = df.merge(prod_cat_translation, on='product_category_name', how='left') # Merge product category names
    

    Explanation: The how='left' argument ensures that all rows from the original (left) DataFrame are kept. If a key (like order_id) doesn't have a match in the right DataFrame, the columns from the right side will have missing values (NaN). This results in a consolidated DataFrame allowing for holistic analysis.

  • f. Feature Engineering: New, more informative features are created from existing data.

    • Time-based features: Extracting components like year, month, day, hour from purchase timestamps.

        df['order_purchase_year'] = df['order_purchase_timestamp'].apply(lambda x: x.year)
        df['order_purchase_month'] = df['order_purchase_timestamp'].apply(lambda x: x.month_name())
        # ... and similar
      

      Explanation: Using .apply() with a lambda function or the .dt accessor extracts specific parts of the datetime.
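
      The .dt accessor is the vectorized route and avoids the per-row lambda entirely (a brief sketch, not taken from the notebook; the day/hour column names are illustrative):

        # Same features via the vectorized .dt accessor
        df['order_purchase_year'] = df['order_purchase_timestamp'].dt.year
        df['order_purchase_month'] = df['order_purchase_timestamp'].dt.month_name()
        df['order_purchase_day'] = df['order_purchase_timestamp'].dt.day_name()
        df['order_purchase_hour'] = df['order_purchase_timestamp'].dt.hour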

    • Delivery Time: Calculating the difference between delivery and purchase dates in days.

        df['delivery_time'] = (df['order_delivered_customer_date'] - df['order_purchase_timestamp']).dt.days
      

      Explanation: Subtracting datetime objects yields a Timedelta, and .dt.days extracts the day difference. This is a key metric for satisfaction.

    • Estimated vs. Actual Delivery: Calculating the difference between estimated and actual delivery dates.

        df['estimated_del_time'] = (df['order_estimated_delivery_date'] - df['order_delivered_customer_date']).dt.days
      

      Explanation: Positive values mean early delivery, negative means late.

    • Order Status Simplification: Grouping various order statuses into simpler categories like 'Success' or 'Failure'.

        df['order_status_upd'] = np.where(df.order_status.isin(['approved', 'processing', 'shipped', 'delivered', 'invoiced']), 'Success', 'Failure')
      

      Explanation: np.where() is a vectorized conditional operation.

    • Review Score to Sentiment: Converting numerical review scores (1-5) into categorical sentiments ('Positive', 'Neutral', 'Negative').

        # Define a function that maps a review score to a sentiment label
        def new_cols(row):
            if row['review_score'] == 5 or row['review_score'] == 4:
                return 'Positive'
            elif row['review_score'] == 3:
                return 'Neutral'
            else:
                return 'Negative'
        # Apply the function row-wise
        df['review_type'] = df.apply(new_cols, axis=1)
      

      Explanation: .apply(..., axis=1) applies a function to each row.
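
      On a frame this size the row-wise apply can be slow; a vectorized alternative (a sketch using np.select, not the notebook's approach) produces the same labels:

        # Vectorized version: map review_score directly to sentiment labels
        conditions = [df['review_score'] >= 4, df['review_score'] == 3]
        df['review_type'] = np.select(conditions, ['Positive', 'Neutral'], default='Negative')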

4. Exploratory Data Analysis (EDA) and Visualizations

This is where insights are extracted using various plots. The repository also includes short "Interview Explanations" for each visualization, useful when discussing the findings.

  • a. Univariate Analysis (Single Variables): Analyzing the distribution of single features.

    • Order Status Distribution: Shows the frequency of each order status. Mostly 'delivered' is expected, but 'canceled' or 'unavailable' rates are important. Visualized with bar or pie charts.

    • Payment Type Distribution: Visualizes the popularity of payment methods. 'Credit card' is often dominant. Visualized with bar or pie charts.

    • Review Score Distribution: Shows how customer review scores (1-5) are distributed. A high number of 5-star reviews is positive, but 1-star reviews warrant investigation. Visualized with bar charts.
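
    As an illustration, the review score distribution reduces to a value count plus a bar chart (a minimal sketch using Plotly Express; labels and title are my own):

      # Distribution of review scores as an interactive bar chart
      score_counts = df['review_score'].value_counts().sort_index()
      fig = px.bar(x=score_counts.index, y=score_counts.values,
                   labels={'x': 'Review score', 'y': 'Number of reviews'},
                   title='Review Score Distribution')
      fig.show()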

  • b. Bivariate/Multivariate Analysis (Relationships): Analyzing relationships between multiple variables.

    • Monthly Orders: Shows the trend of order volume over months and years. Helps identify seasonal trends (e.g., peak around November). Visualized with line charts.

    • Orders by Day of Week / Hour of Day: Identifies which times are most popular for ordering. Useful for staffing and timing promotions. Visualized with bar charts.

    • Top Product Categories: Identifies the most popular or highest-revenue categories. Helps with inventory and targeted marketing. Visualized with bar charts.

    • Customer Location (State/City): Shows the geographical distribution of customers. Important for logistics and regional marketing. Visualized with bar charts or potentially choropleth maps.

    • Delivery Time Analysis: Examines typical delivery times and their variability, and compares actual vs. estimated delivery. Visualized with histograms or boxplots, potentially broken down by state. Late deliveries (negative estimated_del_time) indicate areas for improvement.

    • Review Score vs. Delivery Time: Explores the correlation between delivery speed and customer satisfaction. Longer delivery times often correlate with lower review scores. Visualized with boxplot or violin plot.

    • Payment Installments vs. Payment Value: Examines if higher value orders use more installments. Useful for credit policy and payment options. Visualized with scatter plot or boxplot.
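
    For example, the installments-versus-value relationship can be inspected with a single scatter plot (a brief sketch, assuming the payment_installments and payment_value columns from the payments table):

      # Do higher-value orders tend to use more installments?
      fig = px.scatter(df.dropna(subset=['payment_installments', 'payment_value']),
                       x='payment_installments', y='payment_value',
                       labels={'payment_installments': 'Installments', 'payment_value': 'Payment value (BRL)'},
                       title='Payment Installments vs. Payment Value')
      fig.show()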

5. RFM Analysis (Recency, Frequency, Monetary)

This is a common and powerful technique for customer segmentation based on purchasing behavior.

  • Recency: How recently a customer made their last purchase. Lower recency (fewer days since last purchase) is better.

  • Frequency: How often a customer makes purchases (number of orders). Higher frequency is better.

  • Monetary Value: How much money a customer spends in total. Higher monetary value is better.

The process involves:

  1. Calculating R, F, and M values for each unique customer by grouping the data by customer ID and applying aggregation functions.

     # Calculate Recency, Frequency, Monetary value for each customer
     snapshot_date = df['order_purchase_timestamp'].max() + pd.Timedelta(days=1)
     rfm_data = df.groupby('customer_unique_id').agg({
         'order_purchase_timestamp': lambda x: (snapshot_date - x.max()).days, # Recency
         'order_id': 'count', # Frequency
         'payment_value': 'sum' # Monetary
     }).rename(columns={'order_purchase_timestamp': 'Recency', 'order_id': 'Frequency', 'payment_value': 'MonetaryValue'})
    

    Explanation: groupby('customer_unique_id') groups all rows for a single customer. .agg() applies calculations to specific columns within each group. x.max() gets the latest purchase date for that customer, and subtracting from a snapshot date gives Recency. 'count' counts orders (Frequency), and 'sum' sums payment values (Monetary).

  2. Creating RFM Scores: Assigning scores (often 1-5 or 1-4 using quantiles) based on the calculated R, F, M values. Recency scores are typically inverse (lower days = higher score), while Frequency and Monetary are direct (higher value = higher score).

     # Example using qcut for quartiles
     r_labels = range(4, 0, -1) # Scores 4 to 1 for Recency (4=most recent)
     f_labels = range(1, 5) # Scores 1 to 4 for Frequency (4=most frequent)
     m_labels = range(1, 5) # Scores 1 to 4 for Monetary (4=highest spend)
     rfm_data['R_score'] = pd.qcut(rfm_data['Recency'], q=4, labels=r_labels)
     rfm_data['F_score'] = pd.qcut(rfm_data['Frequency'].rank(method='first'), q=4, labels=f_labels)
     rfm_data['M_score'] = pd.qcut(rfm_data['MonetaryValue'], q=4, labels=m_labels)
    
  3. Combining Scores & Defining Segments: Concatenating R, F, M scores to get an RFM segment string (e.g., "444" for the best customers) or summing them for a composite score. Customers are then assigned to named segments like "Champions," "Loyal Customers," "At Risk," "Hibernating," etc., based on these scores.

     rfm_data['RFM_Segment'] = rfm_data['R_score'].astype(str) + rfm_data['F_score'].astype(str) + rfm_data['M_score'].astype(str)
     rfm_data['RFM_Score'] = rfm_data[['R_score', 'F_score', 'M_score']].astype(int).sum(axis=1)  # cast the categorical scores to int before summing
    
     # Example segmentation logic based on RFM_Score
     def rfm_level(df):
         if df['RFM_Score'] >= 9: return 'Champions'
         # ... other levels defined by score ranges
     rfm_data['RFM_Level'] = rfm_data.apply(rfm_level, axis=1)
    
  4. Visualizing RFM Segments: Using bar charts to show segment counts or treemaps to visualize segments by size or monetary value. Treemaps are particularly useful for showing the monetary contribution of each segment. For example, "Champions" and "Loyal Customers" might be smaller in number but contribute a significant portion of revenue.
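
     A treemap of the segments takes only a few lines with Plotly Express (a sketch, assuming the rfm_data frame and RFM_Level column built above):

     # One tile per RFM level, sized by customer count and colored by average spend
     segment_summary = rfm_data.groupby('RFM_Level').agg(
         customer_count=('RFM_Score', 'count'),
         avg_monetary=('MonetaryValue', 'mean')).reset_index()
     fig = px.treemap(segment_summary, path=['RFM_Level'], values='customer_count',
                      color='avg_monetary', title='Customer Segments by RFM Level')
     fig.show()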

Purpose of RFM: RFM segmentation allows for targeted marketing campaigns, personalized offers, and tailored customer retention strategies. Different segments require different approaches (e.g., loyalty programs for Champions, re-engagement for At Risk).

6. Conclusion & Recommendations

The notebook typically concludes by summarizing key findings from the analysis. Recommendations are derived from these insights, such as:

  • Focusing marketing efforts during peak purchasing times.

  • Improving logistics in regions with long delivery times.

  • Targeting high-value RFM segments with specific programs.

  • Investigating reasons for negative reviews, especially those linked to delivery issues or specific products.

  • Adjusting inventory based on popular categories.

Conclusion

This project demonstrates a thorough approach to understanding customer behavior using the Olist e-commerce dataset. By cleaning and integrating data, engineering relevant features, performing extensive exploratory data analysis, and applying techniques like RFM segmentation, valuable insights are derived. These insights, covering purchase trends, geographical distribution, delivery performance, and customer segments, provide a data-driven foundation for Olist to make informed business decisions and improve the customer experience.

Written by

Pratheek Nistala

Hi, I’m Pratheek Nistala, a passionate Full Stack Developer and final-year B.Tech student specializing in Computer Science with a focus on Data Science at VIT, Vellore. My expertise lies in building scalable web applications using modern technologies like Next.js, Prisma ORM, and MongoDB, with a keen interest in leveraging AI for innovative solutions. I’ve developed projects like NoteGenie, an AI-powered notes app with semantic search, and Pulse, a healthcare management system, showcasing my ability to merge functionality with intuitive design using tools like Tailwind CSS and Clerk. As a Senior Core Member of the VIT Linux User Group, I actively promote open-source technologies and contribute to the developer community. Follow me as I share my journey, insights, and tutorials on full-stack development, data science, and emerging technologies!