Exploring the Retail Sales Kaggle Dataset
Hello everyone! I'm excited to share insights from my first task in the HNG11 internship Data Analytics track. Let's review the dataset and see what we can discover!
Introduction
The HNG internship is a fast-paced bootcamp for learning digital skills such as software development, data analytics, software testing, DevOps, and design, to name a few. It also offers networking opportunities, collaboration with fellow tech enthusiasts, and access to exclusive job openings through the HNG premium network.
This task involves analyzing a dataset consisting of 2823 rows and 25 columns, containing details about individual orders, customer information, product data, and sales. The goal is to understand the dataset's structure and derive initial insights from a preliminary exploration.
Observations
Diverse Data Types
- The dataset includes 25 columns: 16 categorical (e.g., order status, product line, country) and 9 numerical (e.g., order quantity, price, year).
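A split like this can be counted with `select_dtypes`. The snippet below is a minimal sketch: the three-row `sales_df` and its column names are hypothetical stand-ins for the real data, which would normally be loaded from the Kaggle CSV.

```python
import pandas as pd

# Hypothetical sample standing in for the real sales_df (column names are assumptions).
sales_df = pd.DataFrame({
    "ORDERNUMBER": [10107, 10121, 10134],
    "QUANTITYORDERED": [30, 34, 41],
    "PRICEEACH": [95.70, 81.35, 94.74],
    "STATUS": ["Shipped", "Shipped", "Shipped"],
    "COUNTRY": ["USA", "France", "France"],
})

# Numeric columns (integers and floats) versus everything else.
numeric_cols = sales_df.select_dtypes(include="number").columns.tolist()
categorical_cols = sales_df.select_dtypes(exclude="number").columns.tolist()

print(len(numeric_cols), "numerical:", numeric_cols)
print(len(categorical_cols), "categorical:", categorical_cols)
```

Note that ID-like columns such as an order number are stored as numbers, so they land in the numerical group even though they are not really quantities.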
Missing Values
Several columns, such as additional addresses, state information, postal codes, and territory details, have missing values, which may limit some aspects of the analysis.
    # Identify the columns with missing data
    null_count = sales_df.isnull().sum()

    # Filter columns with missing values greater than 0
    null_count[null_count > 0]

    # Result
    # ADDRESSLINE2    2521
    # STATE           1486
    # POSTALCODE        76
    # TERRITORY       1074
Varied Order Sizes and Prices
Order sizes range from 6 to 97 items, with an average of 35 items per order.
Item prices vary from $26.88 to $100, averaging about $83.66.
Sales Figures
Sales amounts range from $482.13 to $14,082.80, with an average sale of approximately $3,553.89.
Manufacturer's suggested retail prices (MSRP) vary from $33 to $214, averaging around $100.
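Ranges and averages like the ones above can be pulled in one step with `agg`. This is a sketch on a hypothetical three-row sample; the column names `QUANTITYORDERED`, `PRICEEACH`, `SALES`, and `MSRP` are assumptions about how the dataset labels these fields.

```python
import pandas as pd

# Hypothetical sample rows; the real sales_df has 2823 rows.
sales_df = pd.DataFrame({
    "QUANTITYORDERED": [30, 34, 41],
    "PRICEEACH": [95.70, 81.35, 94.74],
    "SALES": [2871.00, 2765.90, 3884.34],
    "MSRP": [95, 95, 95],
})

# One row per statistic, one column per numeric field.
summary = sales_df[["QUANTITYORDERED", "PRICEEACH", "SALES", "MSRP"]].agg(
    ["min", "max", "mean"]
)
print(summary.round(2))
```

`sales_df.describe()` gives the same figures along with quartiles and standard deviation if a fuller picture is needed.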
Product Diversity
- The dataset includes multiple product types, providing an opportunity to analyze sales performance across different categories.
Seasonal Trends
The dataset covers sales throughout the year, with a notable concentration around the middle of the year.
Data mainly pertains to the early 2000s, specifically between 2003 and 2005, with most entries around 2003.
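How entries are spread across years and months is easy to check by counting rows per value. The sketch below assumes the dataset carries `YEAR_ID` and `MONTH_ID` columns and uses a small hypothetical sample; counting rows, rather than averaging the year or month numbers, gives a more direct view of where the data concentrates.

```python
import pandas as pd

# Hypothetical sample; YEAR_ID and MONTH_ID are assumed column names.
sales_df = pd.DataFrame({
    "YEAR_ID": [2003, 2003, 2004, 2005],
    "MONTH_ID": [2, 5, 7, 7],
})

# Number of rows per year and per month, in calendar order.
rows_per_year = sales_df["YEAR_ID"].value_counts().sort_index()
rows_per_month = sales_df["MONTH_ID"].value_counts().sort_index()

print(rows_per_year)
print(rows_per_month)
```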
Conclusion
The initial review of the sample sales dataset has highlighted several key areas for further exploration. A detailed analysis should focus on sales performance by product line, time period, and geography. Addressing missing values and converting columns to the right data types will be essential for a more precise analysis. Continued investigation will yield deeper insights into the sales data, helping to identify significant trends and patterns.
Potential Areas for Further Analysis
Sales Performance:
- Examine trends in sales figures over various periods (quarterly, monthly, and yearly) to understand sales distribution and trends.
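As a starting point, sales can be totalled per year and quarter with a `groupby`. This is a rough sketch on hypothetical numbers, assuming `YEAR_ID`, `QTR_ID`, and `SALES` columns.

```python
import pandas as pd

# Hypothetical sample; column names are assumptions about the dataset.
sales_df = pd.DataFrame({
    "YEAR_ID": [2003, 2003, 2003, 2004],
    "QTR_ID": [1, 1, 2, 1],
    "SALES": [1000.0, 500.0, 750.0, 1200.0],
})

# Total sales for each (year, quarter) pair.
quarterly_sales = sales_df.groupby(["YEAR_ID", "QTR_ID"])["SALES"].sum()
print(quarterly_sales)
```

Swapping `QTR_ID` for a month column, or dropping it entirely, gives the monthly and yearly views.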
Product Analysis:
- Investigate the relationships between product types, prices, and sales figures.
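One possible first look is an average price and total sales figure per product line. The sketch below uses hypothetical rows and assumes `PRODUCTLINE`, `PRICEEACH`, and `SALES` columns.

```python
import pandas as pd

# Hypothetical sample; product lines and values are illustrative only.
sales_df = pd.DataFrame({
    "PRODUCTLINE": ["Motorcycles", "Motorcycles", "Classic Cars"],
    "PRICEEACH": [90.0, 80.0, 100.0],
    "SALES": [2700.0, 2400.0, 5000.0],
})

# Named aggregation: one output column per (source column, function) pair.
by_product = (
    sales_df.groupby("PRODUCTLINE")
    .agg(avg_price=("PRICEEACH", "mean"), total_sales=("SALES", "sum"))
    .sort_values("total_sales", ascending=False)
)
print(by_product)
```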
Customer Segmentation:
- Identify customer segments based on order behavior (quantity, frequency) and demographics (location).
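A very simple segmentation could start from a per-customer summary. In this sketch the customers, the column names (`CUSTOMERNAME`, `ORDERNUMBER`, `QUANTITYORDERED`, `COUNTRY`), and the "repeat" versus "one-time" rule are all illustrative assumptions, not a recommended scheme.

```python
import pandas as pd

# Hypothetical sample: customer A has two distinct orders, customer B has one.
sales_df = pd.DataFrame({
    "CUSTOMERNAME": ["A", "A", "A", "B"],
    "ORDERNUMBER": [101, 101, 102, 103],
    "QUANTITYORDERED": [30, 20, 40, 25],
    "COUNTRY": ["USA", "USA", "USA", "France"],
})

# Order frequency, total quantity, and location for each customer.
customers = sales_df.groupby("CUSTOMERNAME").agg(
    orders=("ORDERNUMBER", "nunique"),
    total_quantity=("QUANTITYORDERED", "sum"),
    country=("COUNTRY", "first"),
)

# Illustrative rule: more than one distinct order counts as a repeat customer.
customers["segment"] = customers["orders"].gt(1).map(
    {True: "repeat", False: "one-time"}
)
print(customers)
```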
Data Cleaning and Preprocessing:
- Resolve missing values, convert appropriate columns to numerical data types, and properly format the ORDERDATE column for time-series analysis.
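The cleaning steps might look roughly like this. The date format string and the choice to fill missing states with "Unknown" are assumptions made for the example; the right treatment depends on how the dates are actually stored and on what the analysis needs.

```python
import pandas as pd

# Hypothetical sample; the date format "month/day/year hour:minute" is an assumption.
sales_df = pd.DataFrame({
    "ORDERDATE": ["2/24/2003 0:00", "5/7/2003 0:00"],
    "STATE": ["NY", None],
})

# Parse ORDERDATE into a proper datetime column for time-series work.
sales_df["ORDERDATE"] = pd.to_datetime(sales_df["ORDERDATE"], format="%m/%d/%Y %H:%M")

# One option for missing categorical values: label them explicitly.
sales_df["STATE"] = sales_df["STATE"].fillna("Unknown")

print(sales_df.dtypes)
print(sales_df)
```

Once `ORDERDATE` is a datetime, accessors such as `sales_df["ORDERDATE"].dt.month` become available for grouping.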
Geographical Insights:
- Utilize the customer location data to conduct a geographical analysis of sales performance.
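A geographical view can begin with each country's share of total sales. This sketch uses hypothetical figures and assumes `COUNTRY` and `SALES` columns.

```python
import pandas as pd

# Hypothetical sample; values are illustrative only.
sales_df = pd.DataFrame({
    "COUNTRY": ["USA", "France", "USA", "Spain"],
    "SALES": [3000.0, 1000.0, 5000.0, 1000.0],
})

# Total sales per country, largest first.
sales_by_country = (
    sales_df.groupby("COUNTRY")["SALES"].sum().sort_values(ascending=False)
)

# Each country's percentage of overall sales.
share_pct = (sales_by_country / sales_by_country.sum() * 100).round(1)
print(share_pct)
```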
In conclusion, the HNG11 internship provides a great opportunity to develop data analytics skills and gain practical experience. By exploring and analyzing this dataset, we can uncover valuable insights that can drive business decisions and strategies.
Thanks for reading!
Written by Pauline B.