The Critical Error I Made When Observing My Data (Don't Make the Same Mistake)

Enis Jesugbogo

A dataset is not always what it seems at first glance. You have to perform some basic analysis to properly understand the data before drawing any conclusions.

This was my experience with the Retail Dataset on Kaggle, a task I undertook with the HNG Internship 2024, to help me improve my data interpretation and communication skills.

The dataset displays the retail sales of a business, detailing the order number, date, price of each unit, and customer details.

At first glance, I assumed that the order numbers were unique values and hence there would be no repetition. Time to draw charts and get this over with. Or so I thought.

Another column named “order line number” made me think twice.

Upon further examination, and with some help from ChatGPT (a constant life saver, by the way) in defining the columns, I observed that:

  • The order number represents each order in the dataset, while the order line number represents the position of a specific product in that order number.
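
A quick way to test the "order numbers are unique" assumption is to count how many rows share each order number. The sketch below uses plain Python and a few made-up rows (the field names and values are hypothetical, not taken from the Kaggle file), so treat it as an illustration of the check rather than the exact steps I followed in Excel.

```python
from collections import Counter

# Hypothetical rows for illustration only; not values from the Kaggle dataset.
rows = [
    {"order_number": 1, "order_line_number": 1, "sales": 100.0},
    {"order_number": 1, "order_line_number": 2, "sales": 250.0},
    {"order_number": 2, "order_line_number": 1, "sales": 75.5},
]

# Count how many rows share each order number.
rows_per_order = Counter(row["order_number"] for row in rows)

# If any order number appears more than once, it cannot identify a single row.
order_numbers_are_unique = all(n == 1 for n in rows_per_order.values())

# The pair (order number, order line number) is a better candidate for a row key.
line_keys = Counter((row["order_number"], row["order_line_number"]) for row in rows)
line_keys_are_unique = all(n == 1 for n in line_keys.values())

print(order_numbers_are_unique)  # False
print(line_keys_are_unique)      # True
```

If the first check comes back false, each row is a line item within an order, not a whole order.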

Hence, when the rows are sorted and grouped by order number in Excel, each order number gathers everything a customer bought in that order.

For example, order number 10100 contains 4 products. The customer, named Young, made a total purchase of $12,133.25.
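
The same grouping can be sketched outside Excel. This is a minimal example, assuming each row is one line item with a sales amount. The rows and field names are hypothetical stand-ins, so the totals here are not the dataset's real figures.

```python
from collections import defaultdict

# Hypothetical line items for illustration only; not values from the Kaggle dataset.
rows = [
    {"order_number": 1, "order_line_number": 1, "sales": 100.0},
    {"order_number": 1, "order_line_number": 2, "sales": 250.0},
    {"order_number": 2, "order_line_number": 1, "sales": 75.5},
]

order_totals = defaultdict(float)  # total sales per order
line_counts = defaultdict(int)     # number of products (line items) per order

for row in rows:
    order_totals[row["order_number"]] += row["sales"]
    line_counts[row["order_number"]] += 1

for order_number in sorted(order_totals):
    print(order_number, line_counts[order_number], order_totals[order_number])
```

Each printed line shows an order number, how many products it contains, and the order total, which is the same summary the sort and group step produces.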

This realization laid the foundation for other observations.

With the help of a pivot table, I visualized the total sales across all orders for each year.

  • From the chart, sales in 2004 rose by 14% over 2003. However, the figure dropped by about 45% in 2005. On further exploration, I discovered that the data for 2005 stops in May. This suggests that the dataset was released around May 2005, so the 2005 total covers only part of the year and cannot be compared directly with the full years before it.
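
Before trusting a year-over-year drop, it helps to check whether every year in the data is complete. The sketch below totals sales per year, records the last month seen in each year, and flags years that stop before December. The rows are hypothetical and chosen only to show the pattern; they do not reproduce the 14% and 45% figures above.

```python
from collections import defaultdict
from datetime import date

# Hypothetical rows for illustration only; not values from the Kaggle dataset.
rows = [
    {"order_date": date(2003, 3, 10), "sales": 400.0},
    {"order_date": date(2003, 12, 2), "sales": 600.0},
    {"order_date": date(2004, 6, 15), "sales": 500.0},
    {"order_date": date(2004, 12, 1), "sales": 700.0},
    {"order_date": date(2005, 2, 20), "sales": 300.0},
    {"order_date": date(2005, 5, 5), "sales": 300.0},
]

yearly_sales = defaultdict(float)  # total sales per year
last_month = {}                    # latest month that has data, per year

for row in rows:
    year = row["order_date"].year
    yearly_sales[year] += row["sales"]
    last_month[year] = max(last_month.get(year, 0), row["order_date"].month)

# Percentage change of each year relative to the year before it.
years = sorted(yearly_sales)
pct_change = {
    cur: (yearly_sales[cur] - yearly_sales[prev]) * 100 / yearly_sales[prev]
    for prev, cur in zip(years, years[1:])
}

# Years whose data stops before December are only partial.
partial_years = [year for year in years if last_month[year] < 12]

print(pct_change)
print(partial_years)
```

A large negative change that lines up with a year in `partial_years` is a sign that the drop comes from missing months, not from the business itself.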

A quick and final observation was that:

  • The dataset contains different types of data. Columns such as sales and price each were numerical, while columns such as date, names, product, etc. were categorical.
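
A rough way to make the same split programmatically is to look at the values in each column. In this sketch a column counts as numerical if every value is a number, and categorical otherwise. The column names and values are hypothetical, and dates are kept as text, as they would be when read straight from a CSV file.

```python
# Hypothetical rows for illustration only; not values from the Kaggle dataset.
rows = [
    {"sales": 100.0, "price_each": 50.0, "order_date": "2003-02-24", "customer_name": "Customer A"},
    {"sales": 250.0, "price_each": 125.0, "order_date": "2003-05-07", "customer_name": "Customer B"},
]

def column_kind(values):
    """Label a column 'numerical' if every value is a number, else 'categorical'."""
    if all(isinstance(value, (int, float)) for value in values):
        return "numerical"
    return "categorical"

# Classify each column using the values found across all rows.
column_kinds = {
    name: column_kind([row[name] for row in rows])
    for name in rows[0]
}

for name, kind in column_kinds.items():
    print(name, kind)
```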

Conclusion

This was a very basic analysis. A more in-depth analysis could show which products brought in the most sales, and why.

We could also analyze how often each customer bought products, so that loyal customers can be rewarded and appreciated at the end of each year.

Credit goes to the HNG family for making me do this in the first place. Cheers!
