Sampling on Datasets
Once you’ve created a dataset, you want to explore the values inside.
Exploring very large datasets can be difficult, because even simple operations can be expensive in both computational resources and time. Sampling addresses this by working on a subset of the rows. Its main purpose is to provide immediate visual feedback while you explore and prepare the dataset, no matter how large it may be. The same sampling principle applies to visualization (Charts), data preparation, and statistical analyses (Statistics).
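Although taking the first 10,000 rows is the fastest sampling method, the sample may be biased depending on the composition of the dataset. For example, if the rows are stored in sorted order, the first rows only represent one end of that ordering.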
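To make the bias concrete, here is a minimal sketch in plain Python. The dataset, its `year` column, and the helper names `first_n` and `random_n` are hypothetical, and the rows are assumed to be stored sorted by year, which is the situation where taking the first rows goes wrong.

```python
import random

# Hypothetical dataset: 100,000 rows stored sorted by "year",
# 5,000 rows for each year from 2000 to 2019.
rows = [{"id": i, "year": 2000 + i // 5000} for i in range(100_000)]

def first_n(data, n):
    """Head sampling: take the first n rows exactly as stored."""
    return data[:n]

def random_n(data, n, seed=0):
    """Random sampling: every row has the same chance of being chosen."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    return rng.sample(data, n)

head_years = {row["year"] for row in first_n(rows, 10_000)}
rand_years = {row["year"] for row in random_n(rows, 10_000)}

print(sorted(head_years))  # [2000, 2001]
print(len(rand_years))     # 20
```

The first 10,000 rows see only two of the twenty years, while a random sample of the same size covers all of them. The price is that random sampling generally has to look at the whole dataset, so it is slower.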
Depending on your needs, several other sampling methods are available, such as random sampling, stratified sampling, or class rebalancing.
Some common sampling methods include:
Random Sampling: Selecting data points randomly from the dataset, giving each data point an equal chance of being chosen.
Stratified Sampling: Dividing the dataset into different strata or groups based on certain characteristics and then randomly sampling from each stratum. This ensures representation from each subgroup.
Systematic Sampling: Selecting every nth item from the dataset after an initial random start. This method is useful when there is an inherent order or structure in the data.
Cluster Sampling: Dividing the dataset into clusters, randomly selecting some clusters, and then sampling all data points within the selected clusters.
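The four methods above can be sketched with Python's standard library. This is an illustrative sketch, not a reference implementation: the dataset, its `group` and `store` columns, and all function names are assumptions made up for the example.

```python
import random
from collections import defaultdict

rng = random.Random(42)  # fixed seed so the run is repeatable

# Hypothetical dataset: 1,000 rows. "group" plays the role of a stratum
# (900 rows in A, 100 in B) and "store" the role of a cluster (20 stores, 50 rows each).
rows = [
    {"id": i, "group": "A" if i < 900 else "B", "store": i // 50}
    for i in range(1000)
]

def random_sample(data, n):
    """Random sampling: each row has an equal chance of being chosen."""
    return rng.sample(data, n)

def stratified_sample(data, key, fraction):
    """Stratified sampling: draw the same fraction at random from every stratum."""
    strata = defaultdict(list)
    for row in data:
        strata[row[key]].append(row)
    sample = []
    for members in strata.values():
        k = max(1, round(len(members) * fraction))  # at least one row per stratum
        sample.extend(rng.sample(members, k))
    return sample

def systematic_sample(data, step):
    """Systematic sampling: random start, then every step-th row."""
    start = rng.randrange(step)
    return data[start::step]

def cluster_sample(data, key, n_clusters):
    """Cluster sampling: pick whole clusters at random and keep all their rows."""
    clusters = defaultdict(list)
    for row in data:
        clusters[row[key]].append(row)
    chosen = rng.sample(sorted(clusters), n_clusters)
    return [row for c in chosen for row in clusters[c]]

simple = random_sample(rows, 100)
strat = stratified_sample(rows, "group", 0.1)
syst = systematic_sample(rows, 10)
clus = cluster_sample(rows, "store", 2)

print(len(simple), len(strat), len(syst), len(clus))  # 100 100 100 100
print(sum(row["group"] == "B" for row in strat))      # 10
```

All four samples contain 100 rows, but only the stratified one guarantees that the small group B contributes exactly its 10% share. The cluster sample may contain no B rows at all, depending on which stores are drawn.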
Example: Imagine you want to estimate the number of cars in a large car park that spans 50 acres. Instead of counting all the cars, you can count the number of cars in 1 acre and multiply by 50 to estimate the total. Alternatively, you could count the cars in half an acre and multiply by 100 for the same purpose. Either estimate is only reliable if the cars are spread roughly evenly across the car park.
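The arithmetic behind the car park example fits in a few lines. The function name `estimate_total` and the counts (120 cars in 1 acre, 60 cars in half an acre) are made-up values for illustration, and the sketch assumes the cars are spread evenly over the whole area.

```python
def estimate_total(sample_count, sample_area, total_area):
    """Scale a count from a sampled area up to the whole area,
    assuming the density is the same everywhere."""
    density = sample_count / sample_area  # cars per acre
    return density * total_area

# Hypothetical counts: 120 cars in 1 acre, or 60 cars in half an acre.
print(estimate_total(120, 1, 50))    # 6000.0
print(estimate_total(60, 0.5, 50))   # 6000.0
```

Both samples lead to the same estimate here only because the made-up counts have the same density. With real counts, the smaller sample is quicker to collect, but its estimate is likely to vary more.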
Written by
Dharshini Sankar Raj