Real-World ML - Dealing with Selection Bias

Recently, I was helping to improve the accuracy of an ML model in production and faced a difficult challenge.

No matter what robust ML algorithm we selected or how we fine-tuned the hyperparameters, the accuracy against real-world data just wouldn't improve.

It felt like we were hitting an invisible wall: burning time and wasting computing resources on every retraining run.

Then we took a step back to the very beginning: the data collection process. And everything clicked into place.

Sound familiar?

Selection bias can be a silent killer. It can distort your data, leading to inaccurate models.

Have you ever encountered a similar scenario in your projects? It’s frustrating, isn't it?

In this article, I will delve into selection bias and explore four effective random sampling strategies to mitigate its effect. 👇

Understanding Selection Bias

Selection bias arises when the data collection process systematically favors certain individuals or data points over others, producing a skewed, unrepresentative sample of the population.

In the context of ML, selection bias can skew your models and render them ineffective.

This bias can occur due to various factors, such as sampling methods, data accessibility, or human biases in data selection.

The presence of selection bias can have significant consequences for the performance and generalizability of ML models.

Imagine you are building a model to predict customer churn for a subscription-based service.

If your data collection process only includes customers who have actively reached out to customer support, you may be introducing selection bias.

These customers are more likely to be dissatisfied or experiencing issues, leading to an overestimation of churn rates and a model that does not accurately represent the entire customer base.
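To make this concrete, here is a toy sketch with made-up numbers (the contacted_support and churned columns are purely illustrative) showing how sampling only customers who contacted support inflates the apparent churn rate:

import pandas as pd

# Toy data: both columns are hypothetical, for illustration only
customers = pd.DataFrame({
    'contacted_support': [True, True, False, False, False, False, True, False],
    'churned':           [True, False, False, False, True, False, True, False],
})

# Churn rate over everyone vs. only those who contacted support
population_rate = customers['churned'].mean()
biased_rate = customers.loc[customers['contacted_support'], 'churned'].mean()

print(f"Population churn rate: {population_rate:.0%}")      # 38%
print(f"Support-contacts churn rate: {biased_rate:.0%}")    # 67%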

It's crucial to recognize and address this issue early in the data collection phase.

By ensuring that your sample accurately represents the population, you enhance the validity and reliability of your models.

The Importance of Random Sampling Techniques

To mitigate selection bias, it is essential to employ random sampling techniques.

Random sampling ensures that every member of the population has an equal chance of being selected, reducing the likelihood of systematic biases.

By randomly selecting data points, we aim to obtain a representative sample that accurately reflects the characteristics and diversity of the population.

There are several popular random sampling strategies that can be used to collect data for machine learning projects.

Each strategy has its advantages and disadvantages, and the choice depends on the specific requirements and constraints of the project.

Let's explore these strategies in detail.


Simple Random Sampling (SRS)

What is Simple Random Sampling?

Simple Random Sampling (SRS) is the most straightforward and basic sampling strategy.

It involves randomly selecting a subset of individuals or data points from the population, giving every member an equal probability of being chosen.

SRS is effective in minimizing selection bias, as it ensures an unbiased representation of the population.

SRS is particularly suitable when the population is homogeneous and there are no specific subgroups of interest.

For example, if you want to randomly select a subset of customers from a customer database to conduct a survey, SRS can be an appropriate choice.

Implementation and Code

Here's an example of how to perform SRS using Python and pandas:

import pandas as pd

# Assuming `df` is your DataFrame
df_random_sample = df.sample(frac=0.1, random_state=42)  # 10% random sample
print(df_random_sample.head())
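If you need a fixed number of rows rather than a fraction, pandas' sample also accepts an n parameter; this is just a variant of the same call:

# Alternatively, draw a fixed number of rows (here 1,000) at random
df_random_sample = df.sample(n=1000, random_state=42)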

Advantages of SRS

  • Easy to implement and understand.

  • Reduces bias, as every member of the population has an equal chance of selection.

Disadvantages of SRS

  • May not capture rare subgroups effectively, especially in highly diverse populations (see the sketch after this list).

  • Can be inefficient if the population is large and difficult to enumerate.
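Here is a small toy sketch of that first drawback (the group column and proportions are invented for illustration): with a 1% rare subgroup, a modest simple random sample can easily contain none of it.

import numpy as np
import pandas as pd

# Toy population: 1% of rows belong to a rare subgroup
rng = np.random.default_rng(0)
df_toy = pd.DataFrame({'group': rng.choice(['common', 'rare'], size=1000, p=[0.99, 0.01])})

# A 50-row simple random sample may miss the rare group entirely
srs = df_toy.sample(n=50, random_state=0)
print(srs['group'].value_counts())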

Use Cases

  • Quality Control: A manufacturing company randomly selects products from a production line to test for quality assurance.

  • Opinion Polls: Polling organizations randomly select individuals from a voter registry to predict election outcomes.

  • Medical Research: Selecting a random sample of patients for a clinical trial.

Stratified Sampling

What is Stratified Sampling?

Stratified sampling is a strategy that involves dividing the population into subgroups or strata based on relevant characteristics such as age, gender, or location.

Once the population is stratified, a random sample is selected from each stratum, ensuring that the sample is representative of the population's diversity.

Stratified sampling is particularly useful when there are known subgroups within the population that need to be accurately represented.

By sampling from each stratum proportionally, we can address selection bias and ensure that the sample reflects the population's composition.

For example, if you are conducting a medical research study and want to ensure representation from different age groups and genders, stratified sampling can be employed.

Implementation and Code

Here's an example of how to perform stratified sampling using Python and scikit-learn:

import pandas as pd
from sklearn.model_selection import train_test_split

# Assuming `df` is your DataFrame and `target` is the stratification column
X = df.drop(columns='target')
y = df['target']

# Hold out a 10% sample whose `target` distribution matches the full dataset
_, X_stratified_sample, _, y_stratified_sample = train_test_split(
    X, y, test_size=0.1, stratify=y, random_state=42
)

stratified_sample = pd.concat([X_stratified_sample, y_stratified_sample], axis=1)
print(stratified_sample.head())
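As an alternative sketch (assuming pandas 1.1 or newer), you can draw 10% from each stratum directly with groupby, which avoids the train/test split detour:

# Alternative: sample 10% from each stratum directly
stratified_sample_alt = (
    df.groupby('target', group_keys=False)
      .sample(frac=0.1, random_state=42)
)

# The class proportions should closely match the full dataset
print(stratified_sample_alt['target'].value_counts(normalize=True))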

Advantages of Stratified Sampling

  • Ensures representation from all subgroups, making it more representative than simple random sampling.

  • Useful for highlighting specific subgroups and comparing them.

Disadvantages of Stratified Sampling

  • Requires detailed knowledge of the population to define strata accurately.

  • More complex to implement and analyze compared to SRS.

Use Cases

  • Healthcare Research: Ensuring representation from different age groups, genders, and ethnicities to study the effects of a new drug.

  • Market Research: Ensuring feedback from various market segments, such as different income levels, regions, and consumer types.

  • Election Polling: Selecting a representative sample of voters from different demographic groups.

Cluster Sampling

What is Cluster Sampling?

Cluster sampling involves dividing the population into smaller groups or clusters and randomly selecting entire clusters rather than individual members.

This strategy is useful when the population is geographically dispersed or when it's difficult to obtain a complete list of the population.

This approach can be cost-effective and efficient, especially when the clusters are heterogeneous.

For example, if you are conducting a study on school performance, you can randomly select entire schools (clusters) and include all students within the selected schools in the sample.

Implementation and Code

Here's an example of how to perform cluster sampling using Python and NumPy:

import numpy as np

# Assuming `df` has a column `cluster` that defines the clusters
clusters = df['cluster'].unique()
selected_clusters = np.random.choice(clusters, size=2, replace=False)  # Select 2 clusters at random

# Keep every row belonging to one of the selected clusters
sample_indexes = df['cluster'].isin(selected_clusters)
cluster_sample = df[sample_indexes]
print(cluster_sample.head())

Advantages of Cluster Sampling

  • Cost-effective and efficient for large, geographically dispersed populations.

  • Reduces travel and administrative costs associated with data collection.

Disadvantages of Cluster Sampling

  • Higher risk of bias if the clusters are not representative of the population.

  • Can introduce intra-cluster correlation, affecting the independence of observations.

Use Cases

  • Geographical Studies: Conducting a survey on water usage by selecting entire towns or regions as clusters.

  • Epidemiological Studies: Studying disease prevalence by selecting entire villages or neighborhoods.

  • School Performance Evaluations: Evaluating teaching methods by selecting entire schools as clusters.

Systematic Sampling

What is Systematic Sampling?

Systematic sampling involves selecting individuals from a population at regular intervals based on a predetermined pattern.

This strategy selects every nth data point from a list, ensuring a consistent and predictable sampling pattern.

Systematic sampling is efficient and easy to implement, making it a popular choice in various domains.

It can provide a representative sample if the population list is randomly ordered.

However, it's important to be cautious of potential periodicity bias if there are underlying patterns in the population list.

For example, if you want to select every 10th customer from a customer list to participate in a study, systematic sampling can be employed.

Implementation and Code

Here's an example of how to perform systematic sampling using Python and NumPy:

import numpy as np

# Assuming `df` is your DataFrame
sample_size = int(len(df) * 0.1)  # 10% sample
interval = len(df) // sample_size
random_start = np.random.randint(0, interval)

systematic_sample = df.iloc[random_start::interval]
print(systematic_sample.head())
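If you suspect the row order carries a periodic pattern (and you can afford to reorder the list), a common mitigation is to shuffle before taking every nth row. This is a sketch of that idea, not part of the recipe above:

# Shuffle first to break any periodic pattern in the row order
shuffled = df.sample(frac=1, random_state=42).reset_index(drop=True)
systematic_sample = shuffled.iloc[random_start::interval]
print(systematic_sample.head())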

Advantages of Systematic Sampling

  • Simple to implement and understand.

  • Ensures a spread across the population if the list is random.

Disadvantages of Systematic Sampling

  • Risk of periodicity bias if there is an underlying pattern in the population list.

  • Not entirely random, as the selection process follows a fixed interval.

Use Cases

  • Production Line Sampling: Checking every 10th item on a production line for defects.

  • Website User Experience: Selecting every 50th user session to analyze website performance and user experience.

  • Library Book Checks: Assessing the condition of a library collection by selecting every 20th book from the catalog.

Addressing Selection Bias in Practice

When dealing with selection bias in a machine learning project, it's crucial to carefully consider the data collection process and employ appropriate sampling techniques.

Here are some practical steps to address selection bias:

  • Define the target population: Clearly identify the population of interest and the desired characteristics of the sample.

  • Choose an appropriate sampling strategy: Based on the project requirements, population characteristics, and available resources, select a suitable sampling strategy (e.g., SRS, stratified sampling, cluster sampling, or systematic sampling).

  • Ensure representativeness: Strive to obtain a representative sample that accurately reflects the diversity and composition of the population. Consider factors such as demographics, geographic distribution, and relevant subgroups.

  • Monitor and assess bias: Regularly assess the collected data for potential biases and take corrective actions if necessary. Conduct exploratory data analysis to identify any skewed distributions or underrepresented subgroups (a minimal check is sketched after this list).

  • Validate and iterate: Validate the performance of your machine learning models on independent test sets to assess their generalizability. Iterate and refine the data collection process if needed to address any identified biases.
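As a minimal sketch of that monitoring step (the variable and column names here are hypothetical), you can compare a feature's distribution in your sample against the full population:

import pandas as pd

# Assuming `df` is the full population, `sample_df` is your collected sample,
# and `region` is an illustrative categorical column
population_dist = df['region'].value_counts(normalize=True)
sample_dist = sample_df['region'].value_counts(normalize=True)

comparison = pd.DataFrame({'population': population_dist, 'sample': sample_dist}).fillna(0)
comparison['difference'] = comparison['sample'] - comparison['population']

# Large absolute differences flag under- or over-represented groups
print(comparison.sort_values('difference'))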

Conclusion

Dealing with selection bias is crucial in the data collection phase of a machine learning project.

By understanding the concept of selection bias and employing appropriate random sampling strategies, we can mitigate its effects and ensure the integrity and fairness of our models.

Simple Random Sampling, Stratified Sampling, Cluster Sampling, and Systematic Sampling each offer unique advantages and challenges.

Choosing the right strategy depends on your project's requirements, constraints, and the characteristics of your population.

By mitigating selection bias, you pave the way for more accurate and reliable machine learning models.

Happy sampling!

PS:

If you like this article, share it with others ♻️

Would help a lot ❤️

And feel free to follow me for more articles like this.
