A/B Testing and Experimentation Checklist
Embarking on an experimentation journey requires meticulous planning and a well-defined roadmap to ensure the reliability of results. In this guide, we'll explore the critical steps involved in crafting a successful experiment, from formulating a hypothesis to analyzing results. Let's dive into the key considerations and methodologies for creating a robust experimental design.
Pre-Experiment
Formulating a Falsifiable Hypothesis
Begin by crafting a hypothesis that is simple, clear, and falsifiable. This sets the foundation for your experiment, allowing the data to either support or refute the stated hypothesis.
Choosing Metrics
Define success metrics that align with your experiment's objectives. Additionally, identify guardrail metrics, data quality metrics, and user engagement metrics for continuous monitoring and quality checks throughout the experiment.
Calculating Metric Sensitivity
Assess metric sensitivity by analyzing past experiment data and how often the metric has moved in response to real changes. Consider selecting a population subset that is more likely to be affected, which increases sensitivity and keeps your metrics responsive to changes.
Selecting Robust Metrics
Opt for metrics that capture meaningful signals without being overly sensitive to irrelevant fluctuations. This ensures that the metrics selected are truly reflective of the experiment's impact.
Pre-Experiment Bias Check
Conduct an A/A Test using pre-experiment data to identify and include users with minimal bias, ensuring the success metric is not distorted by external factors.
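As a rough sketch of this check (the simulated pre-period metric and all variable names are assumptions, not a specific pipeline), one can repeatedly split pre-experiment users into two dummy groups and verify that the resulting p-values look uniform; a skewed distribution suggests biased assignment or a problematic metric:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

# Hypothetical pre-experiment metric, one value per user (e.g., sessions last week)
pre_metric = rng.gamma(shape=2.0, scale=3.0, size=10_000)

# Run many simulated A/A splits on the pre-period data
p_values = []
for _ in range(1_000):
    assignment = rng.integers(0, 2, size=pre_metric.size).astype(bool)
    _, p = stats.ttest_ind(pre_metric[assignment], pre_metric[~assignment], equal_var=False)
    p_values.append(p)

# Under no bias, A/A p-values should be roughly uniform on [0, 1]
ks_stat, ks_p = stats.kstest(p_values, "uniform")
print(f"KS test against uniform: stat={ks_stat:.3f}, p={ks_p:.3f}")
print(f"Share of p-values below 0.05: {np.mean(np.array(p_values) < 0.05):.3f}")  # expect ~5%
```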
Unit of Randomization
Carefully choose the unit of randomization (User_id, Cookie_id, Session_id, Device_id, IP Address) based on the experiment's nature and objectives.
Stable Unit Treatment Value Assumption
Verify the assumption of non-interference among users. If violated, consider using Ego Clusters, Switchback design, or Geo Cluster for randomization to preserve the integrity of the experiment.
Variance Mismatch Check
Employ A/A Tests and bootstrapping to detect variance mismatches. Utilize the Delta method for correction if disparities between empirical and analytically derived variances are found.
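To make the idea concrete, here is a sketch on simulated data (the per-user views/clicks and the user heterogeneity are made up) comparing the naive per-event variance of a click-through-rate metric with a delta-method variance that respects user-level randomization:

```python
import numpy as np

rng = np.random.default_rng(0)
n_users = 50_000

# Hypothetical per-user data: views per user and a user-specific click propensity
views = rng.poisson(lam=5, size=n_users) + 1   # at least one view per user
p_user = rng.beta(2, 18, size=n_users)         # heterogeneous users, mean ~0.10
clicks = rng.binomial(views, p_user)

ctr = clicks.sum() / views.sum()               # ratio metric: clicks per view

# Naive variance: treats every view as an independent Bernoulli trial
naive_var = ctr * (1 - ctr) / views.sum()

# Delta-method variance: respects that the user is the randomization unit
y_bar, x_bar = clicks.mean(), views.mean()
var_y, var_x = clicks.var(ddof=1), views.var(ddof=1)
cov_xy = np.cov(clicks, views, ddof=1)[0, 1]
delta_var = (var_y / x_bar**2
             - 2 * y_bar * cov_xy / x_bar**3
             + (y_bar**2) * var_x / x_bar**4) / n_users

print(f"naive SE: {naive_var**0.5:.6f}")
print(f"delta-method SE: {delta_var**0.5:.6f}")  # typically larger, flagging the mismatch
```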
Variance Reduction Techniques
- Focus the analysis only on users directly affected by the experiment (Triggered Analysis).
- Cap (winsorize) the metric or apply a log transformation (a sketch follows this list).
- Remove outliers by looking for bumps in the p-value distribution of an A/A Test.
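A minimal sketch of the capping and log-transform ideas, using a simulated heavy-tailed revenue-per-user metric (the 99th-percentile cap is an arbitrary choice for illustration):

```python
import numpy as np

rng = np.random.default_rng(7)

# Hypothetical heavy-tailed metric, e.g., revenue per user
revenue = rng.lognormal(mean=2.0, sigma=1.5, size=100_000)

# Option 1: cap (winsorize) the metric at a high percentile
cap = np.percentile(revenue, 99)
revenue_capped = np.minimum(revenue, cap)

# Option 2: log transformation to compress the tail
revenue_log = np.log1p(revenue)

for name, x in [("raw", revenue), ("capped", revenue_capped), ("log1p", revenue_log)]:
    print(f"{name:7s} mean={x.mean():8.2f}  std={x.std(ddof=1):8.2f}")
```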
Practical Significance Level
Determine the effect size or practical significance level that the experiment aims to capture. This guides the experiment's design and analysis, aligning it with real-world impact.
Sample Size Considerations
Use a sample size calculator based on statistical significance level (alpha) and power (1-beta). Run the experiment for at least one week to account for weekday and weekend effects.
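For example, a sample-size calculation for a conversion-rate experiment can be sketched with statsmodels; the 10% baseline rate and one-percentage-point minimum detectable effect below are made-up inputs:

```python
from statsmodels.stats.power import NormalIndPower
from statsmodels.stats.proportion import proportion_effectsize

# Hypothetical inputs: 10% baseline conversion, 11% target (1 pp minimum detectable effect)
effect_size = proportion_effectsize(0.10, 0.11)

n_per_group = NormalIndPower().solve_power(
    effect_size=effect_size,
    alpha=0.05,        # significance level
    power=0.80,        # 1 - beta
    ratio=1.0,         # equal-sized control and treatment
    alternative="two-sided",
)
print(f"Required sample size per group: {n_per_group:,.0f}")
```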
Multiple Testing Strategies
Address multiple-testing challenges by recalculating the sample size, adopting the Bonferroni Correction or Tukey's Honestly Significant Difference test, or assigning different significance levels based on the strength of belief in the null hypothesis.
For example: 0.05 for expected impacts, 0.01 for potential impacts, and 0.001 for unlikely impacts.
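As an illustration of the Bonferroni approach (the raw p-values below are made up), statsmodels can apply the correction across all metrics tested in the same experiment:

```python
from statsmodels.stats.multitest import multipletests

# Hypothetical raw p-values for several metrics tested in the same experiment
p_values = [0.012, 0.034, 0.21, 0.0009, 0.048]

reject, p_adjusted, _, _ = multipletests(p_values, alpha=0.05, method="bonferroni")

for p, p_adj, sig in zip(p_values, p_adjusted, reject):
    print(f"raw p={p:.4f}  adjusted p={p_adj:.4f}  significant={sig}")
```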
During Experiment
Implement Real-time Alerts and Auto Shutdown Feature:
Set up real-time alerts and utilize auto-shutdown features to promptly respond to any anomalies during the experiment.
Gradual Ramp-Up Considering Speed, Quality, and Risk:
Gradually increase the experiment's scale while balancing the speed of implementation, maintaining data quality, and mitigating associated risks.
- Roll out the experiment internally to team members and beta users to gather qualitative feedback on its performance and usability.
- Begin with a 5% user exposure and closely monitor user engagement metrics for any immediate impact.
- Exercise caution when increasing user exposure to 50%, especially if there's a potential for a Novelty/Primacy effect; consider allowing more time at this stage.
- Proceed to a full-scale rollout to 100% of users once confident in the experiment's stability and performance.
Establish a Long-term Holdout Group:
- Maintain a long-term holdout group for future comparisons, rerunning the experiment at a later date to gain more confidence and assess the impact on key metrics like retention.
Ongoing Sample Ratio Mismatch Checks:
- Continuously monitor for sample ratio mismatches within different user segments to ensure a representative and unbiased sample.
Consideration for P-value Peeking:
- If p-value peeking is allowed, ensure a statistical boundary (for example, a sequential testing or alpha-spending boundary) is in place to maintain the integrity of the experiment and prevent unwarranted influence on decision-making.
After Experiment
Sample Ratio Mismatch - Chi-Squared Test (a code sketch follows the checks below)
- Ramp-up bug: Be vigilant for any issues during the ramp-up phase that may skew user assignment or introduce biases.
- Parallel Experiments: Be cautious about running parallel experiments, as they can sometimes disrupt user assignment.
- Dynamic Segmentation Rule: When employing dynamic user fields for segmentation, ensure the rule's stability to maintain consistency in user grouping.
- Bot Classification for Variants: Investigate whether traffic in one of the variants is erroneously classified as bot traffic, potentially affecting the experiment outcomes.
- Device or Browser Issues: Address any issues related to specific devices or browsers that may impact user experience and metrics.
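A minimal sketch of the Chi-Squared check mentioned above, assuming an intended 50/50 split and hypothetical observed counts:

```python
from scipy.stats import chisquare

# Hypothetical observed user counts per variant under an intended 50/50 split
observed = [50_420, 49_380]
total = sum(observed)
expected = [total / 2, total / 2]

stat, p_value = chisquare(f_obs=observed, f_exp=expected)

# A very small p-value (e.g., < 0.001) signals a likely sample ratio mismatch
print(f"chi-squared={stat:.2f}, p={p_value:.4f}")
```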
Guardrail/Invariant Metrics Check:
Examine if there are changes in guardrail or invariant metrics, as these should remain stable to ensure the validity of the experiment.
Statistical Tests Based on Chosen Metrics:
Depending on the chosen metric, conduct the appropriate statistical test, such as a t-test for continuous metrics or a two-proportion z-test for conversion-style metrics.
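For instance, both tests can be sketched on simulated data (the array names, sample sizes, and effect sizes are assumptions): Welch's t-test for a continuous metric and a two-proportion z-test for a conversion metric.

```python
import numpy as np
from scipy import stats
from statsmodels.stats.proportion import proportions_ztest

rng = np.random.default_rng(1)

# Continuous metric (e.g., session length): Welch's t-test
control = rng.normal(loc=10.0, scale=4.0, size=20_000)
treatment = rng.normal(loc=10.2, scale=4.0, size=20_000)
t_stat, t_p = stats.ttest_ind(treatment, control, equal_var=False)
print(f"t-test: t={t_stat:.2f}, p={t_p:.4f}")

# Binary metric (e.g., conversion): two-proportion z-test
conversions = np.array([2_150, 2_020])   # treatment, control successes
exposed = np.array([20_000, 20_000])     # users per variant
z_stat, z_p = proportions_ztest(count=conversions, nobs=exposed)
print(f"z-test: z={z_stat:.2f}, p={z_p:.4f}")
```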
Confidence Interval and P-value Calculation:
If p-value is less than the significance level and the confidence interval doesn't contain zero, conclude statistical significance.
If the lower end of the confidence interval is greater than the practical significance level, conclude practical significance.
Note that two 95% confidence intervals can overlap by up to about 29% while the difference between the groups is still statistically significant, so overlap alone does not imply a non-significant result.
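These decision rules translate directly into code; the sketch below uses simulated data and a hypothetical practical significance level of 0.3 to build a 95% confidence interval for the difference in means and apply both checks:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
control = rng.normal(loc=10.0, scale=4.0, size=20_000)
treatment = rng.normal(loc=10.5, scale=4.0, size=20_000)

alpha = 0.05
practical_significance = 0.3   # hypothetical minimum effect worth launching

diff = treatment.mean() - control.mean()
se = np.sqrt(treatment.var(ddof=1) / treatment.size + control.var(ddof=1) / control.size)
z = stats.norm.ppf(1 - alpha / 2)
ci_low, ci_high = diff - z * se, diff + z * se

statistically_significant = ci_low > 0 or ci_high < 0
practically_significant = ci_low > practical_significance

print(f"diff={diff:.3f}, 95% CI=({ci_low:.3f}, {ci_high:.3f})")
print(f"statistically significant: {statistically_significant}")
print(f"practically significant:   {practically_significant}")
```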
CUPED Adjustment:
Consider using Controlled Experiment Using Pre-Experiment Data (CUPED) to reduce the size of confidence intervals, particularly useful with smaller sample sizes and when higher statistical power is required.
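A minimal CUPED sketch on simulated data (the pre-period covariate and its correlation with the in-experiment metric are assumptions): the adjusted metric is y - theta * (x_pre - mean(x_pre)), with theta = cov(y, x_pre) / var(x_pre).

```python
import numpy as np

rng = np.random.default_rng(5)
n = 50_000

# Hypothetical pre-experiment covariate and correlated in-experiment metric
x_pre = rng.normal(loc=10.0, scale=3.0, size=n)
y = 2.0 + 0.8 * x_pre + rng.normal(scale=2.0, size=n)

# CUPED adjustment
theta = np.cov(y, x_pre, ddof=1)[0, 1] / x_pre.var(ddof=1)
y_cuped = y - theta * (x_pre - x_pre.mean())

print(f"variance before CUPED: {y.var(ddof=1):.3f}")
print(f"variance after  CUPED: {y_cuped.var(ddof=1):.3f}")   # should shrink substantially
```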
Optional Sign Test:
Use the Sign Test to assess whether metric changes are consistently in one direction, for example across days or segments.
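For example, a sign test over daily treatment-minus-control deltas (the deltas below are made up) reduces to a binomial test against a 50/50 null:

```python
from scipy.stats import binomtest

# Hypothetical daily deltas (treatment minus control) over a two-week experiment
daily_deltas = [0.4, 0.1, -0.2, 0.3, 0.5, 0.2, 0.1, -0.1, 0.6, 0.3, 0.2, 0.4, -0.3, 0.5]

positives = sum(d > 0 for d in daily_deltas)
result = binomtest(positives, n=len(daily_deltas), p=0.5, alternative="greater")
print(f"{positives}/{len(daily_deltas)} days positive, p={result.pvalue:.4f}")
```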
Simpson's Paradox Check:
If subset segments show metric movements in a different direction from the overall results, investigate for a potential Simpson's Paradox effect.
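A quick segment-level check with pandas (segment names and counts are hypothetical, constructed so both segments improve while the pooled lift is negative) makes this concrete:

```python
import pandas as pd

# Hypothetical conversion counts by segment and variant
df = pd.DataFrame({
    "segment":     ["mobile", "mobile", "desktop", "desktop"],
    "variant":     ["control", "treatment", "control", "treatment"],
    "conversions": [200, 1_000, 1_600, 220],
    "users":       [2_000, 9_000, 8_000, 1_000],
})
df["rate"] = df["conversions"] / df["users"]

# Per-segment lift (treatment rate minus control rate)
by_segment = df.pivot(index="segment", columns="variant", values="rate")
by_segment["lift"] = by_segment["treatment"] - by_segment["control"]
print(by_segment)

# Pooled lift ignoring segments: opposite sign here because treatment skews mobile
pooled = df.groupby("variant")[["conversions", "users"]].sum()
pooled_rate = pooled["conversions"] / pooled["users"]
print(f"pooled lift: {pooled_rate['treatment'] - pooled_rate['control']:.4f}")
```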
Launch/No Launch Decision:
When deciding whether to launch or not, consider assigning weights to metrics based on their relevance to the current business direction and the product's maturity. Make decisions aligned with the overall business strategy.