A/B Testing and Experimentation Checklist

Arvindh
5 min read

Embarking on an experimentation journey requires meticulous planning and a well-defined roadmap to ensure the reliability of results. In this guide, we'll explore the critical steps involved in crafting a successful experiment, from formulating a hypothesis to analyzing results. Let's dive into the key considerations and methodologies for creating a robust experimental design.

Pre-Experiment

  • Formulating a Falsifiable Hypothesis

    Begin by crafting a hypothesis that is simple, clear, and falsifiable. This sets the foundation for the experiment, allowing the data to either support or reject the stated hypothesis.

  • Choosing Metrics

    Define success metrics that align with your experiment's objectives. Additionally, identify guardrail metrics, data quality metrics, and user engagement metrics for continuous monitoring and quality checks throughout the experiment.

    • Calculating Metric Sensitivity

      Assess sensitivity by analyzing past experiments to see how often and by how much the metric moves. Consider restricting the analysis to a population subset that is more likely to be affected, so the metric stays responsive to the changes you care about.

    • Selecting Robust Metrics

      Opt for metrics that capture meaningful signals without being overly sensitive to irrelevant fluctuations. This ensures that the metrics selected are truly reflective of the experiment's impact.

  • Pre-Experiment Bias Check

    Run an A/A check on pre-experiment data to confirm that the assigned groups are comparable before the treatment starts, ensuring the success metric is not already distorted by pre-existing differences or external factors.
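
    As a rough sketch of this check (assuming a pre-experiment metric such as sessions per user is available; the simulated data and assignment below are purely illustrative):

```python
# Sketch: A/A-style bias check on pre-experiment data.
# The simulated metric and the random assignment are illustrative assumptions.
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(1)
n_users = 20_000
pre_metric = rng.gamma(shape=2.0, scale=5.0, size=n_users)  # e.g. sessions per user last month

# Apply the planned randomization before the experiment actually starts
assignment = rng.integers(0, 2, size=n_users)               # 0 = control, 1 = treatment

t_stat, p_value = ttest_ind(pre_metric[assignment == 0], pre_metric[assignment == 1])
print(f"A/A check on pre-experiment data: t={t_stat:.2f}, p={p_value:.3f}")
# A consistently small p-value here would point to pre-existing bias in the assignment.
```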

  • Unit of Randomization

    Carefully choose the unit of randomization (User_id, Cookie_id, Session_id, Device_id, IP Address) based on the experiment's nature and objectives.

  • Stable Unit Treatment Value Assumption

    Verify the assumption of non-interference among users. If it is violated, consider using Ego Clusters, a Switchback design, or Geo Clusters for randomization to preserve the integrity of the experiment.

  • Variance Mismatch Check

    Employ A/A Tests and bootstrapping to detect variance mismatches. Utilize the Delta method for correction if disparities between empirical and analytically derived variances are found.
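
    To illustrate why this check matters, the sketch below compares a naive analytic variance, which treats every page view as independent, against a user-level bootstrap for a ratio metric; all of the data is simulated for illustration.

```python
# Sketch: detecting a variance mismatch for a ratio metric (clicks / views)
# when randomization is by user but the naive formula treats views as independent.
# All data below is simulated purely for illustration.
import numpy as np

rng = np.random.default_rng(3)
n_users = 5_000
views = rng.poisson(lam=10, size=n_users) + 1      # page views per user
click_prob = rng.beta(2, 20, size=n_users)         # per-user click propensity
clicks = rng.binomial(views, click_prob)           # clicks per user

ctr = clicks.sum() / views.sum()

# Naive analytic variance: pretends each view is an independent Bernoulli trial
var_naive = ctr * (1 - ctr) / views.sum()

# Bootstrap over users, the actual unit of randomization
boot_ctrs = []
for _ in range(2_000):
    idx = rng.integers(0, n_users, size=n_users)
    boot_ctrs.append(clicks[idx].sum() / views[idx].sum())
var_bootstrap = np.var(boot_ctrs, ddof=1)

print(f"naive analytic variance: {var_naive:.3e}")
print(f"bootstrap variance:      {var_bootstrap:.3e}")  # usually larger -> correct with the delta method
```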

  • Variance Reduction Techniques

    • Focus the analysis only on users who were actually exposed to the change (Triggered Analysis).

    • Cap (winsorize) the metric or apply a log transformation (see the sketch after this list).

    • Remove outliers, which often show up as bumps in the p-value distribution of an A/A test.
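
    The capping and log-transform ideas above can be sketched as follows; the simulated revenue distribution and the 99th-percentile cap are assumptions chosen only for illustration.

```python
# Sketch: two simple variance-reduction transforms for a heavy-tailed metric.
# The simulated revenue distribution and the 99th-percentile cap are assumptions.
import numpy as np

rng = np.random.default_rng(7)
revenue = rng.lognormal(mean=2.0, sigma=1.5, size=50_000)  # heavy-tailed revenue per user

# 1. Cap (winsorize) the metric at a high percentile
cap = np.percentile(revenue, 99)
revenue_capped = np.minimum(revenue, cap)

# 2. Log-transform; log1p keeps zero-revenue users valid
revenue_log = np.log1p(revenue)

print(f"variance raw:    {revenue.var():,.1f}")
print(f"variance capped: {revenue_capped.var():,.1f}")
print(f"variance log1p:  {revenue_log.var():.3f}")
```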

  • Practical Significance Level

    Determine the effect size or practical significance level that the experiment aims to capture. This guides the experiment's design and analysis, aligning it with real-world impact.

  • Sample Size Considerations

    Use a sample size calculator based on the significance level (alpha), the desired power (1 - beta), the metric's variance, and the minimum effect you want to detect. Run the experiment for at least one full week to account for weekday and weekend effects.
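
    A minimal sketch of such a calculation for a conversion metric, assuming statsmodels is available; the baseline rate, minimum detectable lift, alpha, and power below are illustrative values, not recommendations.

```python
# Sketch: sample size per variant for a two-proportion test.
# Baseline rate, minimum detectable lift, alpha, and power are illustrative assumptions.
from statsmodels.stats.power import NormalIndPower
from statsmodels.stats.proportion import proportion_effectsize

baseline_rate = 0.10   # current conversion rate (assumed)
mde = 0.01             # smallest lift considered practically significant (assumed)
alpha = 0.05           # significance level
power = 0.80           # 1 - beta

# Cohen's h effect size for the two proportions
effect_size = proportion_effectsize(baseline_rate + mde, baseline_rate)

n_per_variant = NormalIndPower().solve_power(
    effect_size=effect_size, alpha=alpha, power=power, ratio=1.0, alternative="two-sided"
)
print(f"required sample size per variant: {n_per_variant:,.0f}")
```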

  • Multiple Testing Strategies

    Address multiple-testing issues by recalculating the sample size, applying the Bonferroni correction or Tukey's Honestly Significant Difference test, or by assigning different significance levels based on how strongly you believe the null hypothesis.
    For example: 0.05 for expected impacts, 0.01 for potential impacts, and 0.001 for unlikely impacts.
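
    For the Bonferroni option, a minimal sketch using statsmodels; the raw p-values below are made up for illustration.

```python
# Sketch: Bonferroni correction across the p-values of several tested metrics.
# The raw p-values are made-up examples.
from statsmodels.stats.multitest import multipletests

p_values = [0.012, 0.049, 0.200, 0.003]  # one p-value per metric (assumed)
reject, p_adjusted, _, _ = multipletests(p_values, alpha=0.05, method="bonferroni")

for p_raw, p_adj, significant in zip(p_values, p_adjusted, reject):
    print(f"raw p={p_raw:.3f}  adjusted p={p_adj:.3f}  significant={significant}")
```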

During Experiment

  • Implement Real-time Alerts and Auto Shutdown Feature:

    Set up real-time alerts and utilize auto-shutdown features to promptly respond to any anomalies during the experiment.

  • Gradual Ramp-Up Considering Speed, Quality, and Risk:

    Gradually increase the experiment's scale while balancing the speed of implementation, maintaining data quality, and mitigating associated risks.

    • Roll out the experiment internally to team members and beta users to gather qualitative feedback on its performance and usability.

    • Begin with a 5% user exposure and closely monitor user engagement metrics for any immediate impact.

    • Exercise caution when increasing user exposure to 50%, especially if there's a potential for a Novelty/Primacy effect. Consider allowing more time at this stage.

    • Proceed to a full-scale rollout to 100% of users once confident in the experiment's stability and performance.

  • Establish a Long-term Holdout Group:

    • Maintain a long-term holdout group for future comparisons, rerunning the experiment at a later date to gain more confidence and assess the impact on key metrics like retention.

  • Ongoing Sample Ratio Mismatch Checks:

    • Continuously monitor for sample ratio mismatches within different user segments to ensure a representative and unbiased sample.

  • Consideration for P-value Peeking:

    • If p-value peeking is allowed, use a sequential-testing boundary (for example, an alpha-spending rule) so that repeated looks at the data do not inflate the false-positive rate or unduly influence decision-making.

After Experiment

  • Sample Ratio Mismatch Check (Chi-Squared Test)

    Verify that the observed traffic split between variants matches the intended allocation using a chi-squared test (see the sketch after this list). Common causes of a mismatch include:

    • Ramp-up bug: Be vigilant for any issues during the ramp-up phase that may skew user assignment or introduce biases.

    • Parallel Experiments: Be cautious when running parallel experiments, as they can sometimes disrupt user assignment.

    • Dynamic Segmentation Rule: When employing dynamic user fields for segmentation, ensure the rule's stability to maintain consistency in user grouping.

    • Bot Classification for Variants: Investigate if one of the variants is erroneously classified as a bot, potentially affecting the experiment outcomes.

    • Device or Browser Issues: Address any issues related to specific devices or browsers that may impact user experience and metrics.
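
    A minimal sketch of the chi-squared check itself, assuming an intended 50/50 split; the observed counts are illustrative.

```python
# Sketch: sample ratio mismatch (SRM) check via a chi-squared goodness-of-fit test.
# Observed counts and the intended 50/50 split are illustrative assumptions.
from scipy.stats import chisquare

observed = [50_912, 49_088]            # users bucketed into control vs. treatment (assumed)
total = sum(observed)
expected = [total * 0.5, total * 0.5]  # intended allocation

stat, p_value = chisquare(f_obs=observed, f_exp=expected)
if p_value < 0.001:                    # a strict threshold is commonly used for SRM alerts
    print(f"possible SRM: chi2={stat:.2f}, p={p_value:.2e}")
else:
    print(f"no SRM detected: chi2={stat:.2f}, p={p_value:.3f}")
```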

  • Guardrail/Invariant Metrics Check:

    Examine if there are changes in guardrail or invariant metrics, as these should remain stable to ensure the validity of the experiment.

  • Statistical Tests Based on Chosen Metrics:

    Depending on the chosen metric, conduct the appropriate statistical test, such as a t-test for continuous metrics or a two-proportion z-test for conversion-style metrics.
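
    A rough sketch of both cases, with every count and sample simulated purely for illustration (the z-test uses statsmodels, the t-test uses SciPy's Welch variant):

```python
# Sketch: picking the test by metric type. All counts and samples are simulated.
import numpy as np
from scipy.stats import ttest_ind
from statsmodels.stats.proportion import proportions_ztest

# Conversion-style metric -> two-proportion z-test (assumed counts)
conversions = np.array([1_210, 1_325])   # successes in control, treatment
exposures = np.array([24_000, 24_100])   # users in control, treatment
z_stat, p_prop = proportions_ztest(count=conversions, nobs=exposures)

# Continuous metric (e.g. revenue per user) -> Welch's t-test (simulated samples)
rng = np.random.default_rng(42)
control = rng.gamma(shape=2.0, scale=5.0, size=24_000)
treatment = rng.gamma(shape=2.0, scale=5.2, size=24_100)
t_stat, p_cont = ttest_ind(treatment, control, equal_var=False)

print(f"two-proportion z-test: z={z_stat:.2f}, p={p_prop:.4f}")
print(f"Welch's t-test:        t={t_stat:.2f}, p={p_cont:.4f}")
```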

  • Confidence Interval and P-value Calculation (see the sketch after this list):

    • If the p-value is less than the significance level and the confidence interval does not contain zero, conclude statistical significance.

    • If the lower end of the confidence interval is above the practical significance threshold, conclude practical significance as well.

    • Two 95% confidence intervals of similar width can overlap by up to roughly 29% of their length and the difference can still be significant at the 0.05 level, so overlapping intervals alone do not rule out significance.
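
    The sketch referenced above computes a Wald interval for the difference between two conversion rates; the counts and the practical-significance threshold are assumptions.

```python
# Sketch: Wald confidence interval for the difference in two proportions,
# checked against zero (statistical significance) and against a
# practical-significance threshold. All numbers are illustrative assumptions.
import numpy as np
from scipy.stats import norm

x_c, n_c = 1_210, 24_000   # control: conversions, users (assumed)
x_t, n_t = 1_325, 24_100   # treatment: conversions, users (assumed)

p_c, p_t = x_c / n_c, x_t / n_t
diff = p_t - p_c
se = np.sqrt(p_c * (1 - p_c) / n_c + p_t * (1 - p_t) / n_t)

z = norm.ppf(1 - 0.05 / 2)   # 95% two-sided interval
lower, upper = diff - z * se, diff + z * se

practical_threshold = 0.003  # minimum lift worth launching (assumed)
print(f"diff={diff:.4f}, 95% CI=({lower:.4f}, {upper:.4f})")
print("statistically significant:", lower > 0 or upper < 0)
print("practically significant:  ", lower > practical_threshold)
```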

  • CUPED Adjustment:

    Consider using CUPED (Controlled-experiment Using Pre-Experiment Data), which uses pre-experiment values of the metric as a covariate to reduce variance. This tightens confidence intervals and is particularly useful with smaller sample sizes or when higher statistical power is required.
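
    A minimal sketch of the adjustment, using simulated pre- and during-experiment values of the same metric; the data and its correlation structure are assumptions.

```python
# Sketch: CUPED adjustment using the pre-experiment value of the metric as a covariate.
# The simulated data and its correlation structure are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(0)
n = 10_000
pre_metric = rng.gamma(shape=2.0, scale=5.0, size=n)          # metric before the experiment
post_metric = 0.8 * pre_metric + rng.normal(0, 2.0, size=n)   # same metric during the experiment

# theta is the regression coefficient of the during-experiment metric on the pre-experiment metric
theta = np.cov(post_metric, pre_metric, ddof=1)[0, 1] / np.var(pre_metric, ddof=1)
cuped_metric = post_metric - theta * (pre_metric - pre_metric.mean())

print(f"variance before CUPED: {post_metric.var():.2f}")
print(f"variance after CUPED:  {cuped_metric.var():.2f}")
```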

  • Optional Sign Test:

    Use the sign test to check whether the direction of the metric change is consistent across days or segments.
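
    A minimal sketch of a sign test over daily treatment-minus-control deltas, using SciPy's binomial test; the deltas are made up.

```python
# Sketch: sign test on daily (treatment - control) metric deltas.
# The deltas below are made-up examples.
from scipy.stats import binomtest

daily_deltas = [0.4, 0.1, -0.2, 0.3, 0.5, 0.2, -0.1, 0.6, 0.3, 0.2]
n_positive = sum(d > 0 for d in daily_deltas)
n_nonzero = sum(d != 0 for d in daily_deltas)

result = binomtest(n_positive, n_nonzero, p=0.5, alternative="two-sided")
print(f"{n_positive}/{n_nonzero} positive deltas, sign-test p={result.pvalue:.3f}")
```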

  • Simpson's Paradox Check:

    If subset segments show a different direction of movement than the overall result, investigate for a potential Simpson's paradox, which is often driven by an uneven traffic mix across segments.
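
    The toy table below shows the pattern to watch for: the treatment wins in every segment yet loses overall because traffic is split unevenly across segments. All numbers are invented for illustration.

```python
# Sketch: spotting a Simpson's-paradox pattern by comparing overall vs. per-segment rates.
# The counts in this toy table are invented for illustration.
import pandas as pd

df = pd.DataFrame({
    "segment":     ["mobile", "mobile", "desktop", "desktop"],
    "variant":     ["control", "treatment", "control", "treatment"],
    "users":       [2_000, 8_000, 8_000, 2_000],
    "conversions": [100, 440, 1_200, 320],
})

df["rate"] = df["conversions"] / df["users"]

overall = df.groupby("variant")[["users", "conversions"]].sum()
overall["rate"] = overall["conversions"] / overall["users"]

print("per-segment rates:\n", df, "\n")
print("overall rates:\n", overall)
# Treatment is better in each segment but worse overall because most of its
# traffic comes from the low-converting mobile segment.
```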

  • Launch/No Launch Decision:

    When deciding whether to launch or not, consider assigning weights to metrics based on their relevance to the current business direction and the product's maturity. Make decisions aligned with the overall business strategy.

Written by

Arvindh

I am a Sr. Data Scientist at Target, working at the intersection of optimization and machine learning. I earned my Master's in Business Analytics from the University of Texas at Dallas. 💻 I am currently learning more about Deep Learning and Generative AI. You can reach me at: https://www.linkedin.com/in/arvindh-arul/