Pawbability and Sniffsnifficance Jeopardy Template

The Bread & Butter

POWER Rangers

The Danger Zone

Testing, Testing

Not the Chubby Love Angel

200

This metric tells you the probability of obtaining test results at least as extreme as the results actually observed, under the assumption that the null hypothesis is correct.

What is a p-value?

200

This concept is the probability that your test will correctly detect a true effect when there actually is one.

What is statistical power?

200

If you allocate traffic 50/50 but end up with 45,000 users in Control and 55,000 in Treatment, your test is likely suffering from this data-invalidating phenomenon.

What is Sample Ratio Mismatch?

200

This statistical test is used to determine if there is a statistically significant difference somewhere among the means of 3 or more groups.

What is ANOVA (Analysis of Variance)?
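
As a rough illustration of the answer above, the F statistic behind a one-way ANOVA can be sketched with the standard library. The function name and the toy data are made up for this example, and the sketch stops at the F statistic. Turning it into a p-value needs the F distribution, which a library routine such as SciPy's `scipy.stats.f_oneway` provides.

```python
from statistics import mean


def one_way_anova_f(groups):
    """Return the one-way ANOVA F statistic for a list of numeric groups."""
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n_total
    group_means = [mean(g) for g in groups]

    # Between-group variation: how far each group mean sits from the grand mean.
    ss_between = sum(
        len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, group_means)
    )
    # Within-group variation: how far each observation sits from its group mean.
    ss_within = sum(
        (x - m) ** 2 for g, m in zip(groups, group_means) for x in g
    )

    ms_between = ss_between / (k - 1)
    ms_within = ss_within / (n_total - k)
    return ms_between / ms_within


# Hypothetical metric values for three variants.
variants = [[1, 2, 3], [2, 3, 4], [3, 4, 5]]
f_stat = one_way_anova_f(variants)
```

A large F says the group means differ by more than the within-group noise would suggest, but not which groups differ.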

200

This is the statistical strategy of using historical data to shrink the noise in your current experiment.

What is CUPED (Controlled-experiment Using Pre-Experiment Data)?

400

This fundamental experimental process ensures that any confounding variables (like user demographics or time of day) are distributed equally between your control and treatment groups.

What is randomization?

400

In power analysis, this is the smallest lift or change in a metric that you care about practically and want to ensure your test can detect.

What is Minimum Detectable Effect?

400

This occurs when you run multiple variants simultaneously (e.g., Change A and Change B) and the effect of Change A depends heavily on whether the user also experienced Change B.

What is an interaction effect?

400

This is the simplest, most conservative method used to control the Family-Wise Error Rate by dividing your target alpha by the number of comparisons being made (alpha/k).

What is the Bonferroni correction?
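
The alpha/k rule in this clue fits in a few lines. The sketch below uses a hypothetical helper name and made-up p-values; it treats a p-value as significant when it is at or below the corrected threshold.

```python
def bonferroni_significant(p_values, alpha=0.05):
    """Return the corrected threshold alpha/k and a significance flag per p-value."""
    k = len(p_values)
    threshold = alpha / k
    return threshold, [p <= threshold for p in p_values]


# Five hypothetical comparisons against control.
p_values = [0.004, 0.02, 0.03, 0.2, 0.5]
threshold, flags = bonferroni_significant(p_values, alpha=0.05)
```

With five comparisons the per-test threshold drops from 0.05 to 0.01, so only the first p-value in this toy list survives.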

400

The primary statistical goal of implementing CUPED in an A/B testing framework is to reduce this mathematical property of your metric, allowing you to reach significance faster.

What is variance?

600

If your A/B test has a confidence interval for the treatment effect that ranges from -2% to +5%, this is the conclusion you must draw about the statistical significance of your result.

What is "not statistically significant"?

600

If your A/B test has a statistical power of 80%, this is the exact mathematical probability (expressed as a percentage) that you will commit a Type II error (false negative) if a true effect exists.

What is 20%?

600

You launch a feature that only loads for users on fast 5G networks, while the control group includes everyone. By comparing the groups, you mistakenly conclude your feature drastically increased engagement. This is because your experiment suffers from this specific flaw.

What is selection bias?

600

To detect an SRM, data scientists usually run this specific statistical test on the observed vs. expected sample counts.

What is a Chi-square goodness-of-fit test?
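
A minimal sketch of that check, using the 45,000 / 55,000 split from the 200-point Danger Zone clue and a 50/50 expected allocation. The function name is hypothetical. With two groups there is one degree of freedom, so the chi-square tail probability can be computed with `math.erfc` and no external library.

```python
import math


def srm_chi_square(observed, expected_ratios):
    """Chi-square goodness-of-fit for a two-group split (1 degree of freedom)."""
    total = sum(observed)
    expected = [total * r for r in expected_ratios]
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    # For 1 degree of freedom, P(X >= chi2) = erfc(sqrt(chi2 / 2)).
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value


chi2, p_value = srm_chi_square([45_000, 55_000], [0.5, 0.5])
```

Teams often flag SRM at a stricter cutoff than 0.05 (for example p < 0.001) because the check runs on every experiment. That cutoff is a convention, not part of the test itself.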

600

For CUPED to successfully reduce variance, the pre-experiment metric and the in-experiment metric must have a strong, non-zero amount of this statistical relationship.

What is correlation (or covariance)?
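
To see why correlation matters here, the sketch below applies the usual CUPED adjustment, Y_cuped = Y - theta * (X - mean(X)) with theta = cov(X, Y) / var(X), to simulated data. Everything in it is an assumption for illustration: the data are synthetic, the names are hypothetical, and theta is estimated on a single group rather than pooled across control and treatment as a production pipeline typically would.

```python
import random
from statistics import mean, variance


def cuped_adjust(pre, post):
    """Return (theta, adjusted metric values) for paired pre/post values."""
    n = len(pre)
    pre_mean, post_mean = mean(pre), mean(post)
    cov = sum((x - pre_mean) * (y - post_mean) for x, y in zip(pre, post)) / (n - 1)
    theta = cov / variance(pre)
    adjusted = [y - theta * (x - pre_mean) for x, y in zip(pre, post)]
    return theta, adjusted


def correlation(xs, ys):
    """Pearson correlation, written out to avoid version-specific helpers."""
    n = len(xs)
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (n - 1)
    return cov / (variance(xs) * variance(ys)) ** 0.5


rng = random.Random(0)
pre = [rng.gauss(10, 2) for _ in range(2000)]   # pre-experiment metric
post = [x + rng.gauss(0, 1) for x in pre]       # in-experiment metric, correlated with pre

theta, adjusted = cuped_adjust(pre, post)
r = correlation(pre, post)
variance_ratio = variance(adjusted) / variance(post)  # equals 1 - r**2
```

The adjusted metric keeps the same mean but its variance shrinks by a factor of 1 - r², so a correlation near zero buys almost nothing.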

800

This is the specific statistical test you would use if you want to compare the conversion rates (a categorical, binary metric) between a Control group and a Treatment group.

What is a Chi-square test of independence (or the equivalent two-proportion z-test)?
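
A sketch of that test on a 2x2 table of converted vs. not converted by group. The counts and function name are made up for illustration, and no continuity correction is applied, which is an assumption; some tools apply Yates' correction by default.

```python
import math


def chi_square_2x2(conv_control, n_control, conv_treatment, n_treatment):
    """Chi-square test of independence on a 2x2 conversion table (1 d.o.f.)."""
    observed = [
        [conv_control, n_control - conv_control],
        [conv_treatment, n_treatment - conv_treatment],
    ]
    total = n_control + n_treatment
    row_totals = [n_control, n_treatment]
    col_totals = [
        conv_control + conv_treatment,
        total - conv_control - conv_treatment,
    ]

    chi2 = 0.0
    for i in range(2):
        for j in range(2):
            expected = row_totals[i] * col_totals[j] / total
            chi2 += (observed[i][j] - expected) ** 2 / expected

    # For 1 degree of freedom, P(X >= chi2) = erfc(sqrt(chi2 / 2)).
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value


# Hypothetical result: 10% vs 15% conversion with 1,000 users per group.
chi2, p_value = chi_square_2x2(100, 1000, 150, 1000)
```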

800

If you decide you want to detect a smaller effect than originally planned while keeping your power and alpha the same, your required sample size must change in this direction.

What is an increase?
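
The direction follows from the standard normal-approximation formula for a two-sided, two-sample test, n per group ≈ 2 * ((z_{1-alpha/2} + z_{power}) * sigma / MDE)². The sketch below assumes equal group sizes, a known standard deviation, and an absolute (not relative) effect; the function name and the numbers are illustrative only.

```python
from statistics import NormalDist


def sample_size_per_group(mde, sigma, alpha=0.05, power=0.80):
    """Approximate users needed per group to detect an absolute effect of size mde."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_power = NormalDist().inv_cdf(power)
    return 2 * ((z_alpha + z_power) * sigma / mde) ** 2


n_original = sample_size_per_group(mde=1.0, sigma=10.0)
n_smaller_effect = sample_size_per_group(mde=0.5, sigma=10.0)
ratio = n_smaller_effect / n_original
```

Because the effect size is squared in the denominator, halving the effect you want to detect roughly quadruples the required sample.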

800

If you repeatedly peek at your A/B test data and stop it early the moment it hits p < 0.05, this type of error rate will be much higher than your nominal alpha.

What is the Type I error (false positive) rate?
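
A small simulation makes the inflation visible. The sketch below runs A/A tests (no true effect) on standard normal data, checks a z-test after every batch, and counts an experiment as a false positive if any peek dips below alpha. The number of peeks, batch size, seed, and known unit variance are all assumptions chosen to keep the example short.

```python
import random
from statistics import NormalDist


def simulate_peeking(n_experiments=500, n_peeks=10, batch_size=50,
                     alpha=0.05, seed=0):
    """Return (false positive rate with peeking, rate with a single final look)."""
    rng = random.Random(seed)
    std_normal = NormalDist()
    peeking_hits = 0
    final_hits = 0

    for _ in range(n_experiments):
        sum_a = sum_b = 0.0
        n = 0
        crossed = False
        p_value = 1.0
        for _ in range(n_peeks):
            # Collect another batch for each arm; both arms share the same true mean.
            for _ in range(batch_size):
                sum_a += rng.gauss(0, 1)
                sum_b += rng.gauss(0, 1)
            n += batch_size
            # Two-sample z-test with known unit variance in each arm.
            z = (sum_a / n - sum_b / n) / (2 / n) ** 0.5
            p_value = 2 * (1 - std_normal.cdf(abs(z)))
            if p_value < alpha:
                crossed = True  # a peeker would have stopped and shipped here
        peeking_hits += crossed
        final_hits += p_value < alpha

    return peeking_hits / n_experiments, final_hits / n_experiments


peeking_rate, fixed_rate = simulate_peeking()
```

The single-look rate should land near the nominal alpha, while the peeking rate comes out well above it. Sequential testing methods exist precisely to correct for this.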

800

The Benjamini-Hochberg procedure controls this among all significant results, making it less conservative and more powerful for large numbers of variants.

What is the False Discovery Rate (FDR)?
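
A compact sketch of the step-up procedure: sort the p-values, find the largest rank k whose p-value is at or below (k / m) * q, and reject every hypothesis up to that rank. The function name and p-values are illustrative.

```python
def benjamini_hochberg(p_values, q=0.05):
    """Return a list of booleans (original order): True where the null is rejected."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])

    # Find the largest rank whose p-value clears its rank-scaled threshold.
    largest_rank = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / m * q:
            largest_rank = rank

    rejected = [False] * m
    for idx in order[:largest_rank]:
        rejected[idx] = True
    return rejected


# Four hypothetical variant-vs-control p-values.
decisions = benjamini_hochberg([0.01, 0.02, 0.03, 0.20], q=0.05)
```

On this toy input the procedure rejects the first three hypotheses, whereas a per-test threshold of 0.05 / 4 = 0.0125 would reject only the first.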

800

If an engineer accidentally includes data from after the user was exposed to the treatment variant when calculating the CUPED baseline, it will introduce this issue and can completely wipe out or distort the actual treatment effect.

What is post-treatment bias?