Distributions
Statistical Dependencies
Experimental Design
Sample Size Calcs
Statistical Algorithms
100
What does the area under a probability density distribution equal?
1
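A quick numerical check of this answer, as a minimal sketch: it integrates a standard normal density with the trapezoid rule. The choice of the normal distribution, the integration range of -10 to 10, and the helper names `normal_pdf` and `area_under` are illustrative assumptions, not part of the original card.

```python
import math

def normal_pdf(x, mu=0.0, sigma=1.0):
    """Density of a normal distribution at x."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2 * math.pi))

def area_under(pdf, lo, hi, steps=100_000):
    """Approximate the integral of pdf over [lo, hi] with the trapezoid rule."""
    width = (hi - lo) / steps
    total = 0.5 * (pdf(lo) + pdf(hi))
    for i in range(1, steps):
        total += pdf(lo + i * width)
    return total * width

# The range [-10, 10] covers essentially all of the standard normal's mass.
area = area_under(normal_pdf, -10.0, 10.0)
print(round(area, 6))
```

The same check should work for any valid density, as long as the integration range covers nearly all of its mass.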
100
What is a correlation?
A description of the relationship between changes in variable A and changes in variable B
100
What are the four ingredients necessary for an experimental design?
1. A null and alternative hypothesis 2. A pre-defined burden of proof (magnitude of effect and significance threshold) 3. A well-selected counterfactual 4. A process to roll out the treatment with integrity
100
How many outcome variables is a single sample size calculation valid for?
1
100
For a trial to give us meaningful results, what four things do we need to see?
1. High power 2. A low p-value 3. A meaningful confidence interval and/or point estimate 4. Confidence that the trial was executed well
200
How many modes does a bimodal distribution have?
2
200
What do you need to infer causality?
A proper counterfactual
200
What makes something an RCT?
When the Treatment and Control groups are randomly assigned
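Random assignment can be sketched in a few lines. This is only an illustration, assuming a simple 50/50 split of individual units; the function name `assign_groups` and the fixed seed are hypothetical choices made so the example is reproducible.

```python
import random

def assign_groups(unit_ids, seed=0):
    """Randomly split units into treatment and control groups of (nearly) equal size."""
    rng = random.Random(seed)  # fixed seed only so the sketch is reproducible
    shuffled = list(unit_ids)
    rng.shuffle(shuffled)
    half = len(shuffled) // 2
    return {"treatment": shuffled[:half], "control": shuffled[half:]}

groups = assign_groups(range(10))
print(groups)
```

Because every unit has the same chance of landing in either group, the groups should be comparable on average, which is what makes the control group a credible counterfactual.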
200
How do sample sizes relate to the directionality of hypotheses?
A one-directional (one-sided) hypothesis will generally have a smaller sample size requirement than a two-directional one, at the same significance threshold and power
200
What is the primary difference between a linear and logistic regression?
Linear regression is for continuous outcome variables. Logistic regression is for categorical outcome variables.
300
What does a "left-skewed" distribution mean?
It has a long left-sided tail
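One way to see this numerically is the moment-based skewness coefficient, which is negative for a left-skewed sample. The sketch below uses a small made-up dataset (`left_skewed` is hypothetical data, not from the original card).

```python
from statistics import mean, median

def skewness(values):
    """Moment-based sample skewness: third central moment over variance^1.5."""
    n = len(values)
    center = sum(values) / n
    m2 = sum((v - center) ** 2 for v in values) / n
    m3 = sum((v - center) ** 3 for v in values) / n
    return m3 / m2 ** 1.5

# Hypothetical scores: most values are high, with one very low outlier on the left.
left_skewed = [1, 6, 7, 8, 8, 9, 9, 9, 10, 10]

print(skewness(left_skewed))                   # negative for a long left tail
print(mean(left_skewed), median(left_skewed))  # the tail pulls the mean below the median
```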
300
What does a positive correlation mean?
As one variable goes up, the other variable goes up too
300
When is it appropriate to use a diff-in-diff design?
When you have, or are able to collect, both baseline ("before") and post-treatment ("after") observations over time for the treatment and comparison groups
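The diff-in-diff estimate itself is simple arithmetic: the change in the treatment group minus the change in the comparison group. A minimal sketch with made-up numbers follows; the function name `diff_in_diff` and all data values are hypothetical.

```python
def diff_in_diff(treat_before, treat_after, control_before, control_after):
    """(Change in treatment group mean) minus (change in control group mean)."""
    def avg(xs):
        return sum(xs) / len(xs)
    treat_change = avg(treat_after) - avg(treat_before)
    control_change = avg(control_after) - avg(control_before)
    return treat_change - control_change

# Hypothetical outcome measurements for each group and period.
estimate = diff_in_diff(
    treat_before=[10, 12, 11],
    treat_after=[15, 17, 16],
    control_before=[9, 11, 10],
    control_after=[11, 13, 12],
)
print(estimate)
```

Subtracting the control group's change removes any trend that would have affected both groups anyway, under the assumption that the two groups would have moved in parallel without the treatment.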
300
How does the level of randomization relate to sample size?
Randomizing treatment at the same level as our outcome variables will result in smaller required sample sizes than randomizing at a higher (group or cluster) level
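One common way to quantify this is the design effect, 1 + (m - 1) * ICC, where m is the cluster size and ICC is the intra-cluster correlation. This formula is a standard approximation brought in for illustration and is not stated in the original card; the function names and example numbers are hypothetical.

```python
def design_effect(cluster_size, icc):
    """Variance inflation from randomizing whole clusters instead of individuals."""
    return 1 + (cluster_size - 1) * icc

def clustered_sample_size(individual_n, cluster_size, icc):
    """Sample size needed under cluster randomization (not rounded)."""
    return individual_n * design_effect(cluster_size, icc)

# Hypothetical trial: 400 people would suffice under individual randomization.
# Randomizing villages of 20 people with ICC = 0.05 needs noticeably more.
needed = clustered_sample_size(individual_n=400, cluster_size=20, icc=0.05)
print(needed)
```

With a cluster size of 1 (randomizing at the level of the outcome), the design effect is 1 and there is no inflation.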
300
What is the difference between a t-test, an ANOVA test, and a Wilcoxon test?
T-test is for comparing means of two groups assuming normality. ANOVA is for comparing means of multiple groups assuming normality. Wilcoxon is a non-parametric version of the t-test for when assumptions of normality do not hold.
400
Kurtosis describes how sharp a distribution is compared to what?
The normal distribution
400
What does the Pearson's coefficient tell you?
How strong a linear correlation is, where: 1 means perfectly positively correlated; -1 means perfectly negatively correlated; 0 means no linear correlation
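The coefficient can be computed by hand as the covariance divided by the product of the standard deviations. The sketch below uses small made-up datasets to hit the three reference values; `pearson_r` is an illustrative helper, not a library function.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)

perfect_positive = pearson_r([1, 2, 3, 4], [2, 4, 6, 8])
perfect_negative = pearson_r([1, 2, 3, 4], [8, 6, 4, 2])
# A V-shaped relationship: y clearly depends on x, but not linearly.
v_shaped = pearson_r([1, 2, 3, 4, 5], [2, 1, 0, 1, 2])
print(perfect_positive, perfect_negative, v_shaped)
```

The V-shaped case is a reminder that a coefficient near 0 rules out a linear relationship only, not a relationship of any kind.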
400
When is it appropriate to use a PSM design?
When you are willing to collect data on a large pool of potential controls but you are not willing to drop people
400
How does sample size relate to the: (a) desired detectable effect size, (b) standard deviation of the outcome variable, (c) statistical threshold, and (d) power?
Effect size: inverse. Standard deviation: direct. Statistical threshold: inverse. Power: direct.
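These four relationships can be seen in the usual normal-approximation formula for comparing two group means, n per group = 2 * sd^2 * (z_alpha + z_power)^2 / effect^2. The sketch below assumes that formula, equal group sizes, and a continuous outcome; the function name and default values are illustrative and not from the original card.

```python
import math
from statistics import NormalDist

def sample_size_per_group(effect_size, sd, alpha=0.05, power=0.80, two_sided=True):
    """Approximate n per group for detecting a difference in two means."""
    std_normal = NormalDist()
    tail = alpha / 2 if two_sided else alpha   # split alpha across both tails if two-sided
    z_alpha = std_normal.inv_cdf(1 - tail)     # critical value for the significance threshold
    z_power = std_normal.inv_cdf(power)        # critical value for the desired power
    n = 2 * sd ** 2 * (z_alpha + z_power) ** 2 / effect_size ** 2
    return math.ceil(n)

baseline = sample_size_per_group(effect_size=0.5, sd=1.0)
print("baseline:", baseline)
print("larger effect:", sample_size_per_group(1.0, 1.0))                # fewer people
print("larger sd:", sample_size_per_group(0.5, 2.0))                    # more people
print("stricter alpha:", sample_size_per_group(0.5, 1.0, alpha=0.01))   # more people
print("higher power:", sample_size_per_group(0.5, 1.0, power=0.90))     # more people
print("one-sided:", sample_size_per_group(0.5, 1.0, two_sided=False))   # fewer people
```

Because the calculation takes one effect size and one standard deviation, it applies to a single outcome variable at a time.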
400
What is the "no free lunch" theorem?
No one algorithm is always better than others. However, for certain circumstances, some algorithms tend to have advantages over others.
500
Are distributions a tool used for descriptive stats or inferential stats?
Trick question: both! Descriptive stats = describes the distribution of only the data you have, e.g. min, mean, median, max, standard deviation, etc. Inferential stats = try to infer something about a bigger population based on the distribution from a sample.
500
Describe the form, direction, and strength of the scatterplot between income per person and babies per woman: http://www.gapminder.org/tools
Linear, negative, reasonably strong
500
When is it appropriate to use a regression discontinuity design?
When there is a somewhat arbitrary threshold that cuts people into the Treatment or Control group, e.g. a geographic border or a cut-off policy
500
Why is it dangerous to test a lot of hypotheses with the same sample?
Because it is possible to see a statistically significant difference by chance. The more hypotheses we test with the same sample, the more we open ourselves up to the possibility of seeing a significant finding just by chance.
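The size of the risk can be sketched with the family-wise error rate, 1 - (1 - alpha)^m, which assumes the m tests are independent and all null hypotheses are true. That independence assumption, the function names, and the Bonferroni adjustment shown as one possible remedy are additions for illustration, not part of the original card.

```python
def familywise_error_rate(num_tests, alpha=0.05):
    """Chance of at least one false positive across independent tests."""
    return 1 - (1 - alpha) ** num_tests

def bonferroni_threshold(num_tests, alpha=0.05):
    """Per-test threshold that keeps the family-wise rate at or below alpha."""
    return alpha / num_tests

for m in (1, 5, 20):
    print(m, round(familywise_error_rate(m), 3), bonferroni_threshold(m))
```

With 20 independent tests at a 0.05 threshold, the chance of at least one spurious significant result is well over half.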
500
Describe the difference between the covariates and the data required for a PREDICTIVE model vs. an EXPLANATORY model.
PREDICTIVE: highly correlated covariates; data from a large dataset with wide variation in the outcome variable. EXPLANATORY: hypothesis-driven covariates; data from an experimental design.