EDA quantitative summary for a quantitative variable?
Numerical summary statistics: mean, standard deviation, IQR, min, and max (or the five-number summary).
Code for EDA for visual summary of quantitative variable?
Base R: hist(dat$mjage)
ggplot: dat %>% ggplot(aes(x=mjage))+geom_histogram()
What are the four assumptions of a simple linear regression model?
Homoscedasticity
Independence between observations
Normality
Linearity
What is the normal model? What is it used for?
It is a model that assumes a variable follows a Normal distribution with some mean and variance. It is used to determine 1) what proportion of units fall within a given range of values of the variable, and 2) what values correspond to a given proportion of units.
What is a p-value?
A p-value is the probability of obtaining results at least as extreme as those observed, assuming that the null hypothesis is true. The lower the p-value, the stronger the evidence against the null hypothesis.
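Example for the p-value card: a minimal sketch using a one-sample t-test on simulated data. The sample, the seed, and the null value of 0 are made up for illustration and do not come from the course dataset. It computes the p-value by hand from the t statistic and checks it against t.test().

```r
# Hypothetical sample: 30 simulated values (not from the original dataset)
set.seed(1)
x <- rnorm(30, mean = 0.5, sd = 1)

# Test H0: true mean = 0 with a one-sample, two-sided t-test
res <- t.test(x, mu = 0)

# Same p-value by hand: probability of a t statistic at least this
# extreme (in either direction) if H0 were true
t_stat <- (mean(x) - 0) / (sd(x) / sqrt(length(x)))
p_manual <- 2 * pt(-abs(t_stat), df = length(x) - 1)

print(res$p.value)
print(p_manual)
```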
EDA visual summary for a quantitative variable?
Histogram.
Code for EDA quantitative summary of quantitative variable?
Base R: summary(dat$mjage)
tidyverse: dat %>% summarise(mean=mean(mjage), sd=sd(mjage), IQR=IQR(mjage), min=min(mjage), max=max(mjage))
How do you test for homoscedasticity?
Scatterplot, scale-location plot, or residuals vs. x plot all work. Look for roughly constant vertical spread across the range of x.
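Example for the homoscedasticity card: a sketch in base R on simulated data. The variables x and y and the model are hypothetical stand-ins for whatever regression you fitted.

```r
# Hypothetical data: simulated x and y with constant error variance
set.seed(2)
x <- runif(100, 0, 10)
y <- 3 + 2 * x + rnorm(100, sd = 1)
fit <- lm(y ~ x)

# Residuals vs. x: a fan or funnel shape suggests unequal variance
plot(x, resid(fit), xlab = "x", ylab = "Residuals")
abline(h = 0, lty = 2)

# Scale-location plot: sqrt(|standardized residuals|) vs. fitted values
plot(fit, which = 3)
```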
Is the normal model correct?
All models are wrong, some are useful.
What is a confidence interval?
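It's a range of plausible values for the parameter you are estimating, computed from your sample. The confidence level (e.g. 95%) is the proportion of such intervals that would contain the true parameter value if you repeated the sampling many times.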
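Example for the confidence interval card: a sketch that builds a 95% interval for a mean by hand and compares it with t.test(). The sample is simulated, and the mean of 17 and SD of 3 are made-up values for illustration.

```r
# Hypothetical sample of 40 values (simulated, not the course data)
set.seed(3)
x <- rnorm(40, mean = 17, sd = 3)

# By hand: estimate +/- critical value * standard error
n <- length(x)
se <- sd(x) / sqrt(n)
t_crit <- qt(0.975, df = n - 1)
ci_manual <- c(mean(x) - t_crit * se, mean(x) + t_crit * se)

# Same interval from t.test()
ci_ttest <- t.test(x, conf.level = 0.95)$conf.int

print(ci_manual)
print(ci_ttest)
```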
EDA for a categorical variable?
Quantitative: table
Visual: barplot
Code for EDA for a categorical variable?
Quant: table(dat$mjage)
Visual: Barplot
Base R: barplot(table(dat$mjage))
ggplot:
tab1 <- melt(table(dat$mjage)) %>% rename(mjage=Var1)
ggplot(data = tab1, aes(x=mjage, y=value)) +
geom_bar(stat = "identity")
How do you test for normality?
Q-Q plot of the residuals. The points should fall close to a straight line.
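Example for the normality card: a sketch of a normal Q-Q plot of regression residuals in base R. The data are simulated and the model is a hypothetical stand-in.

```r
# Hypothetical data: simulated linear relationship with normal errors
set.seed(4)
x <- runif(100, 0, 10)
y <- 1 + 0.5 * x + rnorm(100)
fit <- lm(y ~ x)

# Q-Q plot of the residuals with a reference line
r <- resid(fit)
qqnorm(r)
qqline(r)
```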
pnorm calculates what proportion of a normally-distributed population is less than a given value.
qnorm is the inverse of pnorm. It goes from a proportion (the area to the left) to the corresponding value.
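Example for the pnorm and qnorm cards: a short sketch showing that the two functions undo each other. The mean of 100 and SD of 15 in the second part are made-up values for illustration.

```r
# Standard normal: proportion of the population below 1.5
p <- pnorm(1.5)
print(p)        # about 0.933

# qnorm takes that proportion back to the value 1.5
z <- qnorm(p)
print(z)

# Same idea with a mean and SD (hypothetical values)
p2 <- pnorm(110, mean = 100, sd = 15)
v2 <- qnorm(p2, mean = 100, sd = 15)
print(v2)
```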
What does it mean to do inference?
EDA visual summary for quantitative vs. categorical variable?
Side-by-side boxplots or side-by-side histograms.
Code for side-by-side boxplots?
dat %>% ggplot(aes(x=factor(irsex), y=mjage)) + geom_boxplot()
How do you test for which points might be outliers, and whether you should worry about them?
Cook's distance plot.
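Example for the outlier card: a sketch using cooks.distance() on simulated data with one unusual point added on purpose. The 4/n cutoff used here is one common rule of thumb and is an assumption, not something stated on the card.

```r
# Hypothetical data: simulated linear relationship
set.seed(5)
x <- runif(50, 0, 10)
y <- 2 + 1.5 * x + rnorm(50)

# Inject one unusual point so there is something to find
x[50] <- 20
y[50] <- 0

fit <- lm(y ~ x)
cd <- cooks.distance(fit)

# Cook's distance plot (one bar per observation)
plot(fit, which = 4)

# Flag points above 4/n (a rule of thumb, not a strict cutoff)
flagged <- which(cd > 4 / length(cd))
print(flagged)
```

A flagged point is only a candidate. Check whether it is a data error and refit without it to see how much the estimates change.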
For SAT, X~N(1500, 300), where 300 is the standard deviation. What is the minimum score needed to be in the top 25%?
About 1702.
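Worked in R: a sketch with qnorm, reading the 300 in N(1500, 300) as the standard deviation (an assumption about the notation that matches the answer on the card). The top 25% starts at the 75th percentile.

```r
# Value with 75% of scores below it (so 25% above)
cutoff <- qnorm(0.75, mean = 1500, sd = 300)
print(round(cutoff))  # 1702

# Same cutoff, asking for the upper tail directly
cutoff2 <- qnorm(0.25, mean = 1500, sd = 300, lower.tail = FALSE)
print(round(cutoff2))
```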
EDA visual summary for quantitative vs quantitative variable?
Scatterplot.
Code for scatterplot?
plot(dat$mjage, dat$cigage)
or
dat %>% ggplot(aes(x=mjage, y=cigage))+geom_point()
Remember to check which is x and y!
How do you test for linearity? What can you do if you don't have a linear trend?
Check the scatterplot, and look for any unusual trends (such as curvature) in the residuals vs. x plot. If the trend is not linear, you can transform y or x (e.g. take a log).
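Example for the linearity card: a sketch on simulated data where y grows exponentially in x, so a log transform of y straightens the trend. The data and the choice of a log transform are assumptions for illustration. Other transforms may suit other data.

```r
# Hypothetical data: y grows exponentially with x
set.seed(6)
x <- seq(1, 10, length.out = 50)
y <- exp(0.5 * x + rnorm(50, sd = 0.2))

# Fit on the raw scale and on the log scale
fit_raw <- lm(y ~ x)
fit_log <- lm(log(y) ~ x)

# Residuals vs. x: the raw fit shows a curved pattern,
# the log fit should look like random scatter around 0
plot(x, resid(fit_raw), ylab = "Residuals (raw y)")
abline(h = 0, lty = 2)
plot(x, resid(fit_log), ylab = "Residuals (log y)")
abline(h = 0, lty = 2)

# Slope on the log scale (the simulation used 0.5)
print(coef(fit_log)["x"])
```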
For SAT, X~N(1500, 300), where 300 is the standard deviation. What % of takers score between 1200 and 1800?
About 68% (within one standard deviation of the mean).
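Worked in R: a sketch with pnorm, again reading the 300 in N(1500, 300) as the standard deviation (an assumption about the notation). The proportion between two values is the area below the upper value minus the area below the lower value.

```r
# Proportion of scores between 1200 and 1800
prop <- pnorm(1800, mean = 1500, sd = 300) - pnorm(1200, mean = 1500, sd = 300)
print(round(prop, 4))  # 0.6827
```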