EDA (concepts)
EDA (code)
Simple linear regression
Normal model
Inference
100

EDA quantitative summary for a quantitative variable?

Numerical summaries: mean and standard deviation, or the five-number summary (min, Q1, median, Q3, max) and IQR.

100

Code for EDA for visual summary of quantitative variable?

Base R: hist(dat$mjage)

ggplot: dat %>% ggplot(aes(x=mjage))+geom_histogram()

100

What are the four assumptions of a simple linear regression model?

Homoscedasticity

Independence between observations

Normality

Linearity

100

What is the normal model? What is it used for?

It is a model that assumes a variable follows a Normal distribution with some mean and variance. It is used to determine 1) what proportion of units fall within a given range of values of the variable, and 2) what values correspond to a given proportion of units.

100

What is a p-value?

A p-value is the probability of obtaining results at least as extreme as those observed, assuming that the null hypothesis is true. The lower the p-value, the stronger the evidence against the null hypothesis.
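
A minimal sketch of computing a two-sided p-value by hand and checking it against t.test(). It assumes a one-sample t-test on the built-in mtcars data (not the course data), and the null mean of 20 is made up for illustration.

```r
# Assumption: mtcars stands in for the course data; mu = 20 is a made-up null value.
x <- mtcars$mpg
n <- length(x)

# Test statistic for H0: population mean = 20
t_stat <- (mean(x) - 20) / (sd(x) / sqrt(n))

# Two-sided p-value: probability of a t statistic at least this extreme under H0
p_manual <- 2 * pt(-abs(t_stat), df = n - 1)

# Same p-value from the built-in test
p_builtin <- t.test(x, mu = 20)$p.value
```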

200

EDA visual summary for a quantitative variable?

Histogram.

200

Code for EDA quantitative summary of quantitative variable?

Base R: summary(dat$mjage)

tidyverse: dat %>% summarise(mean=mean(mjage), sd=sd(mjage), IQR=IQR(mjage), min=min(mjage), max=max(mjage))

200

How do you test for homoscedasticity?

Scatterplot, scale-location plot, and residuals vs. x plot all work. Look for roughly constant spread of the residuals across x.
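
A rough sketch of the residuals vs. x and scale-location plots. The data are simulated with constant error spread, so the variable names and numbers are illustrative assumptions, not the course data.

```r
# Assumption: simulated data with constant error spread (sd = 1) by construction.
set.seed(1)
x <- runif(100, 0, 10)
y <- 2 + 3 * x + rnorm(100, sd = 1)

fit <- lm(y ~ x)

# Residuals vs. x: look for an even band around zero, with no funnel shape
plot(x, resid(fit), ylab = "Residuals")
abline(h = 0, lty = 2)

# Scale-location plot from the built-in lm diagnostics
plot(fit, which = 3)
```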

200

Is the normal model correct? 

All models are wrong, some are useful.

200

What is a confidence interval?

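It is a range of plausible values for a population parameter, computed from the sample. The confidence level (e.g. 95%) is the proportion of such intervals that would contain the true parameter if you repeated the sampling many times.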
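
A minimal sketch of a 95% confidence interval for a mean, built by hand and compared with t.test(). It assumes a t-based interval and uses the built-in mtcars data as a stand-in for the course data.

```r
# Assumption: mtcars stands in for the course data; the interval is t-based.
x <- mtcars$mpg
n <- length(x)
se <- sd(x) / sqrt(n)

# Estimate plus or minus (critical value * standard error)
ci_manual <- mean(x) + c(-1, 1) * qt(0.975, df = n - 1) * se

# Same interval from the built-in test
ci_builtin <- t.test(x, conf.level = 0.95)$conf.int
```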

300

EDA for a categorical variable?

Quantitative: table of counts (or proportions)

Visual: barplot

300

Code for EDA for a categorical variable?

Quant: table(dat$mjage)

Visual: Barplot

Base R: barplot(table(dat$mjage))

ggplot (melt() here is from the reshape2 package):

tab1 <- melt(table(dat$mjage)) %>% rename(mjage=Var1)

ggplot(data = tab1, aes(x=mjage, y=value)) + 

  geom_bar(stat = "identity")

300

How do you test for normality?

Q-Q plot of the residuals.
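
A short sketch of a normal Q-Q plot of regression residuals. The model (mpg on wt in the built-in mtcars data) is an assumed stand-in for the course regression.

```r
# Assumption: mtcars and the mpg ~ wt model are stand-ins for the course data.
fit <- lm(mpg ~ wt, data = mtcars)
r <- resid(fit)

# Points close to the reference line suggest roughly normal residuals
qqnorm(r)
qqline(r)
```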

300

What do the functions pnorm and qnorm do?

pnorm calculates what proportion of a normally-distributed population is less than a given value.

qnorm is the inverse of pnorm: given a proportion, it returns the value below which that proportion of the population falls.
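
A quick sketch of the two functions undoing each other. The mean of 10 and sd of 2 in the last line are made-up values for illustration.

```r
# Standard normal: proportion of the population below 1.96
p <- pnorm(1.96)    # about 0.975

# qnorm goes the other way: value below which 97.5% of the population falls
q <- qnorm(0.975)   # about 1.96

# Round trip on a non-standard normal (mean = 10, sd = 2 are made up)
x <- qnorm(pnorm(12, mean = 10, sd = 2), mean = 10, sd = 2)
```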

300

What does it mean to do inference?

We calculate a statistic from a sample and use it to estimate the corresponding population parameter, with a quantified level of uncertainty.

400

EDA visual summary for quantitative vs. categorical variable?

Side-by-side boxplots or side-by-side histograms.

400

Code for side-by-side boxplots?

dat %>% ggplot(aes(x=factor(irsex), y=mjage)) + geom_boxplot()

400

How do you test for which points might be outliers, and whether you should worry about them?

Cook's distance plot.
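
A rough sketch of the Cook's distance check. The mpg ~ wt model on the built-in mtcars data is an assumed stand-in, and the 4/n cutoff is only a common rule of thumb, not a rule from this course.

```r
# Assumption: mtcars and the mpg ~ wt model are stand-ins for the course data.
fit <- lm(mpg ~ wt, data = mtcars)

# Cook's distance for each observation
d <- cooks.distance(fit)

# Built-in Cook's distance plot
plot(fit, which = 4)

# Assumption: 4/n is a rule-of-thumb cutoff for points worth a closer look
flagged <- names(d)[d > 4 / length(d)]
```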

400

For SAT, X~N(1500, 300), where 300 is the standard deviation. What minimum score puts you in the top 25%?

About 1702 or higher.
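
The answer can be checked with qnorm, reading 300 as the standard deviation (an assumption about the notation on the card).

```r
# Top 25% means 75% of scores fall below the cutoff
cutoff <- qnorm(0.75, mean = 1500, sd = 300)   # about 1702

# Equivalent: ask for the upper-tail proportion directly
cutoff_upper <- qnorm(0.25, mean = 1500, sd = 300, lower.tail = FALSE)
```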

500

EDA visual summary for quantitative vs quantitative variable?

Scatterplot.

500

Code for scatterplot?

plot(dat$mjage, dat$cigage)

or

dat %>% ggplot(aes(x=mjage, y=cigage)) + geom_point()

Remember to check which is x and y!

500

How do you test for linearity? What can you do if you don't have a linear trend?

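Check the scatterplot of y vs. x, and look for curved or otherwise systematic patterns in the residuals vs. x plot. If the trend is not linear, you can transform y or x (e.g. take a log).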
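
A sketch of how a transformation can straighten a curved trend. The data are simulated so that log(y) is linear in x; all names and numbers are illustrative assumptions.

```r
# Assumption: simulated data where log(y) is linear in x with slope 0.3.
set.seed(2)
x <- runif(100, 1, 10)
y <- exp(0.5 + 0.3 * x + rnorm(100, sd = 0.2))

fit_raw <- lm(y ~ x)        # fitted on the original scale
fit_log <- lm(log(y) ~ x)   # fitted after log-transforming y

# Compare residuals vs. x for the two fits
plot(x, resid(fit_raw), ylab = "Residuals (raw y)")
abline(h = 0, lty = 2)
plot(x, resid(fit_log), ylab = "Residuals (log y)")
abline(h = 0, lty = 2)
```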

500

For SAT, X~N(1500, 300). What % of takers fall between 1200 and 1800?

About 68% (1200 and 1800 are one standard deviation either side of the mean).
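
The same answer from pnorm, again reading 300 as the standard deviation (an assumption about the notation on the card).

```r
# Proportion below 1800 minus proportion below 1200
prop <- pnorm(1800, mean = 1500, sd = 300) - pnorm(1200, mean = 1500, sd = 300)
# about 0.68
```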