Compare Those Models!
Measuring Your Performance
Regressin'
R You Ready to Rumble?
Potent Potables
100

When comparing models using confusion matrix metrics (for example, accuracy), which data set should we never use to choose a model: training, validation, or testing?

Training

100

For false negatives, I have predicted Y equals what value (and was wrong)?

Y=0

100

True or False: adjusted R^2 will never be greater than 1.

True!

100

True or False: when running a regression in R with a categorical variable, I need to create 0 and 1 (binary) dummy variables before I can use that variable in the regression.

False
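Why false? A minimal sketch, assuming the categorical variable is stored as a factor. The toy data frame and its column names (score, drink) are made up for illustration; only lm(), coef(), and factor() are real base R.

```r
# Hypothetical toy data: 'drink' is a categorical predictor stored as a factor.
toy <- data.frame(
  score = c(5, 7, 6, 8, 4, 9),
  drink = factor(c("beer", "wine", "beer", "wine", "cider", "wine"))
)

# No hand-made dummy columns: lm() expands the factor on its own.
fit <- lm(score ~ drink, data = toy)

# One coefficient per non-baseline level ("beer", first alphabetically, is the baseline).
coef_names <- names(coef(fit))
print(coef_names)
```

R builds the 0/1 columns behind the scenes, which is why the coefficient names are the variable name pasted onto each non-baseline level.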

100

True or False: You will all deserve a drink after you take Exam 1.

True! (But I will also accept "false" for this, because you probably deserved a drink regardless of the exam.)

200

Like adjusted R^2 for linear regression, AIC for logistic regression accounts for the number of what in different models?

Variables

200

Tony and Bruce are trying to build a model to predict whether aliens will attack (Y=1) or not (Y=0). Assuming it is much more important to correctly predict true alien attacks, which performance measure should they use: TPR, TNR, or Accuracy?

TPR (True Positive Rate)

200

Tony and Bruce are trying to build a model to predict whether aliens will attack (Y=1) or not (Y=0). They choose to build a logistic regression model and discover the coefficient for "Number of Avengers" is positive. This means as the number of Avengers increases, what happens to the probability that aliens will attack?

Increases

200

glm(Y~X, data=training_data, family="binomial") will return a logistic regression model. What will R return if I run glm(Y~X, data=training_data)?

Linear regression model
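One way to check this yourself, as a sketch on simulated data (the data frame toy and its values are invented here, not the class data): fit glm() without a family and compare it to lm().

```r
# Simulated stand-in for training_data; values are arbitrary.
set.seed(1)
toy <- data.frame(X = rnorm(50))
toy$Y <- rbinom(50, 1, plogis(toy$X))

# Without a family argument, glm() falls back to its default, gaussian.
fit_default <- glm(Y ~ X, data = toy)
fit_lm <- lm(Y ~ X, data = toy)

family_name <- family(fit_default)$family
same_coefs <- isTRUE(all.equal(coef(fit_default), coef(fit_lm)))

print(family_name)
print(same_coefs)
```

The two fits should agree on the coefficients, which is the sense in which the default glm() call is a linear regression.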

200

If I split my wine quality data set into equal 20% validation data sets, how many times will I run my method under cross-validation?

5
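A rough sketch of where the 5 comes from, assuming a hypothetical data set of 100 rows and a placeholder where the model would be fit. The variable names are illustrative only.

```r
n <- 100   # hypothetical number of rows in the data set
k <- 5     # 100% / 20% per validation set = 5 folds

# Shuffle fold labels so each row lands in exactly one validation set.
set.seed(42)
fold_id <- sample(rep(1:k, length.out = n))

runs <- 0
for (fold in 1:k) {
  validation_rows <- which(fold_id == fold)   # 20% held out this round
  training_rows   <- which(fold_id != fold)   # remaining 80%
  # ... fit the method on training_rows, score it on validation_rows ...
  runs <- runs + 1
}
print(runs)
```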

300

Tony and Bruce are trying to build a model to predict whether aliens will attack or not. They need a model that anyone on their team, including Clint who uselessly has no mathematical background, can use. Assuming all three models have similar performance results, should they use a logistic regression, a classification tree, or a linear discriminant analysis model?

Classification tree

300

Which other performance measure can I get by doing 1 - FPR?

TNR (True Negative Rate)
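To see why, here is a small sketch with made-up confusion matrix counts (the four numbers are hypothetical, chosen only to make the arithmetic easy).

```r
# Hypothetical confusion-matrix counts (not from any real data set).
TP <- 40; FN <- 10   # cases where the actual Y = 1
FP <- 5;  TN <- 45   # cases where the actual Y = 0

TPR <- TP / (TP + FN)   # true positive rate
FPR <- FP / (FP + TN)   # false positive rate
TNR <- TN / (FP + TN)   # true negative rate

# FPR and TNR share a denominator (all actual negatives), so they sum to 1.
print(c(TPR = TPR, FPR = FPR, TNR = TNR))
```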

300

Tony and Bruce are trying to build a model to predict whether aliens will attack (Y=1) or not (Y=0). After running a logistic regression, they discover the coefficient for "Special Effects Budget" is negative. This means as the special effects budget increases, what happens to the odds of aliens not attacking?

Increases (*not* attacking is Y=0!)
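A quick numerical sketch of the same reasoning. The intercept, the slope, and the budget values below are invented for illustration; all that matters is that the slope is negative.

```r
# Hypothetical logistic regression coefficients (slope is negative).
b0 <- 0.5
b1 <- -0.3
budget <- c(1, 2, 3)   # increasing special effects budget, arbitrary units

# Logistic regression models log(odds of Y = 1) as b0 + b1 * x.
odds_attack    <- exp(b0 + b1 * budget)
odds_no_attack <- 1 / odds_attack   # odds of Y = 0 are the reciprocal

print(odds_attack)      # shrinking as budget grows
print(odds_no_attack)   # growing as budget grows
```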

300

If I want to draw a tree in R, what function should I use?

plot() (follow it with text() to label the splits)

300

You are analyzing a full data set with 1000 people. 700 of them prefer wine, while 300 of them prefer beer. If you calculate a baseline accuracy in order to predict if someone will prefer beer over wine, what baseline accuracy should you use to compare?

0.70
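The baseline is just "always guess the most common class." A minimal sketch, building the 700/300 split from the question as a character vector (the vector name is my own):

```r
# Recreate the 700 wine / 300 beer split from the question.
preference <- c(rep("wine", 700), rep("beer", 300))

# Baseline: always predict the majority class and see how often that is right.
baseline_accuracy <- max(table(preference)) / length(preference)
print(baseline_accuracy)
```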

400

Tony and Bruce are trying to build a model to understand the factors that cause aliens to attack or not. Steve suggests using an LDA model. Nat suggests using a logistic regression model. Nick suggests using a linear regression model. Clint suggests humans stop ticking off aliens with superior firepower, but no one listens to Clint because that would ruin the franchise. Which of the three suggested models should the team use for this?

Logistic regression

400

Which two values do we compare on an ROC curve? (That is, which values are on the y- and x-axis of an ROC curve.)

TPR (y-axis) and FPR, which equals 1 - TNR (x-axis)

400

Tony and Bruce are trying to build a model to predict whether aliens will attack (Y=1) or not (Y=0). They discover the probability of an alien attack tomorrow is 80%. What are the odds of an alien attack tomorrow?

4 (Or 8/2, 4/1, etc.)
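The conversion is odds = p / (1 - p). As a short sketch (the helper name prob_to_odds is made up here):

```r
# Hypothetical helper: convert a probability into odds.
prob_to_odds <- function(p) p / (1 - p)

odds <- prob_to_odds(0.80)
print(odds)   # 4, up to floating-point rounding
```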

400

I have two vectors in R, vector A and vector B:

A: 3.0    5.0

B: 3.0    4.0

What will I get in R if I run the command A*B?

9.0    20.0
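You can confirm this at the console. The sketch below also shows %*% for contrast, since that operator (not *) is the one that gives a dot product.

```r
A <- c(3.0, 5.0)
B <- c(3.0, 4.0)

# '*' multiplies element by element: 3*3 and 5*4.
elementwise <- A * B
print(elementwise)

# For contrast, '%*%' on two vectors gives the inner product as a 1x1 matrix.
inner <- A %*% B
print(inner)
```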

400

When running a classification tree to predict if someone prefers red wine over white wine, you notice the first split is made on Gender; the next split on both sides is made on Income. In a logistic regression for the same problem, which of these variables is guaranteed to be significant?

Neither

500

Put the following four methods in order from (expected) best for prediction to worst for prediction for a classification problem: Linear regression, LDA, QDA, and classification tree of size 1.

QDA (best), LDA, linear regression, tree of size 1 (worst)

500

Which other performance measure can I get by doing 1 - PPV?

FDR (False Discovery Rate)
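A small sketch with made-up counts of positive predictions (both numbers are hypothetical):

```r
# Hypothetical counts among cases the model predicted as Y = 1.
TP <- 30   # predicted 1, actually 1
FP <- 10   # predicted 1, actually 0

PPV <- TP / (TP + FP)   # positive predictive value
FDR <- FP / (TP + FP)   # false discovery rate

# Same denominator (all positive predictions), so PPV + FDR = 1.
print(c(PPV = PPV, FDR = FDR))
```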

500

Tony and Bruce are trying to build a model to predict whether aliens will attack (Y=1) or not (Y=0). After building their model with 100 variables, they discover the p-value of their model (the F-test) is 0.04. What is the minimum number of variables that must have individual coefficients with p-values of 0.04 or below?

0 (they could all be well over 0.04)

500

I've run a classification tree model called tree1 and made predictions from the tree model called tree.preds. What command do I have to give R to get the probability of Y=1 for my predictions?

tree.preds[,2]

500

You want to predict if a particular brand of vodka will be a high-selling product or not at a given bar. To do this, you collect a data set of 1000 bars at which this brand of vodka is sold. For your first split in a classification tree: what is the maximum Gini, the maximum entropy, and the maximum deviance you could get?

Maximum Gini: 0.5

Maximum entropy: 1

Maximum deviance: 2000
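Here is a sketch of how those three numbers can be reproduced. It assumes a two-class outcome at its most impure (a 50/50 split of the 1000 bars) and base-2 logarithms for entropy and deviance; the base-2 choice is my assumption, made because it reproduces the 2000 above. With natural logs the same deviance formula gives about 1386 instead.

```r
n <- 1000           # bars in the data set
p <- c(0.5, 0.5)    # worst case for a binary outcome: an even split
counts <- n * p     # 500 high sellers, 500 not

gini    <- 1 - sum(p^2)         # Gini impurity
entropy <- -sum(p * log2(p))    # entropy in bits (base-2 logs, assumed)

# Deviance as -2 * sum(n_k * log(p_k)), shown with both log bases.
deviance_log2 <- -2 * sum(counts * log2(p))
deviance_ln   <- -2 * sum(counts * log(p))

print(c(gini = gini, entropy = entropy,
        deviance_log2 = deviance_log2, deviance_ln = deviance_ln))
```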
