DATA SCIENCE EXAM 2 FIRST JEOPARDY

Logistic Regression

KNN & Misclassification

Clustering & PCA

Model Evaluation

Tree Models & Ensembles

100

This classification method estimates the probability of an observation belonging to a specific category.

What is logistic regression?

100

K-nearest neighbors estimates the label of a point based on what?

What are the labels of nearby points?

100

PCA is used for what purpose?

What is dimensionality reduction?

100

Cross-validation is used to estimate what?

What is how well a model performs on unseen data?

100

Decision trees use this type of splitting method.

What is recursive binary splitting?

200

Logistic regression maximizes this to determine its coefficients.

What is likelihood?

200

If K is too small, the model tends to do what?

What is overfit (jagged boundaries)?

200

In PCA, the new components are ___ with one another.

What is uncorrelated (orthogonal)?

200

How many folds is typically a good balance between bias and computation in cross-validation?

What is 5 folds?

200

Bagging combines many trees trained on bootstrap samples to do what?

What is reducing variance and improving stability?

300

Logistic regression predicts this type of variable.

What is a categorical (binary) variable?

300

The misclassification rate is calculated as what?

What is (# Incorrectly Classified) / (Total Records)?

300

K-means clustering minimizes this measure.

What is within-cluster variation?

300

Bootstrapping involves sampling with or without replacement?

What is with replacement?

300

Random Forest differs from bagging by doing what at each split?

What is selecting a random subset of predictors?

400

The output of logistic regression is converted using this mathematical function to keep values between 0 and 1.

What is the sigmoid (logistic) function?

400

In a confusion matrix, the false positive rate is computed as?

What is FP / (FP + TN), or (# Actual negatives classified as positive) / (Total actual negatives)?

400

Hierarchical clustering can be visualized using this type of chart.

What is a dendrogram?

400

Lift compares a model’s performance to random chance as a ratio of what?

What is (Response rate with model) / (Response rate at random)?

400

For classification in a Random Forest, how many predictors are typically considered at each split?

What is √p (square root of total predictors)?

500

Given coefficients Intercept = -8.7421 and Balance = 0.0042, which customer (A with $4000 or B with $1200) has a higher approval probability?

Who is Customer A? (≈ 0.9997 vs. 0.0241)
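
The comparison above can be checked with a short calculation. This is a minimal sketch, assuming the model is the standard one-predictor logistic form p = 1 / (1 + e^-(Intercept + Balance × balance)); the function name approval_probability is hypothetical and only used for illustration.

```python
import math

def approval_probability(balance, intercept=-8.7421, slope=0.0042):
    """Apply the sigmoid to the linear predictor (log-odds)."""
    log_odds = intercept + slope * balance
    return 1.0 / (1.0 + math.exp(-log_odds))

p_a = approval_probability(4000)  # Customer A, $4000 balance
p_b = approval_probability(1200)  # Customer B, $1200 balance

# Print both probabilities to four decimals for comparison.
print(f"A: {p_a:.4f}  B: {p_b:.4f}")
```

Because the Balance coefficient is positive, the larger balance always gives the higher probability; the calculation only tells you by how much.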

500

Given 40 true positives, 10 false negatives, 15 false positives, and 35 true negatives, what is the misclassification rate?

What is 0.25 (25%)?
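
As a quick check of the arithmetic, the sketch below plugs the four counts from the clue into the misclassification formula, (# Incorrectly Classified) / (Total Records). It also computes the false positive rate, FP / (FP + TN), from the same counts. The variable names are illustrative.

```python
# Counts taken from the clue.
tp, fn, fp, tn = 40, 10, 15, 35

total = tp + fn + fp + tn                    # 100 records
misclassification_rate = (fp + fn) / total   # wrong predictions / all records
false_positive_rate = fp / (fp + tn)         # wrong among actual negatives

print(misclassification_rate)  # 0.25
print(false_positive_rate)     # 0.3
```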

500

Why is standardization important before clustering?

What is preventing features with larger numerical ranges from dominating the distance calculation?
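
A small illustration of why scale matters for distance-based clustering. This is a sketch on made-up data: the three customer records, the choice of Euclidean distance, and z-score standardization with the sample standard deviation are all assumptions for the example.

```python
import math
import statistics

# Hypothetical records: (annual income in dollars, age in years).
customers = [(30_000, 25), (32_000, 60), (90_000, 27)]

def euclidean(a, b):
    """Straight-line distance between two records."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def standardize(rows):
    """Z-score each column: subtract its mean, divide by its sample std dev."""
    cols = list(zip(*rows))
    means = [statistics.mean(c) for c in cols]
    stdevs = [statistics.stdev(c) for c in cols]
    return [tuple((v - m) / s for v, m, s in zip(row, means, stdevs))
            for row in rows]

# Raw scale: the 35-year age gap between customers 0 and 1 barely registers
# next to the income differences measured in dollars.
raw_01 = euclidean(customers[0], customers[1])
raw_02 = euclidean(customers[0], customers[2])

# Standardized scale: both features contribute on comparable terms.
scaled = standardize(customers)
scaled_01 = euclidean(scaled[0], scaled[1])
scaled_02 = euclidean(scaled[0], scaled[2])

print(f"raw:    d(0,1)={raw_01:.1f}  d(0,2)={raw_02:.1f}")
print(f"scaled: d(0,1)={scaled_01:.2f}  d(0,2)={scaled_02:.2f}")
```

On the raw data, customer 0 looks far closer to customer 1 than to customer 2 purely because income is measured in larger units; after standardization the two distances are of similar size.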

500

If a model captures 8 of 20 buyers in the top 10% of customers, what is the lift?

What is 4?
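
The lift figure follows directly from the ratio (Response rate with model) / (Response rate at random). A minimal sketch, assuming the 20 buyers are all the buyers in the customer base, so that contacting a random 10% of customers would be expected to reach 10% of them:

```python
buyers_captured = 8     # buyers found in the model's top 10% of customers
total_buyers = 20       # assumed to be every buyer in the customer base
fraction_targeted = 0.10

# Share of all buyers captured by the model vs. the share expected at random.
capture_rate = buyers_captured / total_buyers   # 0.4
lift = capture_rate / fraction_targeted

print(lift)
```

A lift of 1 would mean the model does no better than picking customers at random.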

500

Gradient Boosting differs from bagging because it focuses on reducing what?

What is bias (and also variance), by sequentially fitting each new tree to the errors of the earlier ones?