What link function does logistic regression use to connect the linear predictor to probabilities?
The logit function (log-odds).
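For illustration, a minimal NumPy sketch (hypothetical function names): the logit maps a probability to log-odds, and the sigmoid maps log-odds back to a probability.
    import numpy as np

    def logit(p):
        # log-odds: log(p / (1 - p))
        return np.log(p / (1 - p))

    def sigmoid(z):
        # inverse of the logit
        return 1.0 / (1.0 + np.exp(-z))

    z = logit(0.8)           # ~1.386
    print(z, sigmoid(z))     # sigmoid recovers 0.8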
KNN has what computational drawback on large datasets, and how can it be mitigated?
It is computationally expensive at prediction time (O(n)) — mitigate with KD-trees, ball trees, or dimensionality reduction.
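A hedged scikit-learn sketch (illustrative data and hyperparameters) showing how a tree-based index replaces the brute-force scan:
    from sklearn.datasets import make_classification
    from sklearn.neighbors import KNeighborsClassifier

    X, y = make_classification(n_samples=5000, n_features=10, random_state=0)

    # 'kd_tree' (or 'ball_tree') indexes the training points so each query
    # no longer scans all n training examples
    knn = KNeighborsClassifier(n_neighbors=5, algorithm="kd_tree")
    knn.fit(X, y)
    print(knn.predict(X[:3]))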
In K-means, how are centroids recomputed after re-assignment?
Each centroid is recomputed as the mean vector of the points currently assigned to its cluster, at every iteration.
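A minimal NumPy sketch of that update step (toy data):
    import numpy as np

    X = np.array([[1.0, 2.0], [1.5, 1.8], [8.0, 8.0], [9.0, 11.0]])
    labels = np.array([0, 0, 1, 1])   # current cluster assignments

    # each centroid becomes the mean of the points assigned to it
    centroids = np.array([X[labels == k].mean(axis=0) for k in range(2)])
    print(centroids)                  # [[1.25 1.9 ] [8.5  9.5 ]]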
What is the main advantage of k-fold cross-validation over a simple train-test split?
Lower variance estimate of test error by averaging over k folds, using all data for both training and testing.
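A short scikit-learn sketch (illustrative model and k = 5):
    from sklearn.datasets import load_iris
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import cross_val_score

    X, y = load_iris(return_X_y=True)

    # each of the 5 folds serves once as the held-out set; the scores are averaged
    scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=5)
    print(scores.mean(), scores.std())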
How does bagging reduce variance?
By averaging predictions from many trees fit on different bootstrap samples, which smooths out their individual random fluctuations.
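A hedged scikit-learn sketch (illustrative data and settings):
    from sklearn.datasets import make_classification
    from sklearn.ensemble import BaggingClassifier
    from sklearn.tree import DecisionTreeClassifier

    X, y = make_classification(n_samples=500, random_state=0)

    # 100 trees, each trained on its own bootstrap resample; predictions are aggregated
    bag = BaggingClassifier(DecisionTreeClassifier(), n_estimators=100, random_state=0)
    bag.fit(X, y)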
Why is logistic regression preferred over linear regression for classification tasks?
Because it constrains outputs to [0, 1] and models the log-odds of the categorical outcome, whereas linear regression can produce predictions outside [0, 1] that are not valid probabilities.
When features are on different scales, KNN performance suffers. What is the fix and why?
Standardize or normalize features so one feature doesn’t dominate the distance metric.
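A sketch using a scikit-learn pipeline (illustrative data), so the scaling learned on the training data is reapplied at prediction time:
    from sklearn.datasets import make_classification
    from sklearn.neighbors import KNeighborsClassifier
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import StandardScaler

    X, y = make_classification(n_samples=300, random_state=0)

    # StandardScaler gives every feature mean 0 and unit variance before distances are computed
    knn = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5))
    knn.fit(X, y)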
Which clustering algorithm builds clusters in a “bottom-up” approach?
Agglomerative hierarchical clustering.
In bootstrap sampling, about what proportion of the original data appears in each resample on average?
About 63.2% of unique observations.
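The reasoning: each draw misses a given observation with probability 1 - 1/n, so a full resample of size n misses it with probability (1 - 1/n)^n, which approaches e^(-1) ≈ 0.368; the inclusion rate is therefore about 0.632. A quick NumPy check (hypothetical n):
    import numpy as np

    rng = np.random.default_rng(0)
    n = 10_000
    resample = rng.integers(0, n, size=n)   # one bootstrap resample (with replacement)
    print(np.unique(resample).size / n)     # ~0.632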
Why do random forests add randomness when choosing features at each split?
To de-correlate trees, making the averaged ensemble more robust.
What problem occurs when classes are perfectly separable?
Complete separation → coefficients diverge to ±∞ and the maximum-likelihood estimates become unstable.
How can KNN handle categorical predictors?
Use Hamming distance on the categorical features (or one-hot encode them), and Gower's distance for mixed numeric/categorical data.
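A sketch assuming label-encoded categorical features; scikit-learn's KNN accepts a Hamming metric, which counts the fraction of mismatched categories:
    import numpy as np
    from sklearn.neighbors import KNeighborsClassifier

    # toy label-encoded categorical features
    X = np.array([[0, 1, 2], [0, 1, 0], [1, 2, 2], [1, 0, 1]])
    y = np.array([0, 0, 1, 1])

    knn = KNeighborsClassifier(n_neighbors=1, metric="hamming")
    knn.fit(X, y)
    print(knn.predict([[0, 1, 1]]))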
Why is standardization critical before performing K-means on variables like “Age” and “Income”?
Because features with larger ranges dominate Euclidean distance, distorting cluster boundaries.
A lift of 4 means what in business terms?
The model is 4× better than random at identifying positive cases in the targeted segment.
List two hyperparameters of Gradient Boosting that control bias–variance trade-off.
Number of trees (B) and the learning rate (λ): a small λ reduces variance but requires a larger B.
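A hedged scikit-learn sketch (illustrative values for both knobs):
    from sklearn.datasets import make_classification
    from sklearn.ensemble import GradientBoostingClassifier

    X, y = make_classification(n_samples=500, random_state=0)

    # a small learning rate (shrinkage) paired with a larger number of trees
    gbm = GradientBoostingClassifier(n_estimators=500, learning_rate=0.05, random_state=0)
    gbm.fit(X, y)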
What is the interpretation of the exponential of a logistic coefficient?
It is the odds ratio — the multiplicative change in odds for a one-unit increase in x.
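A tiny worked example with a hypothetical fitted coefficient:
    import numpy as np

    beta = 0.7            # hypothetical coefficient on feature x
    print(np.exp(beta))   # ~2.01: a one-unit increase in x roughly doubles the odds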
Why does a low K reduce bias but increase variance?
Each prediction depends on fewer samples → more sensitivity to noise (variance↑) but closer fit to training data (bias↓).
How is the total variance in PCA partitioned among the principal components?
Each component captures a portion of total variance proportional to its eigenvalue; components are orthogonal.
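A short scikit-learn illustration: explained_variance_ratio_ is each component's eigenvalue divided by the total variance, and the ratios sum to 1.
    from sklearn.datasets import load_iris
    from sklearn.decomposition import PCA

    X, _ = load_iris(return_X_y=True)

    pca = PCA().fit(X)
    print(pca.explained_variance_ratio_)          # one share per orthogonal component
    print(pca.explained_variance_ratio_.sum())    # 1.0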
What does a Gini coefficient of 0.5 mean compared to 1 or 0?
Moderate rank-ordering ability (0 = none, 1 = perfect).
What happens when the learning rate is too high in Gradient Boosting?
Overfitting and divergence — the model overshoots the minimum and fits noise.
Which two regularization techniques are often used to stabilize logistic regression and what do they penalize?
L1 (Lasso) penalizes |β| to induce sparsity; L2 (Ridge) penalizes β² to reduce variance and multicollinearity.
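A hedged scikit-learn sketch (illustrative data; the L1 penalty needs a solver that supports it, such as liblinear or saga):
    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression

    X, y = make_classification(n_samples=300, n_features=20, random_state=0)

    lasso_logit = LogisticRegression(penalty="l1", solver="liblinear").fit(X, y)  # induces sparsity
    ridge_logit = LogisticRegression(penalty="l2").fit(X, y)                      # shrinks coefficients

    print((lasso_logit.coef_ != 0).sum(), "nonzero L1 coefficients out of", X.shape[1])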
What metric should replace accuracy for highly imbalanced binary classification problems?
Precision, recall, F1-score, or AUC — accuracy can be misleading when one class dominates.
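A toy scikit-learn illustration (hypothetical labels and scores): accuracy looks fine here even though half the positives are missed.
    from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score, roc_auc_score

    y_true  = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]      # 8 negatives, 2 positives
    y_pred  = [0, 0, 0, 0, 0, 0, 0, 1, 1, 0]
    y_score = [0.1, 0.2, 0.1, 0.3, 0.2, 0.1, 0.4, 0.6, 0.9, 0.45]

    print(accuracy_score(y_true, y_pred))     # 0.8, despite a missed positive
    print(precision_score(y_true, y_pred))    # 0.5
    print(recall_score(y_true, y_pred))       # 0.5
    print(f1_score(y_true, y_pred))           # 0.5
    print(roc_auc_score(y_true, y_score))     # ~0.94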
Explain one limitation of PCA for classification tasks.
PCA maximizes variance, not class separation, so principal components may not align with discriminative features.
Why can the χ² goodness-of-fit test become unreliable in large datasets?
Because tiny deviations become statistically significant even if practically negligible → p-values misleading.
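A SciPy illustration: the same half-percentage-point deviation from a 50/50 split is insignificant at n = 1,000 but highly significant at n = 1,000,000.
    from scipy.stats import chisquare

    small = chisquare(f_obs=[505, 495], f_exp=[500, 500])                     # n = 1,000
    large = chisquare(f_obs=[505_000, 495_000], f_exp=[500_000, 500_000])     # n = 1,000,000

    print(small.pvalue)   # ~0.75, not significant
    print(large.pvalue)   # effectively 0, significant despite a negligible deviation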
Conceptually, how do bagging and boosting differ in how they build models?
Bagging builds independent models in parallel to reduce variance; boosting builds sequential models that focus on previous errors to reduce bias.