This is the full name of the basic linear regression known as OLS
Ordinary Least Squares
What does bootstrapping mean?
Random sampling with replacement
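A minimal sketch of the idea, assuming a small made-up dataset; the helper name `bootstrap_sample` is hypothetical. Each bootstrap sample draws as many items as the original data holds, with replacement, so some items can repeat and others can be left out:

```python
import random

def bootstrap_sample(data, rng):
    """Draw len(data) items from data uniformly at random, with replacement."""
    return rng.choices(data, k=len(data))

rng = random.Random(0)   # seeded only so the sketch is repeatable
data = [2, 4, 6, 8, 10]  # hypothetical observations

# Resample many times and record the mean of each bootstrap sample.
boot_means = []
for _ in range(1000):
    sample = bootstrap_sample(data, rng)
    boot_means.append(sum(sample) / len(sample))

# The spread of boot_means approximates the sampling variability of the mean.
boot_mean_of_means = sum(boot_means) / len(boot_means)
```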
In this movie, Cady Heron is a hit with The Plastics, the A-list girl clique at her new school, until she makes the mistake of falling for Aaron Samuels, the ex-boyfriend of alpha Plastic Regina George.
It is also a clustering technique that clusters all data points and requires you to specify the number of clusters.
KMean Girls
KMeans + Mean Girls
What is the difference between supervised and unsupervised learning?
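Supervised learning trains on data with a known target (labels); unsupervised learning has no target and looks for structure in the features alone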
In a K Nearest Neighbors classifier, this distance metric is also the shortest distance between two points in vector space
Euclidean
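As a rough illustration, Euclidean distance is the square root of the summed squared coordinate differences. The function name `euclidean_distance` is an assumption made for this sketch:

```python
import math

def euclidean_distance(p, q):
    """Straight-line distance between two points of equal dimension."""
    if len(p) != len(q):
        raise ValueError("points must have the same number of coordinates")
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

d = euclidean_distance((0, 0), (3, 4))  # 3-4-5 right triangle, so d is 5.0
```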
This is the full name of the tree-based model known as CART
Classification and Regression Trees
What is the difference between AdaBoost and GradientBoost?
What are the two hyperparameters to tune in DBSCAN? Explain what they mean
epsilon (eps): the “searching” distance when attempting to build a cluster, i.e. the maximum distance between two points for them to count as neighbors. Euclidean distance is the default metric.
min_samples: the minimum number of points needed within epsilon of a point to form a dense region. For example, if we set min_samples to 5, a point needs at least 5 points in its neighborhood to start a cluster.
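The two hyperparameters can be seen working together in a small sketch that only finds DBSCAN's core points; it is not a full clustering implementation. The helper `core_points` and the sample data are hypothetical, and counting a point as part of its own neighborhood is an assumption that follows scikit-learn's convention:

```python
import math

def core_points(points, eps, min_samples):
    """Return indices of points with at least min_samples points within eps.

    Assumption: each point counts toward its own neighborhood.
    """
    core = []
    for i, p in enumerate(points):
        neighbors = sum(1 for q in points if math.dist(p, q) <= eps)
        if neighbors >= min_samples:
            core.append(i)
    return core

# Four points in a tight square plus one far-away point (hypothetical data).
points = [(0, 0), (0, 1), (1, 0), (1, 1), (10, 10)]
core = core_points(points, eps=1.5, min_samples=4)
```

With these values the four square corners are core points and the distant point is not; raising min_samples to 5 leaves no core points at all.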
What is the difference between feature selection and feature extraction?
This is the link function that bends a Linear Regression into a Logistic Regression
Logit function
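A short sketch of the logit and its inverse, the sigmoid, which maps the linear predictor back to a probability. The function names are chosen here for illustration:

```python
import math

def logit(p):
    """Log-odds of a probability p, defined for 0 < p < 1."""
    return math.log(p / (1 - p))

def sigmoid(z):
    """Inverse of the logit: maps any real number back into (0, 1)."""
    return 1 / (1 + math.exp(-z))

z = logit(0.8)       # log(4), the log-odds of an 80% probability
p_back = sigmoid(z)  # recovers approximately 0.8
```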
This is the full name of the graphical plot known as ROC curve
Receiver Operating Characteristic
What are 3 ways to reduce overfitting when working with tree-based ensemble models?
reduce max_depth
increase min_samples_split
increase min_samples_leaf
What are the pros and cons of KMeans?
Pros: Fast, simple, easy to understand
Cons: Sensitive to outliers, clusters every data point, requires that you specify the number of clusters, influenced by centroid initialization
What is the critical difference between KMeans and KNN?
KNN needs labeled points and is thus supervised learning, while KMeans does not need labels and is thus unsupervised learning.
This model combines both the ridge (L2) and LASSO (L1) regularization penalties
Elastic Net
This is the full name for the transformer that tells us the relative importance of a word (TFIDF)
Term frequency-inverse document frequency
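A minimal sketch of one common textbook form of TF-IDF, (term count / document length) * log(N / document frequency). This is an assumption for illustration; libraries such as scikit-learn use smoothed variants that give different numbers. The function name `tf_idf` and the toy corpus are made up:

```python
import math

def tf_idf(term, doc, corpus):
    """Textbook TF-IDF for one term in one tokenized document."""
    tf = doc.count(term) / len(doc)
    df = sum(1 for d in corpus if term in d)
    if df == 0:
        return 0.0  # term never appears in the corpus
    return tf * math.log(len(corpus) / df)

corpus = [
    ["the", "cat", "sat"],
    ["the", "dog", "barked"],
    ["the", "cat", "purred"],
]
score_the = tf_idf("the", corpus[0], corpus)  # in every document, so idf is log(1) = 0
score_sat = tf_idf("sat", corpus[0], corpus)  # in one document only, so it scores higher
```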
How does a decision tree use Gini to decide which variable to split on?
What is 1 pro and 1 con of DBSCAN?
Pros:
Cons:
In PCA, this is the term that refers to breaking down a covariance matrix into eigenvalues and eigenvectors
Spectral decomposition or eigendecomposition
This is the term describing when a set of random variables all have the same constant variance
Homoskedasticity
This is the full name of the clustering method known as DBSCAN
Density-Based Spatial Clustering of Applications with Noise
What are the advantages of a random forest model?
By adding randomization (bootstrapped samples and a random subset of features at each split), the trees in the forest are less correlated, which results in lower variance and a more robust model.
By averaging predictions from many trees, the individual trees' errors often cancel out, so the combined prediction tends to be closer to the true values.
What does the silhouette score tell us?
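Each point's silhouette score compares how close it is to its own cluster with how close it is to the nearest other cluster. It ranges from -1 to 1, and higher means better-separated clusters. The average silhouette score is the average of each point's score.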
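A rough sketch of the per-point calculation, (b - a) / max(a, b), where a is the mean distance to the other points in the same cluster and b is the lowest mean distance to the points of any other cluster. The function name and data are hypothetical, and returning 0 for edge cases (a singleton cluster, or only one cluster overall) is a simplifying assumption of this sketch:

```python
import math

def silhouette_scores(points, labels):
    """Per-point silhouette score using Euclidean distance."""
    scores = []
    for i, p in enumerate(points):
        # Group distances from p to every other point by cluster label.
        by_cluster = {}
        for j, q in enumerate(points):
            if i != j:
                by_cluster.setdefault(labels[j], []).append(math.dist(p, q))
        own = by_cluster.pop(labels[i], [])
        if not own or not by_cluster:
            scores.append(0.0)  # edge case: assumed to score 0 in this sketch
            continue
        a = sum(own) / len(own)
        b = min(sum(d) / len(d) for d in by_cluster.values())
        denom = max(a, b)
        scores.append((b - a) / denom if denom > 0 else 0.0)
    return scores

points = [(0, 0), (0, 1), (10, 0), (10, 1)]       # two obvious groups
good = silhouette_scores(points, [0, 0, 1, 1])    # labels match the groups
bad = silhouette_scores(points, [0, 1, 0, 1])     # labels cut across the groups
avg_good = sum(good) / len(good)
avg_bad = sum(bad) / len(bad)
```

The well-matched labeling gives an average close to 1, while the labeling that cuts across the groups gives a negative average.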
Name the pros and cons of dimensionality reduction.
Pros:
Cons:
What are the three types of missingness? Give examples
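MCAR (missing completely at random): no pattern to missingness
MAR (missing at random): missingness is conditional on other observed values
NMAR (not missing at random): missingness depends on the unobserved values themselves, so there is a systematic difference between observed and missing data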