This parameter in K-means clustering sets how many centroids are randomly placed over the dataset.
What is K?
Using this technique to fill in missing data artificially inflates the center of our histogram and understates the variance of our dataset.
What is imputing the mean?
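A minimal sketch of why mean imputation shrinks the variance, using only the standard library. The toy values and the use of `None` to mark missing entries are assumptions made for illustration.

```python
import statistics

# Hypothetical toy data; None marks a missing value.
observed = [2.0, 4.0, 6.0, 8.0, 10.0]
with_gaps = observed + [None, None, None]

# Fill each gap with the mean of the observed values.
mean_value = statistics.mean(observed)
imputed = [mean_value if v is None else v for v in with_gaps]

# The mean is unchanged, but more values now sit exactly at the center,
# so the population variance drops.
variance_before = statistics.pvariance(observed)
variance_after = statistics.pvariance(imputed)
print(mean_value, statistics.mean(imputed))
print(variance_before, variance_after)
```

The more values that are imputed this way, the taller the spike at the mean and the smaller the reported variance.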
This type of system tries to match a consumer to consumers who like similar things to make recommendations.
What is user-based collaborative filtering?
True or False: when forecasting further out from our dataset, the accuracy increases.
What is False? The accuracy decreases as we move further away from our dataset.
Rather than calculating how likely an event is to happen, we calculate how likely an event is to happen given THIS
What is a PRIOR?
This preprocessing trick is needed for distance-based clustering models, so that no single feature dominates the distance.
What is Scaling?
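One common form of scaling is standardization (z-scores). The sketch below assumes two hypothetical features, `income` and `age`, on very different scales; the helper name `standardize` is illustrative, not from any library.

```python
import statistics

def standardize(values):
    """Rescale values to mean 0 and standard deviation 1 (z-scores)."""
    mean = statistics.mean(values)
    std = statistics.pstdev(values)
    return [(v - mean) / std for v in values]

# Hypothetical features on very different scales.
income = [30000.0, 45000.0, 60000.0, 75000.0]
age = [25.0, 35.0, 45.0, 55.0]

income_scaled = standardize(income)
age_scaled = standardize(age)

# After scaling, both features contribute comparably to any distance.
print(income_scaled)
print(age_scaled)
```

Without this step, a distance computed on the raw columns would be driven almost entirely by `income`.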
The data of interest is not systematically different between respondents and nonrespondents.
What is Missing Completely At Random?
This distance algorithm is best described as following a grid pattern of short right angles.
What is Manhattan distance? (Also called taxicab or city block distance.)
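A small sketch comparing the grid-style distance with straight-line (Euclidean) distance. The function names and the sample points are chosen for illustration.

```python
import math

def manhattan(p, q):
    """Sum of absolute coordinate differences (walking along a grid)."""
    return sum(abs(a - b) for a, b in zip(p, q))

def euclidean(p, q):
    """Straight-line distance between two points."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

print(manhattan((0, 0), (3, 4)))  # 7
print(euclidean((0, 0), (3, 4)))  # 5.0
```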
This describes the term for when fluctuations occur over set intervals of time.
What is seasonality?
The result of our Bayesian inference.
What is a Posterior?
This model is best used on odd-shaped clusters.
What is DBSCAN?
When using PCA we make these two assumptions about our data.
What is,
This vector sits at an angle of at or near 90 degrees to another vector, so the cosine of the angle between them is at or near 0.
What is orthogonal?
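A minimal sketch of checking orthogonality with cosine similarity. The helper `cosine_similarity` is written by hand here for illustration and the vectors are made up.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Perpendicular vectors: the dot product is 0, so the cosine is 0.
cos_perpendicular = cosine_similarity((1, 0), (0, 1))
angle_degrees = math.degrees(math.acos(cos_perpendicular))

# Vectors pointing the same way: the cosine is (approximately) 1.
cos_parallel = cosine_similarity((1, 2), (2, 4))

print(cos_perpendicular, angle_degrees, cos_parallel)
```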
The amount of correlation between a variable and a lag of itself that is not explained by correlations at shorter lags.
What is partial autocorrelation?
True or False. Giving drug A to all weekend appointments and drug B to all weekday appointments is a true experiment.
What is False?
This clustering evaluation metric is the sum of squared distances from each point to its cluster's centroid.
What is inertia?
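Inertia can be sketched by hand as follows. The points, centroids, and cluster labels are hypothetical, as if a K-means run had already assigned them.

```python
def inertia(points, centroids, labels):
    """Sum of squared distances from each point to its assigned centroid."""
    total = 0.0
    for point, label in zip(points, labels):
        centroid = centroids[label]
        total += sum((a - b) ** 2 for a, b in zip(point, centroid))
    return total

# Hypothetical clustering result: two clusters of two points each.
points = [(1.0, 1.0), (1.0, 3.0), (8.0, 8.0), (10.0, 8.0)]
centroids = [(1.0, 2.0), (9.0, 8.0)]
labels = [0, 0, 1, 1]

print(inertia(points, centroids, labels))  # 4.0
```

Lower inertia means tighter clusters, but it always falls as K grows, so it is usually read alongside other evidence rather than minimized on its own.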
This is the difference between feature extraction vs feature elimination.
A cosine similarity of .99 and a pairwise distance of .01 indicate this.
What is these two things are VERY similar?
This time series algorithm is useful when we want to model longer term data with sudden fluctuations
What is ARIMA? (or SARIMA or SARIMAX)
Our prior and posterior distributions belong to the same family, so they can be used together easily.
What is Conjugacy?
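A standard example of conjugacy is a Beta prior with binomial (success/failure) data: the posterior is again a Beta distribution. The sketch below assumes a Beta(2, 2) prior and made-up counts; `update_beta` is an illustrative helper name.

```python
def update_beta(alpha, beta, successes, failures):
    """Posterior Beta parameters after observing binomial data."""
    return alpha + successes, beta + failures

# Hypothetical prior belief and observed data.
prior = (2, 2)
posterior = update_beta(*prior, successes=7, failures=3)

# Mean of a Beta(alpha, beta) distribution is alpha / (alpha + beta).
posterior_mean = posterior[0] / (posterior[0] + posterior[1])

print(posterior)       # (9, 5)
print(posterior_mean)
```

Because the posterior stays in the Beta family, it can serve directly as the prior for the next batch of data.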
In DBSCAN, this defines the maximum distance within which data points count as neighbors.
What is Epsilon?
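A rough sketch of the neighborhood lookup that epsilon controls. This is only the neighbor search, not a full DBSCAN implementation, and it assumes Euclidean distance and Python 3.8+ for `math.dist`. The helper name and sample points are hypothetical.

```python
import math

def neighbors_within(points, index, eps):
    """Indices of points within eps of points[index], including itself."""
    center = points[index]
    return [i for i, p in enumerate(points) if math.dist(center, p) <= eps]

# Hypothetical data: three points close together and one far away.
points = [(0.0, 0.0), (0.5, 0.0), (0.0, 0.6), (5.0, 5.0)]

print(neighbors_within(points, 0, eps=1.0))  # [0, 1, 2]
print(neighbors_within(points, 3, eps=1.0))  # [3]
```

A larger epsilon pulls more points into each neighborhood and tends to merge clusters; a smaller one leaves more points isolated as noise.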
The error in this process:
What is fit on y? (PCA is unsupervised)
This describes the cold start problem.
What is when early iterations of a user-based recommender have too few ratings or users to recommend effectively?
DAILY DOUBLE! Wager up to 1k!
Describe Benford's Law.
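As a study aid: Benford's Law says that in many naturally occurring sets of numbers, the leading digit d appears with probability log10(1 + 1/d), so small leading digits are far more common than large ones. A minimal sketch of those probabilities, with an illustrative function name:

```python
import math

def benford_probability(digit):
    """Expected share of numbers whose leading digit is `digit` (1 to 9)."""
    return math.log10(1 + 1 / digit)

# Print the expected frequency of each leading digit.
for d in range(1, 10):
    print(d, round(benford_probability(d), 3))
```

A leading 1 is expected roughly 30% of the time and a leading 9 under 5% of the time.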
This word vectorizer finds similar words using a pre-trained model that maps each word to a dense vector.
What is Word2Vec?