Data Preparation and Cleaning
Model Evaluation
Machine Learning Workflow
Misc.
Einstein Discovery
100

The practice of creating new variables from existing variables in a dataset.

What is feature engineering?

100

A plot of the false positive rate vs the true positive rate for a classification problem. 

What is an ROC curve?

100

The type of machine learning problem when you are predicting whether or not someone will buy your product.

What is classification?

100

The type of machine learning problem where you can use unlabeled data.

What is unsupervised learning?

100

The type of model ED uses.

What is regression (linear and logistic)?

200

Records in a dataset that are rare, extreme events.

What are outliers? 

200

A model that performs well on training data, but poorly on test data. 

What is overfitting?

200

The type of machine learning problem when you are predicting the temperature tomorrow. 

What is regression?

200

The type of scoring I would use to get a prediction immediately.

What is real time scoring?

200

The type of machine learning problem ED solves.

What are (binomial) classification and regression?

300

The method of replacing missing values with the mean, median, or mode.

What is imputation? 
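
As a rough illustration of the answer above, here is a minimal sketch of mean imputation. It assumes missing values are stored as `None`; the helper name `impute_mean` and the sample `ages` list are hypothetical, not part of any particular tool.

```python
from statistics import mean

def impute_mean(values):
    """Replace None entries with the mean of the observed (non-missing) values."""
    observed = [v for v in values if v is not None]
    fill = mean(observed)  # swap in statistics.median or statistics.mode as needed
    return [fill if v is None else v for v in values]

# Hypothetical column with two missing entries.
ages = [20, None, 40, 30, None]
filled = impute_mean(ages)
print(filled)
```

The same pattern works with the median or mode; only the statistic used for the fill value changes.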

300

The table of true/false positives and negatives used to see model performance at a chosen threshold value.

What is the confusion matrix?
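
To make the link between threshold and confusion matrix concrete, here is a small sketch that counts the four cells for a binary classifier at a given threshold. The function name, labels, and scores are made up for illustration.

```python
def confusion_matrix(y_true, scores, threshold):
    """Count TP, FP, TN, FN when scores >= threshold are predicted positive."""
    counts = {"TP": 0, "FP": 0, "TN": 0, "FN": 0}
    for actual, score in zip(y_true, scores):
        predicted = 1 if score >= threshold else 0
        if predicted == 1 and actual == 1:
            counts["TP"] += 1
        elif predicted == 1 and actual == 0:
            counts["FP"] += 1
        elif predicted == 0 and actual == 0:
            counts["TN"] += 1
        else:
            counts["FN"] += 1
    return counts

# Hypothetical labels and model scores.
y_true = [1, 0, 1, 1, 0, 0]
scores = [0.9, 0.6, 0.4, 0.8, 0.2, 0.55]

for threshold in (0.5, 0.7):
    print(threshold, confusion_matrix(y_true, scores, threshold))
```

Raising the threshold here trades false positives for (potentially) more false negatives, which is the trade-off an ROC curve summarizes across all thresholds.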

300

A portion of labeled data that is not used during model training or tuning.

What is a test/holdout set?

300

This type of machine learning algorithm is useful for image, text, and sound data. 

What is deep learning?

300

The type of feature engineering ED does automatically.

What is creating interaction terms?

400

The practice of replacing a single categorical column in a dataset with multiple 0/1 columns, one for each categorical value.

What is one-hot encoding? 
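
A minimal sketch of the idea, using plain Python rather than any specific library. The `one_hot` helper, the `is_` column prefix, and the sample `colors` column are assumptions chosen for illustration.

```python
def one_hot(column):
    """Turn one categorical column into one 0/1 column per distinct value."""
    categories = sorted(set(column))
    return [
        {f"is_{category}": int(value == category) for category in categories}
        for value in column
    ]

# Hypothetical categorical column.
colors = ["red", "green", "red", "blue"]
for row in one_hot(colors):
    print(row)
```

Each output row has exactly one column set to 1, which is where the name "one-hot" comes from.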

400

A model that fails to produce good results even on training data.

What is underfitting?

400

The method for handling imbalanced data where you repeat the occurrences of the underrepresented class to even out the number of records in each class.

What is oversampling?
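
One simple way to do this is random oversampling with replacement, sketched below. The `oversample` helper, the `label_of` callback, and the toy records are hypothetical; real projects often apply this only to the training split so that repeated records do not leak into the test set.

```python
import random
from collections import Counter

def oversample(records, label_of, seed=0):
    """Repeat records from smaller classes until every class matches the largest."""
    rng = random.Random(seed)
    by_class = {}
    for record in records:
        by_class.setdefault(label_of(record), []).append(record)
    target = max(len(group) for group in by_class.values())
    balanced = []
    for group in by_class.values():
        balanced.extend(group)
        # Draw the shortfall at random, with replacement, from the same class.
        balanced.extend(rng.choices(group, k=target - len(group)))
    return balanced

# Hypothetical imbalanced data: 6 "no" records and 2 "yes" records.
records = [("a", "no"), ("b", "no"), ("c", "no"), ("d", "no"),
           ("e", "no"), ("f", "no"), ("g", "yes"), ("h", "yes")]
balanced = oversample(records, label_of=lambda r: r[1])
print(Counter(label for _, label in balanced))
```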

400

When a column in the dataset feeds answers to my model, causing it to overfit.

What is data leakage?

400

This is the format in which ED shows scoring code to users.

What is R code?

500

The type of categorical variable that a Yelp rating is.

What is an ordinal variable?

500

A regression metric that weighs all errors equally.

What is Mean Absolute Error (MAE)?

500

Split your data into 5 segments and hold out each segment once for validation.

What is 5-fold (n-fold) cross validation? 
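
The splitting step can be sketched as follows. This is an illustrative index-based version with no shuffling; the generator name `n_fold_indices` is an assumption, and in practice the records are usually shuffled first.

```python
def n_fold_indices(n_records, n_folds=5):
    """Yield (training, validation) index lists, holding out each fold once."""
    indices = list(range(n_records))
    fold_size, remainder = divmod(n_records, n_folds)
    folds, start = [], 0
    for i in range(n_folds):
        # Spread any leftover records across the first few folds.
        end = start + fold_size + (1 if i < remainder else 0)
        folds.append(indices[start:end])
        start = end
    for i, validation in enumerate(folds):
        training = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield training, validation

for training, validation in n_fold_indices(10, n_folds=5):
    print("train:", training, "validate:", validation)
```

Every record lands in the validation set exactly once, so the averaged validation score uses all of the data.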

500

When evaluating a regression problem, which metric is more sensitive to outliers?

What is Mean Squared Error (MSE)?
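
A small worked comparison shows why squaring makes the metric more sensitive to outliers than Mean Absolute Error. The helper names and the sample predictions are made up for illustration.

```python
def mae(y_true, y_pred):
    """Mean Absolute Error: average of |actual - predicted|."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

def mse(y_true, y_pred):
    """Mean Squared Error: average of (actual - predicted) squared."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

y_true = [10, 12, 11, 13]
steady = [11, 11, 12, 12]        # every prediction is off by 1
with_outlier = [11, 11, 12, 23]  # last prediction is off by 10

print(mae(y_true, steady), mse(y_true, steady))              # 1.0 1.0
print(mae(y_true, with_outlier), mse(y_true, with_outlier))  # 3.25 25.75
```

The single large error raises MAE from 1.0 to 3.25 but raises MSE from 1.0 to 25.75, because the error of 10 contributes 100 once squared.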

500

The number of folds used in ED's cross validation. 

What is four?