Applied Modeling Sprint Jeopardy Template

Definitions

Preprocessing

Metrics

ML Flow

100

What is a ROC AUC score?

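A score that summarizes how well a binary classifier ranks positives above negatives across all decision thresholds: 1.0 is a perfect ranking, 0.5 is no better than random guessing. It is not the same thing as accuracy.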

ROC = Receiver Operating Characteristic.

We draw the ROC curve to visualize the performance of a binary classifier. The x axis shows the false positive rate (1 - specificity), the y axis shows the true positive rate (sensitivity, also called recall).

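Each point on the curve corresponds to one threshold on the predicted probability. With a threshold of 0.5, for example, a predicted probability of 0.5 or less gets label x and anything over 0.5 gets label y.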
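One way to build intuition for the number: ROC AUC equals the probability that a randomly chosen positive example gets a higher predicted score than a randomly chosen negative one. The sketch below computes it that way in plain Python. The function name and the toy data are made up for illustration; in practice a library routine such as scikit-learn's roc_auc_score would be the usual choice.

```python
def roc_auc(y_true, y_score):
    """AUC as the share of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a win.
    Assumes labels are 0/1 and both classes are present.
    """
    pos = [s for y, s in zip(y_true, y_score) if y == 1]
    neg = [s for y, s in zip(y_true, y_score) if y == 0]
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))


# Toy labels and predicted probabilities (illustrative only)
y_true = [0, 0, 1, 1]
y_score = [0.1, 0.4, 0.35, 0.8]
auc = roc_auc(y_true, y_score)
print(auc)  # 3 of the 4 positive/negative pairs are ordered correctly -> 0.75
```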

100

What can we do with outliers?

Remove them based on criteria we set for the dataset, such as a percentile cutoff
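As one possible criterion (an assumption for this example, not the only choice), the sketch below drops values that fall more than 1.5 interquartile ranges outside the quartiles. The function name and the prices list are hypothetical.

```python
import statistics


def remove_outliers_iqr(values, k=1.5):
    """Keep only values within k * IQR of the first and third quartiles."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if low <= v <= high]


# Toy data with one obvious outlier
prices = [10, 12, 11, 13, 12, 11, 300]
cleaned = remove_outliers_iqr(prices)
print(cleaned)  # the value 300 is dropped
```

Whether 1.5 is the right multiplier depends on the dataset, so it is worth looking at what gets removed before committing to a cutoff.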

100

What are some metrics to test our model in a classification problem?

confusion matrix, accuracy, precision, recall, ROC-AUC, etc.
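To make the relationship between these metrics concrete, here is a small from-scratch sketch that counts the four confusion matrix cells for a binary problem and derives accuracy, precision, and recall from them. The labels are toy data and the helper name is made up.

```python
def confusion_counts(y_true, y_pred):
    """Return (tp, fp, fn, tn) for binary labels coded as 0/1."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    return tp, fp, fn, tn


y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 1]

tp, fp, fn, tn = confusion_counts(y_true, y_pred)
accuracy = (tp + tn) / (tp + fp + fn + tn)  # share of all predictions that are right
precision = tp / (tp + fp)                  # share of predicted positives that are right
recall = tp / (tp + fn)                     # share of actual positives that were found

print(tp, fp, fn, tn)               # 3 2 1 2
print(accuracy, precision, recall)  # 0.625 0.6 0.75
```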

100

What are some metrics to see how our model is performing within a regression problem?

MAE, R², etc
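For reference, both metrics are short enough to write by hand. The sketch below uses toy numbers and made-up function names: MAE is the average absolute error, and R² compares the model's squared error to that of always predicting the mean.

```python
def mean_absolute_error(y_true, y_pred):
    """Average size of the prediction errors, in the target's own units."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def r_squared(y_true, y_pred):
    """1 minus (model squared error / squared error of predicting the mean)."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot


y_true = [3.0, 5.0, 7.0, 9.0]
y_pred = [2.5, 5.0, 8.0, 9.5]

mae = mean_absolute_error(y_true, y_pred)
r2 = r_squared(y_true, y_pred)
print(mae)  # 0.5
print(r2)   # 0.925
```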

100

How do we decide if we have a regression or classification problem?

Look at the target values

200

What is an ensemble method (like random forest)?

Machine learning techniques that combine several base models to produce one stronger predictive model.

200

What can we do with a column with nans? At what point would you throw out the column?

remove the feature;
fill the NaN values with the most frequent value of the feature;
fill the NaN values with a new class (for instance, nan_value);
the most complex approach - build a separate model to predict the missing values of the feature

Whether to throw out the column depends on the rest of the values: how much of the column is missing and how informative the remaining values are
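The simpler options above can be sketched in a few lines. This example assumes missing values are represented as None in a plain Python list (a stand-in for NaN in a DataFrame column); the helper names and the colors data are hypothetical.

```python
from collections import Counter


def missing_fraction(column):
    """Share of entries that are missing; useful when deciding whether to drop."""
    return sum(v is None for v in column) / len(column)


def fill_most_frequent(column):
    """Replace missing entries with the most common observed value."""
    observed = [v for v in column if v is not None]
    most_common = Counter(observed).most_common(1)[0][0]
    return [most_common if v is None else v for v in column]


def fill_new_class(column, label="nan_value"):
    """Replace missing entries with their own category."""
    return [label if v is None else v for v in column]


colors = ["red", None, "blue", "red", None, "red"]

frac = missing_fraction(colors)
by_mode = fill_most_frequent(colors)
by_class = fill_new_class(colors)

print(round(frac, 2))  # 0.33
print(by_mode)         # ['red', 'red', 'blue', 'red', 'red', 'red']
print(by_class)        # ['red', 'nan_value', 'blue', 'red', 'nan_value', 'red']
```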

200

When is a confusion matrix useful?

Lets you see where your model is underperforming: which classes it confuses, and whether its errors are false positives or false negatives

200

Why are model coefficients useful?

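In a linear model, they let us see the direction and size of the change in the target variable for a one-unit change in a predictor variable, holding the other predictors constant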

200

What do we do once we've identified the target and the type of problem we have?

Get a baseline!!

300

What is bagging?

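A technique of ensemble learning (short for bootstrap aggregating) that fits base models in parallel on random subsets of the data sampled with replacement, then takes the mean of all predictions (or a majority vote for classification)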
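A minimal sketch of the idea, under some loud assumptions: the base model here is a toy 1-nearest-neighbor regressor chosen only because it fits in a few lines, and the data and function names are invented. Random forests apply the same recipe with decision trees as the base model.

```python
import random


def fit_1nn(xs, ys):
    """Toy base model: predict the y of the closest training x."""
    pairs = list(zip(xs, ys))

    def predict(x):
        return min(pairs, key=lambda pair: abs(pair[0] - x))[1]

    return predict


def bagged_predict(xs, ys, x_new, n_models=25, seed=0):
    """Fit each base model on a bootstrap sample, then average the predictions."""
    rng = random.Random(seed)
    n = len(xs)
    preds = []
    for _ in range(n_models):
        # Bootstrap sample: draw n row indices with replacement
        idx = [rng.randrange(n) for _ in range(n)]
        model = fit_1nn([xs[i] for i in idx], [ys[i] for i in idx])
        preds.append(model(x_new))
    return sum(preds) / len(preds)


xs = [1, 2, 3, 4, 5, 6]
ys = [1.2, 1.9, 3.2, 3.8, 5.1, 6.0]
prediction = bagged_predict(xs, ys, x_new=3.4)
print(prediction)
```

The models do not depend on each other, which is why they can be fit in parallel.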

300

How to choose between ordinal encoding and one-hot encoding?

If the feature is innately ordered (ex. low < medium < high), ordinal encoding keeps that order; if it is not, use one-hot encoding
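Here is a from-scratch sketch of the two encodings, using invented sizes and colors columns. The helper names are hypothetical; encoder classes in libraries do the same job with more bookkeeping.

```python
def ordinal_encode(values, order):
    """Map each category to its position in an explicit, meaningful order."""
    rank = {category: i for i, category in enumerate(order)}
    return [rank[v] for v in values]


def one_hot_encode(values):
    """One 0/1 column per category (columns in alphabetical order)."""
    categories = sorted(set(values))
    return [[1 if v == c else 0 for c in categories] for v in values]


sizes = ["small", "large", "medium", "small"]  # innately ordered
colors = ["red", "green", "red", "blue"]       # no natural order

encoded_sizes = ordinal_encode(sizes, order=["small", "medium", "large"])
encoded_colors = one_hot_encode(colors)

print(encoded_sizes)   # [0, 2, 1, 0]
print(encoded_colors)  # [[0, 0, 1], [0, 1, 0], [0, 0, 1], [1, 0, 0]]
```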

300

How is looking at feature importances useful?

Change which features are used, look for leakage, etc

300

Why is a PDP useful?

Lets us see how the model's prediction changes, on average, as one predictor variable changes. A useful way of peering into what are often black-box models

300

How do we get a baseline in regression, classification problems?

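Regression: predict the mean of the target for every row and compute the MAE of that guess (can use mean() and abs(), etc.)

Classification: predict the most frequent class for every row and compute its accuracy (can use value_counts(), mode(), etc.)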
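Both baselines fit in a few lines of plain Python. The sketch below uses toy targets and made-up function names; any real model should beat these numbers to be worth keeping.

```python
from collections import Counter
from statistics import mean


def regression_baseline_mae(y):
    """MAE of always guessing the mean of the target."""
    guess = mean(y)
    return mean(abs(v - guess) for v in y)


def classification_baseline_accuracy(y):
    """Accuracy of always guessing the most frequent class."""
    majority_count = Counter(y).most_common(1)[0][1]
    return majority_count / len(y)


prices = [100, 150, 200, 250, 300]
labels = ["no", "no", "no", "yes", "no"]

baseline_mae = regression_baseline_mae(prices)
baseline_acc = classification_baseline_accuracy(labels)

print(baseline_mae)  # 60
print(baseline_acc)  # 0.8
```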

400

What is boosting?

A technique of ensemble learning that takes an iterative approach: models are fit sequentially, and the weight of each observation is adjusted based on the last classification so the next model focuses on the mistakes, converting weak learners into a strong learner

400

What is something you can do if a variable is skewed?

Log transformation

400

We want to increase the proportion of positive identifications that were actually correct (precision). A way to do this?

class_weight parameter

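raise the decision threshold (a threshold above 0.5 usually trades recall for precision)

ex.

# clf is any classifier with predict_proba; df, features, and label are defined elsewhere
model = clf.fit(df[features], df[label])
# predicted probability of the positive class
df["proba"] = model.predict_proba(df[features])[:, 1]
# require more confidence before predicting the positive class
threshold = 0.6
df["pred"] = (df["proba"] >= threshold).astype(float)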
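To see how the threshold moves precision and recall, the sketch below scores the same toy predictions at two cutoffs. The labels, probabilities, and function name are invented for illustration; on real data the trend usually holds but is not guaranteed to be monotonic.

```python
def precision_recall_at(y_true, proba, threshold):
    """Precision and recall when predicting 1 for proba >= threshold."""
    preds = [1 if p >= threshold else 0 for p in proba]
    tp = sum(1 for t, p in zip(y_true, preds) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, preds) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, preds) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


y_true = [0, 0, 1, 0, 1, 1, 0, 1]
proba = [0.1, 0.3, 0.35, 0.55, 0.6, 0.65, 0.75, 0.9]

low = precision_recall_at(y_true, proba, threshold=0.5)
high = precision_recall_at(y_true, proba, threshold=0.8)

print(low)   # (0.6, 0.75)
print(high)  # (1.0, 0.25)
```

The higher cutoff makes every positive prediction correct here, but it finds only one of the four actual positives.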

400

How can Shapley be useful?

Zoom in on a single row's prediction to see what is influencing that prediction

400

What kind of problem do you prefer, classification or regression and why?

Cool, you probably prefer the problem you're more comfortable with, so study up on the other type! :)

500

What is linear regression?

A statistical approach to determine the line of best fit to represent the relationship between one or more independent variables and a dependent variable.
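For the single-predictor case, the line of best fit has a closed-form least squares solution. The sketch below computes the slope and intercept by hand on toy data; the function name is made up.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope = covariance of x and y divided by variance of x
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


xs = [1, 2, 3, 4]
ys = [3, 5, 7, 9]  # exactly y = 2x + 1

slope, intercept = fit_line(xs, ys)
print(slope, intercept)  # 2.0 1.0
```

The slope is the model coefficient: each one-unit increase in x moves the prediction by that amount.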

500

What can we do with a high cardinality (categorical) feature?

- If many values appear only once in value_counts(), we can keep the 10 most frequent values and change everything else to "other" (ex. a countries column)

- We can drop the column (ex. address, since we don't know NLP yet)

- We can reorganize the column by feature engineering some of the values into a new column
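The first option above can be sketched as follows. This version uses collections.Counter in place of a DataFrame's value_counts(), and the function name, the "other" label, and the countries data are assumptions for the example.

```python
from collections import Counter


def keep_top_n(values, n=10, other="other"):
    """Keep the n most frequent categories and lump the rest together."""
    top = {value for value, _ in Counter(values).most_common(n)}
    return [v if v in top else other for v in values]


countries = ["US", "US", "US", "FR", "FR", "DE", "JP", "BR"]
reduced = keep_top_n(countries, n=2)

print(reduced)
# ['US', 'US', 'US', 'FR', 'FR', 'other', 'other', 'other']
```

The set of top categories should be chosen from the training data only and then reused on validation and test data.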

500

An example of how log transformation can be useful in modeling?

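When a variable (often the target, such as a price or income) is right-skewed: taking the log compresses the long tail so the distribution is closer to symmetric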
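A tiny illustration with invented numbers: in a right-skewed variable the mean sits far above the median, and after a log transformation the two line up. Base 10 is used here only to keep the output readable; natural log behaves the same way.

```python
import math
from statistics import mean, median

# Hypothetical right-skewed values: a few very large ones dominate
incomes = [1, 10, 100, 1000, 10000]
logged = [math.log10(v) for v in incomes]

raw_mean, raw_median = mean(incomes), median(incomes)
log_mean, log_median = mean(logged), median(logged)

print(raw_mean, raw_median)  # 2222.2 100
print(log_mean, log_median)  # 2.0 2.0
```

If the target is the variable being logged, predictions come out on the log scale and need to be transformed back before reporting them.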

500

Why is validation important and what are some methods for validating our models?

Performance estimation
- 2-way holdout method (train/test split)
- (Repeated) k-fold cross-validation without independent test set
Model selection (hyperparameter optimization) and performance estimation ← We usually want to do this
- 3-way holdout method (train/validation/test split)
- (Repeated) k-fold cross-validation with independent test set
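As a rough sketch of the 3-way holdout method, the function below shuffles the rows once and slices them into train, validation, and test sets. The 60/20/20 proportions, the seed, and the function name are assumptions for the example, and a random shuffle is only appropriate when the rows are not time-ordered.

```python
import random


def three_way_split(rows, val_frac=0.2, test_frac=0.2, seed=42):
    """Shuffle once, then slice into train / validation / test."""
    rows = list(rows)
    random.Random(seed).shuffle(rows)
    n_test = int(len(rows) * test_frac)
    n_val = int(len(rows) * val_frac)
    test = rows[:n_test]
    val = rows[n_test:n_test + n_val]
    train = rows[n_test + n_val:]
    return train, val, test


train, val, test = three_way_split(range(10))
print(len(train), len(val), len(test))  # 6 2 2
```

Hyperparameters are tuned against the validation set, and the test set is touched once at the end for the final performance estimate.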

500

Signs of leakage

Validation accuracy that is way too high; feature importances dominated by one feature that is extremely correlated with the target. Ask yourself whether you would have that information before stepping into the situation (at prediction time)

Example: predicting if a student will fail an exam

A variable called "score_on_each_section" or "exam_mistakes" would be leakage, since it is only known after the exam has been taken