What is a ROC AUC score?
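A score that summarizes how well a binary classifier separates the two classes across all decision thresholds. It is the area under the ROC curve: 1.0 is a perfect ranking, 0.5 is no better than random guessing.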
ROC = Receiver Operating Characteristic.
We draw the ROC curve to visualize the performance of a binary classifier. The x axis shows the false positive rate (1 - specificity), the y axis shows the true positive rate (sensitivity, also called recall).
By default, a predicted probability of 0.5 or less gets label x and anything over 0.5 gets label y. Each point on the ROC curve corresponds to a different choice of that threshold.
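A minimal sketch of computing the curve and the score with scikit-learn. The labels and probabilities are made up for illustration.

```python
import numpy as np
from sklearn.metrics import roc_auc_score, roc_curve

# Hypothetical true labels and predicted probabilities for the positive class
y_true = np.array([0, 0, 1, 1])
y_score = np.array([0.1, 0.4, 0.35, 0.8])

# fpr is the x axis, tpr is the y axis; one point per candidate threshold
fpr, tpr, thresholds = roc_curve(y_true, y_score)

# Area under that curve
auc = roc_auc_score(y_true, y_score)
print(auc)  # 0.75: three of the four positive/negative pairs are ranked correctly
```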
What can we do with outliers?
Remove them based on a criterion we set for the dataset (ex. a percentile cutoff or a domain-knowledge limit)
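One possible criterion, sketched with pandas: the 1.5 * IQR rule. The "price" column and the 1.5 multiplier are assumptions for illustration, not something these notes prescribe.

```python
import pandas as pd

# Hypothetical column with one obvious outlier
df = pd.DataFrame({"price": [10, 12, 11, 13, 12, 11, 500]})

# Assumed criterion: keep values within 1.5 * IQR of the quartiles
q1, q3 = df["price"].quantile([0.25, 0.75])
iqr = q3 - q1
mask = df["price"].between(q1 - 1.5 * iqr, q3 + 1.5 * iqr)

df_clean = df[mask]
print(len(df), len(df_clean))  # 7 6
```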
What are some metrics to test our model in a classification problem?
confusion matrix, accuracy, precision, recall, ROC-AUC, etc.
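A quick sketch of computing a few of these with scikit-learn, using made-up labels and predictions.

```python
from sklearn.metrics import (accuracy_score, confusion_matrix,
                             precision_score, recall_score)

# Hypothetical true labels and model predictions
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

# Rows are true classes, columns are predicted classes: [[TN, FP], [FN, TP]]
cm = confusion_matrix(y_true, y_pred)

accuracy = accuracy_score(y_true, y_pred)    # (TP + TN) / all rows
precision = precision_score(y_true, y_pred)  # TP / (TP + FP)
recall = recall_score(y_true, y_pred)        # TP / (TP + FN)
print(cm, accuracy, precision, recall)
```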
What are some metrics to see how our model is performing within a regression problem?
MAE, R2, etc
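A small sketch of MAE and R2 with scikit-learn on made-up values.

```python
from sklearn.metrics import mean_absolute_error, r2_score

# Hypothetical targets and predictions
y_true = [3.0, 5.0, 7.0, 9.0]
y_pred = [2.5, 5.0, 8.0, 9.5]

# MAE: average size of the errors, in the units of the target
mae = mean_absolute_error(y_true, y_pred)

# R2: share of the target's variance explained by the predictions
r2 = r2_score(y_true, y_pred)
print(mae, r2)
```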
How do we decide if we have a regression or classification problem?
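Look at the target values: a continuous numeric target means regression, a target made of a small set of discrete classes means classification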
What is an ensemble method (like random forest)?
A machine learning technique that combines several base models in order to produce one stronger predictive model.
What can we do with a column with nans? At what point would you throw out the column?
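It depends on the rest of the values: if only a few are missing, we can impute them (ex. with the mean, median, or mode). If most of the column is missing, it may carry too little information to keep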
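A sketch of checking how much of each column is missing before deciding. The 60% cutoff and the column names are assumptions for illustration; the right cutoff depends on the dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical data: "age" is mostly present, "notes" is mostly missing
df = pd.DataFrame({
    "age": [25, np.nan, 31, 40, 29],
    "notes": [np.nan, np.nan, np.nan, "ok", np.nan],
})

# Share of missing values per column
nan_share = df.isna().mean()

# Assumed cutoff: drop columns that are more than 60% missing
cutoff = 0.6
df = df.drop(columns=nan_share[nan_share > cutoff].index)

# Impute what remains with the median
df["age"] = df["age"].fillna(df["age"].median())
```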
When is a confusion matrix useful?
Lets you see where your model is underperforming, ex. which classes it confuses and whether its errors are mostly false positives or false negatives
Why are model coefficients useful?
Lets us see how a change in a predictor variable affects the target variable
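A minimal sketch with scikit-learn's LinearRegression. The data is generated from a known formula (an assumption for illustration) so the coefficients are easy to check.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical data generated exactly as y = 3*x1 - 2*x2 + 5
X = np.array([[1, 2], [2, 1], [3, 4], [4, 3], [5, 5]], dtype=float)
y = 3 * X[:, 0] - 2 * X[:, 1] + 5

model = LinearRegression().fit(X, y)

# Each coefficient is the change in the target for a one-unit change
# in that predictor, holding the other predictor fixed
print(model.coef_, model.intercept_)
```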
What do we do once we've identified the target and the type of problem we have?
Get a baseline!!
What is bagging?
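A technique of ensemble learning (bootstrap aggregating) that fits base models on bootstrap samples of the data, which can be done in parallel, then takes the mean of all predictions (or a majority vote for classification)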
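A hand-rolled sketch of the idea using decision trees as the base models. The synthetic data and the choice of 25 trees are assumptions; in practice a random forest or scikit-learn's bagging estimators do this for you.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Synthetic noisy data, for illustration only
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(200, 1))
y = np.sin(X[:, 0]) + rng.normal(0, 0.3, size=200)

# Fit each tree on its own bootstrap sample (rows drawn with replacement).
# The fits are independent, so they could run in parallel.
trees = []
for _ in range(25):
    idx = rng.integers(0, len(X), size=len(X))
    trees.append(DecisionTreeRegressor(random_state=0).fit(X[idx], y[idx]))

# The bagged prediction is the mean of the individual tree predictions
X_new = np.array([[2.0], [5.0]])
bagged_pred = np.mean([tree.predict(X_new) for tree in trees], axis=0)
```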
How to choose between ordinal encoding and one-hot encoding?
Use ordinal encoding if the feature is innately ordered (ex. small < medium < large), otherwise use one-hot encoding
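A small pandas sketch of both encodings. The "size" and "color" columns and the chosen order are made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "size": ["small", "large", "medium", "small"],  # innately ordered
    "color": ["red", "blue", "green", "red"],       # no natural order
})

# Ordinal encoding: map each level to its rank
size_order = {"small": 0, "medium": 1, "large": 2}
df["size_encoded"] = df["size"].map(size_order)

# One-hot encoding: one indicator column per category
df = pd.get_dummies(df, columns=["color"])
print(df.columns.tolist())
```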
How is looking at feature importances useful?
Change which features are used, look for leakage, etc
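A sketch of reading importances off a random forest. The data is synthetic and only the first column drives the label, so it should dominate. One feature carrying nearly all the importance is also the kind of pattern worth checking for leakage.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 3))
# Only the first column determines the label; the other two are noise
y = (X[:, 0] > 0).astype(int)

model = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, y)

# One value per feature; the values sum to 1
importances = model.feature_importances_
print(importances)
```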
Why is a PDP useful?
A partial dependence plot (PDP) lets us see how a change in a predictor variable affects the model's predicted target, averaged over the rest of the data. A useful method of peering into what can often be black-box models
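A hand-computed sketch of what a one-feature PDP measures. The helper partial_dependence_1d is hypothetical, written here for illustration, and the data is synthetic. scikit-learn also ships partial dependence tools in sklearn.inspection.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Synthetic data: the target depends on feature 0 only
rng = np.random.default_rng(0)
X = rng.uniform(0, 1, size=(300, 2))
y = 4 * X[:, 0] + rng.normal(0, 0.1, size=300)

model = RandomForestRegressor(n_estimators=50, random_state=0).fit(X, y)

def partial_dependence_1d(model, X, feature, grid):
    """Average prediction when `feature` is forced to each grid value."""
    averages = []
    for value in grid:
        X_mod = X.copy()
        X_mod[:, feature] = value  # overwrite one column, keep the rest
        averages.append(model.predict(X_mod).mean())
    return np.array(averages)

grid = np.linspace(0.1, 0.9, 5)
pdp = partial_dependence_1d(model, X, feature=0, grid=grid)
print(pdp)  # plotting grid against pdp gives the PDP for feature 0
```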
How do we get a baseline in regression, classification problems?
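Regression: predict the mean of the target for every row, then compute the MAE of that guess. Can use mean() and abs(), etc.
Classification: predict the majority class for every row; its share of the rows is the baseline accuracy. Can use value_counts(), mode(), etc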
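Both baselines sketched in pandas, on made-up targets.

```python
import pandas as pd

# Regression baseline: always guess the mean
y_reg = pd.Series([10.0, 12.0, 14.0, 20.0])
baseline_pred = y_reg.mean()
baseline_mae = (y_reg - baseline_pred).abs().mean()

# Classification baseline: always guess the majority class
y_clf = pd.Series(["no", "no", "yes", "no", "yes"])
majority_class = y_clf.mode()[0]
baseline_accuracy = y_clf.value_counts(normalize=True).max()

print(baseline_mae, majority_class, baseline_accuracy)  # 3.0 no 0.6
```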
What is boosting?
A technique of ensemble learning that takes an iterative approach, fitting models sequentially and converting weak learners into a strong learner by adjusting the weight of each observation based on the previous model's errors
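A sketch comparing one weak learner to a boosted ensemble, using AdaBoost as one example of boosting. The synthetic dataset and n_estimators=100 are assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary classification data
X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# One weak learner: a depth-1 tree ("stump")
stump = DecisionTreeClassifier(max_depth=1, random_state=0).fit(X_train, y_train)

# AdaBoost fits weak learners one after another, giving more weight
# to the rows the previous learners got wrong
boosted = AdaBoostClassifier(n_estimators=100, random_state=0).fit(X_train, y_train)

stump_acc = stump.score(X_test, y_test)
boosted_acc = boosted.score(X_test, y_test)
print(stump_acc, boosted_acc)
```

The boosted model will typically score higher than the single stump, though that is not guaranteed on every dataset.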
What is something you can do if a variable is skewed?
Log transformation
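A sketch of the effect on a right-skewed variable. The "income" values are synthetic (drawn from a lognormal distribution) for illustration.

```python
import numpy as np
import pandas as pd

# Synthetic right-skewed variable
rng = np.random.default_rng(0)
income = pd.Series(rng.lognormal(mean=10, sigma=1, size=1000))

# log1p computes log(1 + x), so it also handles zeros safely
log_income = np.log1p(income)

# Skew near 0 means roughly symmetric
print(income.skew(), log_income.skew())

# np.expm1 reverses the transformation
restored = np.expm1(log_income)
```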
We want to increase the proportion of positive identifications that were actually correct (precision). A way to do this?
class_weight parameter
change threshold
ex.
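# clf is any scikit-learn classifier that supports predict_proba
model = clf.fit(df[features], df[label])
df["proba"] = model.predict_proba(df[features])[:, 1]
# Raising the threshold above the default 0.5 usually increases precision (and lowers recall)
threshold = 0.6
df["pred"] = (df["proba"] >= threshold).astype(float)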
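To see how the threshold trades precision against recall, here is a sketch that scores the same made-up probabilities at three thresholds.

```python
import numpy as np
from sklearn.metrics import precision_score, recall_score

# Hypothetical true labels and predicted probabilities
y_true = np.array([0, 0, 0, 1, 0, 1, 1, 1])
proba = np.array([0.1, 0.3, 0.45, 0.5, 0.55, 0.6, 0.8, 0.9])

results = {}
for threshold in (0.4, 0.5, 0.6):
    pred = (proba >= threshold).astype(int)
    results[threshold] = (precision_score(y_true, pred),
                          recall_score(y_true, pred))
    print(threshold, results[threshold])
```

On this toy data precision climbs from about 0.67 to 0.8 to 1.0 as the threshold rises, while recall drops from 1.0 to 0.75 at the highest threshold.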
How can Shapley be useful?
Zoom in on a single row's prediction to see what is influencing that prediction
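A brute-force sketch of exact Shapley values for one row of a toy model. Everything here is an assumption for illustration: the shapley_values helper is hypothetical, "missing" features are replaced by the background mean (a simplification), and the toy predict function stands in for a real model. Libraries such as shap use faster approximations.

```python
from itertools import combinations
from math import factorial

import numpy as np

def shapley_values(predict, x, background):
    """Exact Shapley values for row x; absent features take the background mean."""
    n = len(x)
    baseline = background.mean(axis=0)

    def value(subset):
        # Prediction when only the features in `subset` come from x
        z = baseline.copy()
        z[list(subset)] = x[list(subset)]
        return predict(z)

    phi = np.zeros(n)
    for i in range(n):
        others = [j for j in range(n) if j != i]
        for size in range(n):
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                phi[i] += weight * (value(subset + (i,)) - value(subset))
    return phi

def predict(z):
    # Toy model with an interaction between features 1 and 2
    return 2 * z[0] + 3 * z[1] * z[2]

background = np.array([[0.0, 1.0, 2.0], [2.0, 3.0, 4.0]])
x = np.array([3.0, 1.0, 5.0])

phi = shapley_values(predict, x, background)
print(phi)  # per-feature contributions to this one prediction
```

The contributions add up to the gap between this row's prediction and the prediction at the background mean, which is what makes them handy for explaining a single row.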
What kind of problem do you prefer, classification or regression and why?
Cool, you probably prefer the problem you're more comfortable with, so study up on the other type! :)
What is linear regression?
A statistical approach to determine the line of best fit to represent the relationship between a dependent variable (the target) and one or more independent variables (the predictors).
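A minimal sketch of fitting a line of best fit with one predictor, using NumPy's least-squares polynomial fit on made-up points.

```python
import numpy as np

# Hypothetical points that sit close to a straight line
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])

# deg=1 fits y = slope * x + intercept by least squares
slope, intercept = np.polyfit(x, y, deg=1)
print(slope, intercept)  # about 1.99 and 0.05

# Predict for a new x value
y_new = slope * 6.0 + intercept
```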
What can we do with a high cardinality (categorical) feature?
- If many values appear only once (a value_counts() of 1), we can keep the 10 most frequent values and change everything else to "other" (ex. countries column)
- We can drop the column (ex. address, since we don't know NLP yet)
- We can reorganize the column by feature engineering some of the values into a new column
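The first option above can be sketched in pandas. The toy "country" column is made up, and top_n is set to 2 rather than 10 only to keep the example small.

```python
import pandas as pd

df = pd.DataFrame({"country": ["US", "US", "CA", "US", "CA", "FR", "DE", "JP"]})

# Keep the most frequent values, replace everything else with "other"
top_n = 2
top_values = df["country"].value_counts().nlargest(top_n).index
df["country"] = df["country"].where(df["country"].isin(top_values), "other")

print(df["country"].tolist())
```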
An example of how log transformation can be useful in modeling?
Why is validation important and what are some methods for validating our models?
What are signs of leakage?
Validation accuracy that is way too high, or feature importances dominated by one feature that is extremely correlated with the target. Consider asking yourself if you would have that information before stepping into the situation
ex. Predicting if a student will fail an exam: a variable called "score_on_each_section" or "exam_mistakes" would not be known before the exam, so it leaks the target