Linear Models + Linear Algebra Review Jeopardy Template

Error Metrics + Random

Regression

Linear Alg

EDA

Modeling + 1 random

100

What is a null hypothesis in terms of a relationship between phenomena?

Null hypothesis: a general statement or default position that there is no relationship between two measured phenomena, or no association among groups. It is the "boring" choice: nothing special is happening.

100

What do X and y refer to?

X - feature matrix, independent variables
y - the target, response variable, predicted variable, outcome variable

100

What is variance?

The variability of a model's predictions for a given data point, that is, how much the predictions spread out when the model is trained on different samples of the data. High variance means the model is sensitive to small fluctuations in the training set.

High Variance = Overfitting

100

What is a categorical variable?

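A variable that takes one of a limited set of labels or categories (for example, color or city). It is usually non-numeric and stored as a string, though categories can also be coded as numbers.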

100

What is overfitting?

A modeling error that occurs when a function is fit too closely to a limited set of data points. Overfitting generally takes the form of an overly complex model that explains idiosyncrasies in the data under study rather than the underlying pattern.

200

What is statistical significance?

A result is statistically significant when it would be very unlikely to occur if the null hypothesis were true (usually indicated by a p-value < 0.05).

200

What is logistic regression?

A statistical method for modeling a dataset in which one or more independent variables are used to predict an outcome. The outcome is measured with a dichotomous variable (one with only two possible outcomes).
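As a rough illustration of how a two-outcome prediction can come out of a linear combination, here is a minimal sketch with a single feature. The names `sigmoid` and `predict_proba` and the intercept and coefficient values are made up for the example; nothing here is fitted from data.

```python
import math

def sigmoid(z):
    """Squash any real number into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def predict_proba(x, intercept, coef):
    """Probability of the positive class for one feature value x."""
    return sigmoid(intercept + coef * x)

# Hypothetical coefficients chosen so the linear part equals 0 at x = 2.
p = predict_proba(2.0, intercept=-1.0, coef=0.5)
label = int(p >= 0.5)  # threshold the probability to get a 0/1 outcome
```

The 0.5 threshold is a common default, not a requirement.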

200

What is the coefficient of determination (R²)?

R² is a measure of how closely the data points fit the regression line.

It tells us how well the regression line predicts the actual values.

The percentage of variation in 'y' that is accounted for by its regression on 'X'
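One way to make the "percentage of variation accounted for" concrete is to compute R² by hand as 1 minus the ratio of residual to total sum of squares. The sketch below uses plain Python and made-up numbers; `r_squared` is an illustrative helper, not a library function.

```python
def r_squared(y_true, y_pred):
    """Return 1 - SS_res / SS_tot for two equal-length sequences."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((yt - yp) ** 2 for yt, yp in zip(y_true, y_pred))
    ss_tot = sum((yt - mean_y) ** 2 for yt in y_true)
    return 1 - ss_res / ss_tot

y_true = [3.0, 5.0, 7.0, 9.0]
y_pred = [2.5, 5.5, 6.5, 9.5]          # made-up model predictions
r2 = r_squared(y_true, y_pred)

# Always predicting the mean of y explains none of the variation.
r2_mean_only = r_squared(y_true, [6.0] * 4)
```

Here `r2` comes out to 0.95, while predicting the mean everywhere gives 0.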

200

What is a high cardinality variable?

At what point would you consider dropping such a variable in regression? At what nunique()?

A variable with a very high number of unique values.

50+ unique values (a rule of thumb, not a hard cutoff)

200

What is a confidence interval, given a sample?

Given a sample, a confidence interval is a calculated range that is likely to contain an unknown population parameter (such as the mean). The 95% describes the procedure: intervals built this way capture the true parameter about 95% of the time.

If you took 100 samples and built a 95% confidence interval from each, about 95 of those intervals would contain the true population mean.
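A minimal sketch of computing such an interval for a mean, assuming a normal approximation (for small samples a t-distribution is usually the better choice). The helper name `mean_confidence_interval` and the sample values are made up for illustration.

```python
import math
from statistics import NormalDist, mean, stdev

def mean_confidence_interval(sample, confidence=0.95):
    """Normal-approximation confidence interval for the population mean."""
    n = len(sample)
    m = mean(sample)
    standard_error = stdev(sample) / math.sqrt(n)
    # z is about 1.96 for a 95% interval.
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    return m - z * standard_error, m + z * standard_error

sample = [4.8, 5.1, 5.0, 4.9, 5.3, 5.2, 4.7, 5.0]  # made-up measurements
low, high = mean_confidence_interval(sample)
```

The interval is centered on the sample mean and widens as the confidence level rises or the sample shrinks.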

300

Mean Absolute Error (MAE)

Measures the average magnitude of the errors in a set of forecasts, without considering their direction. It measures accuracy for continuous variables.
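A small hand-rolled version, to show that direction is discarded by taking absolute values. The function name and numbers are illustrative only.

```python
def mean_absolute_error(y_true, y_pred):
    """Average of |actual - predicted| over all points."""
    return sum(abs(yt - yp) for yt, yp in zip(y_true, y_pred)) / len(y_true)

# Errors are -1, 0, and +2; their magnitudes average to 1.0.
mae = mean_absolute_error([3.0, 5.0, 7.0], [2.0, 5.0, 9.0])
```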

300

Linear Regression

Linear Regression is a statistical model that seeks to describe the relationship between some y variable and one or more x variables ("Line of Best Fit").

300

What is the covariance?

Covariance is a measure of how changes in one variable are associated with changes in a second variable.
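To see the sign of the covariance track the direction of the relationship, here is a sketch of the sample covariance (dividing by n - 1) in plain Python. `sample_covariance` and the data are made up for the example.

```python
def sample_covariance(x, y):
    """Sample covariance: average product of deviations, using n - 1."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    products = ((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    return sum(products) / (len(x) - 1)

x = [1.0, 2.0, 3.0, 4.0]
y_up = [2.0, 4.0, 6.0, 8.0]     # rises with x
y_down = [8.0, 6.0, 4.0, 2.0]   # falls as x rises

cov_up = sample_covariance(x, y_up)      # positive
cov_down = sample_covariance(x, y_down)  # negative
```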

300

What is a method to impute?

df.fillna(0)  # fill all nulls with a single value

df.ffill()  # forward-fill from the previous row (df.bfill() back-fills); the older spelling df.fillna(method='ffill') is deprecated in recent pandas versions

values = {'A': 0, 'B': 1, 'C': 2, 'D': 3}
df.fillna(value=values)  # fill each column with a different value

df.fillna(value=values, limit=1)  # fill only the first null in each column

300

What is a baseline?

A baseline is a very basic model or solution used to create predictions for a dataset. You can use these predictions to measure the baseline's performance (e.g., accuracy). This metric then becomes the benchmark against which you compare any other machine learning algorithm.
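For a regression problem, one common baseline is to predict the training mean for every row. The sketch below assumes that choice and scores it with mean absolute error; the data is made up.

```python
y_train = [10.0, 12.0, 14.0]
y_test = [11.0, 15.0]

# Baseline "model": always predict the mean of the training target.
baseline = sum(y_train) / len(y_train)
baseline_preds = [baseline] * len(y_test)

# Score to beat: the baseline's mean absolute error on the test target.
baseline_mae = sum(abs(yt - yp) for yt, yp in zip(y_test, baseline_preds)) / len(y_test)
```

Any real model should do better than `baseline_mae` to be worth keeping.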

400

Mean Squared Error (MSE)

Measures the average of the squares of the errors, that is, the average squared difference between the estimated values and the actual values. The MSE is a measure of the quality of an estimator: it is always non-negative, and values closer to zero are better.
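A hand-rolled sketch of the calculation, with an illustrative function name and made-up numbers:

```python
def mean_squared_error(y_true, y_pred):
    """Average of (actual - predicted) squared over all points."""
    return sum((yt - yp) ** 2 for yt, yp in zip(y_true, y_pred)) / len(y_true)

# Errors are -1, 0, and +2; their squares are 1, 0, and 4.
mse = mean_squared_error([3.0, 5.0, 7.0], [2.0, 5.0, 9.0])
```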

400

Multiple Regression

A linear regression model that involves multiple x variables

400

What is the norm of a vector?

The norm, or magnitude, of a vector is its length. If you treat a two-dimensional vector as the hypotenuse of a right triangle, the Pythagorean theorem gives its length. The norm generalizes that formula to n-dimensional space: the square root of the sum of the squared components.
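The generalization can be sketched in a few lines of plain Python; `norm` here is an illustrative helper for the Euclidean norm.

```python
import math

def norm(v):
    """Euclidean length: square root of the sum of squared components."""
    return math.sqrt(sum(c * c for c in v))

length_2d = norm([3.0, 4.0])        # the classic 3-4-5 right triangle
length_3d = norm([1.0, 2.0, 2.0])   # same formula, one more dimension
```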

400

What is one hot encoding?

This is where the categorical (or integer-encoded) variable is removed and a new binary variable is added for each unique value.
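A minimal sketch of the idea in plain Python, without any encoding library. The helper `one_hot` and the `is_` column prefix are made up for the example.

```python
def one_hot(values):
    """Return one dict per row with a 0/1 column for each unique value."""
    categories = sorted(set(values))
    return [{f"is_{c}": int(v == c) for c in categories} for v in values]

colors = ["red", "green", "red", "blue"]
rows = one_hot(colors)
# Each row has exactly one column set to 1.
```

Note that this produces one new column per unique value, which is why high cardinality variables can blow up the feature count.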

400

What is model validation?

The process where a trained model is evaluated with a testing data set. The testing data set is a separate portion of the same data set from which the training set is derived.
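Holding out a separate portion of the same dataset can be sketched as a shuffled split. `split_rows`, the 25% test fraction, and the seed are assumptions for the example, not a specific library's API.

```python
import random

def split_rows(rows, test_fraction=0.25, seed=42):
    """Shuffle a copy of rows and split it into (train, test)."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)  # seeded for reproducibility
    n_test = int(round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]

rows = list(range(20))  # stand-in for 20 observations
train, test = split_rows(rows)
```

The model is fit on `train` only, then evaluated on `test`, which it has never seen.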

500

Root Mean Squared Error (RMSE)

When is RMSE most useful?

The RMSE is a quadratic scoring rule that measures the average magnitude of the error. The differences between forecast and corresponding observed values are each squared and then averaged over the sample. Finally, the square root of the average is taken.

When is it most useful? Since the errors are squared before they are averaged, the RMSE gives a relatively high weight to large errors. This means the RMSE is most useful when large errors are particularly undesirable.
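The extra weight on large errors shows up when two sets of errors share the same MAE but differ in RMSE. The sketch below uses made-up error values and illustrative helper names.

```python
import math

def mae(errors):
    """Mean absolute error from a list of raw errors."""
    return sum(abs(e) for e in errors) / len(errors)

def rmse(errors):
    """Root mean squared error from a list of raw errors."""
    return math.sqrt(sum(e * e for e in errors) / len(errors))

even_errors = [2.0, 2.0, 2.0, 2.0]      # error spread evenly
one_big_error = [0.0, 0.0, 0.0, 8.0]    # same total error, one large miss

# MAE is 2.0 for both sets; RMSE is 2.0 for the first and 4.0 for the second.
```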

500

What is ridge regression and when is it used?

Ridge Regression is a technique for analyzing multiple regression data that suffer from multicollinearity.

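When multicollinearity occurs, least squares estimates are unbiased, but their variances are large, so they may be far from the true value. Ridge regression adds a penalty on the size of the coefficients, accepting a small amount of bias in exchange for lower variance.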
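Multicollinearity needs at least two features, so the one-feature sketch below does not reproduce that situation. It only illustrates how a ridge penalty (called `alpha` here, an assumed name) shrinks a coefficient toward zero, using the closed form for a single centered feature.

```python
def ridge_slope(x, y, alpha):
    """Slope minimizing squared error plus alpha * slope**2 (centered data)."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    return sxy / (sxx + alpha)

x = [1.0, 2.0, 3.0, 4.0]
y = [2.0, 4.0, 6.0, 8.0]

ols_slope = ridge_slope(x, y, alpha=0.0)     # no penalty: ordinary least squares
shrunk_slope = ridge_slope(x, y, alpha=5.0)  # penalty pulls the slope toward 0
```

With `alpha=0` the slope is the ordinary least squares value of 2.0; with `alpha=5` it shrinks to 1.0.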

500

What is PCA?

A feature extraction technique that transforms a high dimensional dataset into a new lower dimensional dataset while preserving as much of the variance (information) in the original data as possible.

500

What does SelectKBest do, on a high level?

Selects the k features with the highest scores under a chosen scoring function.
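The "score, rank, keep the top k" idea can be sketched in plain Python. This is not scikit-learn's implementation: the scoring function here (absolute Pearson correlation with the target), the helper names, and the data are all assumptions chosen for illustration.

```python
import math

def pearson(x, y):
    """Pearson correlation between two equal-length sequences."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def select_top_k(features, target, k):
    """Score every feature, then return the names of the k best."""
    scores = {name: abs(pearson(col, target)) for name, col in features.items()}
    return sorted(scores, key=scores.get, reverse=True)[:k]

features = {  # made-up columns
    "sqft": [1.0, 2.0, 3.0, 4.0],
    "noise": [2.0, 1.0, 2.0, 1.0],
    "age": [4.0, 3.0, 2.5, 1.0],
}
target = [10.0, 20.0, 30.0, 40.0]
selected = select_top_k(features, target, k=2)
```

Here `noise` is dropped because it has the weakest relationship with the target.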

500

What is the main purpose of using the testing data?

The main purpose of using the testing data set is to test the generalization ability of a trained model.