What is a null hypothesis in terms of a relationship between phenomena?
Null Hypothesis: A general statement or default position that there is no relationship between two measured phenomena, or no association among groups. It is the "boring" choice: nothing special is happening.
What do X and y refer to?
What is variance?
The variability of a model's predictions for a given data point, which tells you how spread out those predictions are. High variance means the model is sensitive to small fluctuations in the training set.
High Variance = Overfitting
What is a categorical variable?
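A variable that takes one of a limited set of categories or labels rather than a numeric quantity; it is often stored as a string.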
What is overfitting?
A modeling error that occurs when a function is fit too closely to a limited set of data points. Overfitting generally takes the form of an overly complex model that explains idiosyncrasies in the data under study.
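A minimal sketch of the idea, assuming made-up data from a noisy straight line (the data, the polynomial degrees, and the helper name `mse` are all illustrative choices). The more flexible polynomial always fits the training points at least as well, which is exactly why training error alone cannot reveal overfitting:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: a noisy straight line, split into train and test points.
x_train = np.linspace(0, 1, 10)
y_train = 2 * x_train + rng.normal(scale=0.2, size=x_train.size)
x_test = np.linspace(0.05, 0.95, 10)
y_test = 2 * x_test + rng.normal(scale=0.2, size=x_test.size)


def mse(coeffs, x, y):
    """Mean squared error of a fitted polynomial on (x, y)."""
    return float(np.mean((np.polyval(coeffs, x) - y) ** 2))


simple = np.polyfit(x_train, y_train, deg=1)    # matches the true relationship
complex_ = np.polyfit(x_train, y_train, deg=6)  # flexible enough to chase noise

train_mse_simple = mse(simple, x_train, y_train)
train_mse_complex = mse(complex_, x_train, y_train)
test_mse_simple = mse(simple, x_test, y_test)
test_mse_complex = mse(complex_, x_test, y_test)

print("train:", train_mse_simple, train_mse_complex)
print("test: ", test_mse_simple, test_mse_complex)
```

Compare the two printed rows: a gap between low training error and higher test error for the flexible model is the usual symptom of overfitting.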
What is statistical significance?
A result has statistical significance when it is very unlikely to have occurred if the null hypothesis were true (conventionally indicated by a p-value < .05).
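To make the p-value concrete, here is a small sketch using a coin-flip example (the scenario and the function name `binomial_two_sided_p` are assumptions for illustration). The null hypothesis is that the coin is fair, and the p-value is the probability, under that null, of an outcome at least as unlikely as the one observed:

```python
from math import comb


def binomial_two_sided_p(heads, flips, p_null=0.5):
    """Exact two-sided p-value: total probability, under the null, of every
    outcome that is no more likely than the observed one."""
    probs = [
        comb(flips, k) * p_null**k * (1 - p_null) ** (flips - k)
        for k in range(flips + 1)
    ]
    observed = probs[heads]
    # Small tolerance so ties with the observed probability are included.
    return sum(p for p in probs if p <= observed * (1 + 1e-9))


p_value = binomial_two_sided_p(heads=60, flips=100)
significant = p_value < 0.05
print(p_value, significant)
```

With 60 heads in 100 flips the p-value lands just above .05, so the result is not significant at that threshold; 65 heads would be.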
What is logistic regression?
A statistical method for analyzing a dataset in which one or more independent variables are used to predict an outcome. The outcome is measured with a dichotomous variable (one with only two possible outcomes).
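A rough sketch of how a fitted logistic model turns inputs into a two-class prediction. The coefficients below are hypothetical, not learned from data; the point is only that a linear score is passed through the sigmoid to give a probability, which is then thresholded:

```python
import math


def predict_probability(x, intercept, slope):
    """Logistic model: pass a linear score through the sigmoid to get P(y = 1)."""
    score = intercept + slope * x
    return 1.0 / (1.0 + math.exp(-score))


# Hypothetical coefficients, chosen only for illustration.
intercept, slope = -4.0, 2.0

for x in (0, 2, 4):
    p = predict_probability(x, intercept, slope)
    label = 1 if p >= 0.5 else 0
    print(f"x={x}: P(y=1)={p:.3f} -> predicted class {label}")
```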
What is the coefficient of determination (R2)?
R2 is a measure of how closely the data points fit the regression line.
It tells us how well the regression line predicts the actual values.
The percentage of variation in 'y' that is accounted for by its regression on 'X'
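One way to compute this by hand, assuming a simple one-feature regression on made-up numbers: R2 is one minus the ratio of unexplained variation to total variation in y.

```python
import numpy as np

# Hypothetical data, roughly linear.
X = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])

slope, intercept = np.polyfit(X, y, deg=1)  # least-squares regression line
y_pred = slope * X + intercept

ss_res = np.sum((y - y_pred) ** 2)    # variation left unexplained by the line
ss_tot = np.sum((y - y.mean()) ** 2)  # total variation in y
r_squared = 1 - ss_res / ss_tot
print(r_squared)
```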
What is a high cardinality variable?
At what point would you consider dropping such a variable in regression? At what nunique()?
A variable with a very large number of unique values.
50+
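A short sketch of applying that cutoff with `nunique()`. The DataFrame and column names are hypothetical, and the threshold of 50 is the rule of thumb from this card rather than a fixed standard:

```python
import pandas as pd

# Hypothetical frame: 'user_id' is unique per row, 'plan' has three levels.
df = pd.DataFrame({
    "user_id": [f"u{i}" for i in range(200)],
    "plan": ["free", "basic", "pro", "free"] * 50,
})

THRESHOLD = 50  # rule of thumb; tune it for your data

unique_counts = df.nunique()  # number of distinct values per column
high_cardinality = unique_counts[unique_counts > THRESHOLD].index.tolist()
df_reduced = df.drop(columns=high_cardinality)
print(high_cardinality)
```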
What is a confidence interval, given a sample?
Given a sample, the confidence interval is a calculated range that is likely to contain an unknown population parameter (such as the mean). A 95% confidence level means that the procedure used to build the interval captures the true parameter in about 95% of repeated samples.
If you took 100 samples and computed an interval from each, about 95 of those intervals would contain the population mean.
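The repeated-sampling reading of a confidence interval can be checked with a small simulation. This is a sketch under stated assumptions: a hypothetical normal population, and a normal-approximation interval (mean plus or minus 1.96 standard errors), which is one common choice among several:

```python
import random
import statistics

rng = random.Random(0)
TRUE_MEAN, TRUE_SD, N, TRIALS = 10.0, 2.0, 50, 1000  # hypothetical population


def confidence_interval_95(sample):
    """Normal-approximation 95% interval for the mean: mean +/- 1.96 * SE."""
    mean = statistics.mean(sample)
    se = statistics.stdev(sample) / len(sample) ** 0.5
    return mean - 1.96 * se, mean + 1.96 * se


hits = 0
for _ in range(TRIALS):
    sample = [rng.gauss(TRUE_MEAN, TRUE_SD) for _ in range(N)]
    low, high = confidence_interval_95(sample)
    hits += low <= TRUE_MEAN <= high  # did this interval capture the true mean?

coverage = hits / TRIALS
print(coverage)
```

The printed coverage should sit close to 0.95: it is the intervals that vary from sample to sample, while the population mean stays fixed.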
Mean Absolute Error (MAE)
Measures the average magnitude of the errors in a set of forecasts, without considering their direction. It measures accuracy for continuous variables.
Linear Regression
Linear Regression is a statistical model that seeks to describe the relationship between some y variable and one or more x variables ("Line of Best Fit").
What is the covariance?
Covariance is a measure of how changes in one variable are associated with changes in a second variable.
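A minimal sketch of the sample covariance on made-up numbers (the helper name `sample_covariance` is illustrative). A positive result means the variables tend to move together; a negative result means they move in opposite directions:

```python
def sample_covariance(xs, ys):
    """Average product of deviations from the mean, with an n - 1 denominator."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    products = [(x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)]
    return sum(products) / (len(xs) - 1)


hours = [1, 2, 3, 4, 5]
score_up = [2, 4, 6, 8, 10]    # rises with hours
score_down = [10, 8, 6, 4, 2]  # falls with hours

print(sample_covariance(hours, score_up))    # 5.0
print(sample_covariance(hours, score_down))  # -5.0
```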
What is a method to impute?
values = {'A': 0, 'B': 1, 'C': 2, 'D': 3} #per-column fill values
df.fillna(0) #fill nulls with a value
df.ffill() #forward-fill from the last valid value; df.bfill() fills backward (fillna(method=...) is deprecated in recent pandas)
df.fillna(value=values) #fill each feature with a different value
df.fillna(value=values, limit=1) #fill at most one null per column
What is a baseline?
A baseline is a very basic model/solution used to create predictions for a dataset. You can use these predictions to measure the baseline's performance (e.g., accuracy); this metric then becomes the benchmark against which you compare any other machine learning algorithm.
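For classification, one common baseline is to always predict the majority class. A sketch on hypothetical labels:

```python
from collections import Counter

# Hypothetical labels.
y_train = ["no", "no", "yes", "no", "yes", "no"]
y_test = ["no", "yes", "no", "no"]

# The baseline "model": always predict the most frequent training label.
majority_class = Counter(y_train).most_common(1)[0][0]
baseline_predictions = [majority_class] * len(y_test)

correct = sum(p == t for p, t in zip(baseline_predictions, y_test))
baseline_accuracy = correct / len(y_test)
print(majority_class, baseline_accuracy)  # no 0.75
```

Any real model should beat this number before it is worth keeping. For regression, the analogous baseline is usually to predict the training mean.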
Mean Squared Error (MSE)
Measures the average of the squares of the errors, that is, the average squared difference between the estimated values and the actual values. The MSE is a measure of the quality of an estimator: it is always non-negative, and values closer to zero are better.
Multiple Regression
A linear regression model that involves multiple x variables
What is the norm of a vector?
The norm, or magnitude, of a vector is its length. If you treat the vector as the hypotenuse of a right triangle whose legs are its components, the Pythagorean theorem gives the norm: the square root of the sum of the squared components. We're essentially just generalizing the hypotenuse formula to n-dimensional space.
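As a quick sketch, the Euclidean norm is the square root of the sum of squared components, and the same function works in any number of dimensions:

```python
import math


def norm(vector):
    """Euclidean length: square root of the sum of squared components."""
    return math.sqrt(sum(component ** 2 for component in vector))


print(norm([3, 4]))     # 5.0 (the classic 3-4-5 right triangle)
print(norm([1, 2, 2]))  # 3.0 (same formula in three dimensions)
```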
What is one hot encoding?
The categorical (often integer-encoded) variable is removed and a new binary variable is added for each of its unique values.
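A small sketch using `pd.get_dummies`, which is one of several ways to one-hot encode in pandas. The column name and values are made up:

```python
import pandas as pd

df = pd.DataFrame({"color": ["red", "green", "blue", "green"]})

# The original 'color' column is dropped and replaced by one 0/1 column per value.
encoded = pd.get_dummies(df, columns=["color"], dtype=int)
print(encoded)
```

Each row has exactly one 1 across the new columns, marking which category it belonged to.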
What is model validation?
The process where a trained model is evaluated with a testing data set. The testing data set is a separate portion of the same data set from which the training set is derived.
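A bare-bones sketch of carving a test set out of the same data as the training set. The helper `train_test_split_indices` is hypothetical; in practice scikit-learn's `train_test_split` does this job:

```python
import random


def train_test_split_indices(n_rows, test_fraction=0.25, seed=0):
    """Shuffle row indices, then hold out a fraction of them for testing."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)
    n_test = int(round(n_rows * test_fraction))
    return indices[n_test:], indices[:n_test]


train_idx, test_idx = train_test_split_indices(n_rows=20)
print(len(train_idx), len(test_idx))  # 15 5
```

The two index lists never overlap, so the model is evaluated only on rows it did not see during training.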
Root Mean Squared Error (RMSE)
When is RMSE most useful?
The RMSE is a quadratic scoring rule which measures the average magnitude of the error. The differences between forecasts and corresponding observed values are each squared and then averaged over the sample. Finally, the square root of the average is taken.
When is it most useful? Since the errors are squared before they are averaged, the RMSE gives a relatively high weight to large errors. This means the RMSE is most useful when large errors are particularly undesirable.
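To see the extra weight on large errors, the sketch below compares MAE and RMSE on two made-up sets of predictions with the same total absolute error:

```python
import math


def mae(actual, predicted):
    """Mean absolute error."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


def rmse(actual, predicted):
    """Root mean squared error: square, average, then take the square root."""
    mse = sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)
    return math.sqrt(mse)


actual = [10, 10, 10, 10]
even_errors = [12, 8, 12, 8]       # four errors of size 2
one_big_error = [10, 10, 10, 18]   # one error of size 8

print(mae(actual, even_errors), rmse(actual, even_errors))      # 2.0 2.0
print(mae(actual, one_big_error), rmse(actual, one_big_error))  # 2.0 4.0
```

MAE cannot tell the two cases apart, while RMSE doubles when the error is concentrated in a single large miss.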
What is ridge regression and when is it used?
Ridge Regression is a technique for analyzing multiple regression data that suffer from multicollinearity.
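When multicollinearity occurs, least squares estimates are unbiased, but their variances are large, so they may be far from the true value. Ridge regression adds a penalty on the size of the coefficients, accepting a small bias in exchange for lower variance.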
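A sketch of the effect on two nearly identical features, using the closed-form ridge solution with no intercept. The data, the penalty strength `alpha=1.0`, and the helper name `ridge_coefficients` are all assumptions for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

n = 100
x1 = rng.normal(size=n)
x2 = x1 + rng.normal(scale=0.01, size=n)  # nearly a copy of x1
X = np.column_stack([x1, x2])
y = x1 + x2 + rng.normal(scale=0.5, size=n)


def ridge_coefficients(X, y, alpha):
    """Closed-form ridge solution: (X^T X + alpha * I)^-1 X^T y (no intercept)."""
    n_features = X.shape[1]
    return np.linalg.solve(X.T @ X + alpha * np.eye(n_features), X.T @ y)


ols = ridge_coefficients(X, y, alpha=0.0)    # alpha = 0 is ordinary least squares
ridge = ridge_coefficients(X, y, alpha=1.0)  # hypothetical penalty strength

print("OLS:  ", ols)
print("ridge:", ridge)
```

With collinear inputs the least squares coefficients can swing to large offsetting values; the ridge coefficients are pulled toward smaller, more stable ones.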
What is PCA?
A feature extraction technique that transforms a high dimensional dataset into a new lower dimensional dataset while preserving as much of the original data's information (variance) as possible.
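A minimal sketch of PCA via the singular value decomposition, assuming made-up two-dimensional data in which the second feature is almost a multiple of the first. Reducing to one component keeps nearly all of the variance:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=200)
data = np.column_stack([x, 2 * x + rng.normal(scale=0.1, size=200)])

centered = data - data.mean(axis=0)  # PCA operates on mean-centered data
_, singular_values, components = np.linalg.svd(centered, full_matrices=False)

# Share of total variance captured by each principal component.
explained_variance_ratio = singular_values**2 / np.sum(singular_values**2)

projected = centered @ components[:1].T  # keep only the first component
print(explained_variance_ratio, projected.shape)
```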
What does SelectKBest do, on a high level?
Selects features according to the k highest scores.
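A short sketch with scikit-learn, assuming synthetic data in which only the first and third features drive the target. The scoring function `f_regression` is one possible choice; `SelectKBest` accepts others:

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_regression

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))
# Hypothetical target: depends on features 0 and 2 only.
y = 3 * X[:, 0] - 2 * X[:, 2] + rng.normal(scale=0.5, size=200)

selector = SelectKBest(score_func=f_regression, k=2)
X_selected = selector.fit_transform(X, y)  # keeps the 2 highest-scoring columns
kept = selector.get_support()              # boolean mask over the original columns

print(selector.scores_)
print(kept)
```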
What is the main purpose of using the testing data?
The main purpose of using the testing data set is to test the generalization ability of a trained model.