What does ARIMA stand for?
Autoregressive Integrated Moving Average
As a very rough guideline, when may it be ok to ignore missing data?
If less than 5% of the data is missing
What are two things we need to do with our time column when working with time series?
1. Convert it to a datetime type
2. Make the time column our index
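A minimal pandas sketch of those two steps. The DataFrame and its column names ("date", "sales") are made up for illustration.

```python
import pandas as pd

# Hypothetical data: the column names "date" and "sales" are assumptions.
df = pd.DataFrame({
    "date": ["2021-01-03", "2021-01-01", "2021-01-02"],
    "sales": [9, 10, 12],
})

# 1. Convert the time column from strings to datetimes.
df["date"] = pd.to_datetime(df["date"])

# 2. Make the time column the index (sorting keeps rows in chronological order).
df = df.set_index("date").sort_index()

print(df.index)
```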
What is autocorrelation in time series?
In time series data, autocorrelation refers to the correlation of one variable with lagged versions of itself. (You may also hear the term serial correlation.)
If our data does not have stationarity, what do we need to do?
Difference!
What does VAR in time series modeling stand for?
Vector Autoregression (vector autoregressive modeling)
Name 3 methods for handling missing data
1. Deductive imputation: Using logical rules to fill in the missing values
2. Measure of Central Tendency Imputation
3. Single Regression Imputation - build a model!
4. Multiple Imputation - build multiple models and combine the results
5. Pattern Submodel Approach: break our dataset into subsets based on missingness pattern. We will then build one model on each subset, creating many different models.
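As a rough illustration of method 2 above (measure of central tendency imputation), the sketch below fills missing values with the mean of the observed values. The series is made up, and the mean is only one possible choice; the median or mode would work the same way.

```python
import numpy as np
import pandas as pd

# Hypothetical column with two missing readings.
temps = pd.Series([20.0, np.nan, 22.0, 24.0, np.nan])

# Replace each NaN with the mean of the observed values (20, 22, 24).
filled = temps.fillna(temps.mean())

print(filled.tolist())
```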
Name 3 different time series data manipulations you can make using Pandas
1. .diff() calculates the difference between the value at time T and the value at time T-1.
2. .pct_change() works similarly to .diff(), except the difference is expressed as a percentage change.
3. .rolling() computes statistics over a moving window; for example, the rolling mean is the mean of a moving window across time periods.
4. .shift() is used to bring values from previous dates forward in time.
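The four methods above can be sketched side by side on a small series. The values and dates are made up, and the window size of 2 is an arbitrary choice.

```python
import pandas as pd

# Hypothetical daily series; the values are made up for illustration.
s = pd.Series(
    [100.0, 110.0, 121.0, 115.0],
    index=pd.date_range("2021-01-01", periods=4, freq="D"),
)

out = pd.DataFrame({
    "value": s,
    "diff": s.diff(),                            # value at T minus value at T-1
    "pct_change": s.pct_change(),                # (value at T - value at T-1) / value at T-1
    "rolling_mean": s.rolling(window=2).mean(),  # mean of the current and previous value
    "lag_1": s.shift(1),                         # previous period's value, moved forward one step
})

print(out)
```

The first row of every derived column except "value" is NaN, because there is no earlier observation to compare against.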
How do you perform a train test split with time series data?
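Train on earlier data and test/evaluate on later data.
In a train/test split (TTS), turn shuffling off (for example, shuffle=False in scikit-learn's train_test_split) so the rows stay in chronological order.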
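A minimal sketch of a chronological split using plain slicing. The series and the 80/20 ratio are assumptions for illustration; scikit-learn's train_test_split with shuffle=False produces the same kind of ordered split.

```python
import pandas as pd

# Hypothetical series of 10 daily observations.
s = pd.Series(range(10), index=pd.date_range("2021-01-01", periods=10, freq="D"))

# Hold out the last 20% of observations as the test set, with no shuffling.
split = int(len(s) * 0.8)
train, test = s.iloc[:split], s.iloc[split:]

print(train.index.max(), test.index.min())
```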
Moving average models take what as their input?
- previous error terms
- the goal is to predict future values based on recent forecasting errors
In missing data, what is the difference between MCAR and MAR?
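MCAR is missing completely at random; MAR is missing at random.
With MCAR, the missingness is unrelated to any data, observed or unobserved. With MAR, the missingness is conditional on observed data.
Example: I spilled my drink on my data (MCAR) vs. the sensor stopped working between 5 and 8 p.m. (MAR, since the missingness depends on the observed time).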
What is Benford's Law?
Benford's law is an observation about the frequency distribution of leading digits in many real-life sets of numerical data.
The law states that in many naturally occurring collections of numbers, the leading significant digit is likely to be small.
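To make "likely to be small" concrete: the usual statement of Benford's law gives the probability of leading digit d as log10(1 + 1/d). The sketch below just tabulates that formula; it is not taken from the notes above.

```python
import math

# Benford's law: P(leading digit = d) = log10(1 + 1/d) for d = 1..9.
benford = {d: math.log10(1 + 1 / d) for d in range(1, 10)}

for d, p in benford.items():
    print(d, round(p, 3))
```

Under this formula a leading 1 appears about 30% of the time, while a leading 9 appears less than 5% of the time.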
What is seasonality?
Seasonality describes when a time series is affected by factors that take on a fixed and known frequency. (This doesn't necessarily need to be based on seasons in the year)
What is the difference between autocorrelation plots and partial autocorrelation plots?
The partial autocorrelation is like the autocorrelation in that it checks for the correlation between Y_t and lagged versions of itself.
However, the partial autocorrelation controls for all lower-lag autocorrelations.
For example, the partial autocorrelation between Y_t and Y_t-2 is the correlation between Y_t and Y_t-2 after the autocorrelation between Y_t and Y_t-1 has already been taken into account.
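The difference can be seen numerically on simulated data. This is a sketch under assumptions: the series is an AR(1) process with an arbitrary coefficient of 0.8, and the lag-2 partial autocorrelation is estimated as the regression coefficient on Y_t-2 when Y_t is regressed on both Y_t-1 and Y_t-2 (one common way to estimate it).

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulate an AR(1) process: Y_t = 0.8 * Y_{t-1} + noise (0.8 is an arbitrary choice).
n = 5000
y = np.zeros(n)
for t in range(1, n):
    y[t] = 0.8 * y[t - 1] + rng.normal()

# Autocorrelation at lag 2: plain correlation between Y_t and Y_{t-2}.
acf_2 = np.corrcoef(y[2:], y[:-2])[0, 1]

# Partial autocorrelation at lag 2: coefficient on Y_{t-2} when Y_t is regressed
# on an intercept, Y_{t-1}, and Y_{t-2}, so the lag-1 effect is controlled for.
X = np.column_stack([np.ones(n - 2), y[1:-1], y[:-2]])
coefs, *_ = np.linalg.lstsq(X, y[2:], rcond=None)
pacf_2 = coefs[2]

print(f"lag-2 autocorrelation: {acf_2:.3f}")
print(f"lag-2 partial autocorrelation: {pacf_2:.3f}")
```

In theory the lag-2 autocorrelation of this process is 0.8 squared (0.64), while the lag-2 partial autocorrelation is zero, because Y_t-2 only influences Y_t through Y_t-1. The simulated estimates should land close to those values.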
What are p, q and d in ARIMA modeling?
d: the differencing parameter; it controls how many times we "difference" our time series.
p: is the number of previous values of Y to put in the model
q: is the number of previous errors to put into the model
In missing data, what is NMAR? Give an example
Not Missing at Random
There is a systematic difference between the data observed and the data missing.
- Example: those with lower incomes are less likely to reply to a question about income.
This is a movie starring Tom Hanks in which he plays an Alabama man who lives through major historical events including the Vietnam War and Watergate, all while longing to be reunited with his childhood sweetheart. It's also a model type that models on bootstrapped samples of data and random subsets of features.
Random Forrest Gump
What is Stationarity?
Stationarity means that there aren't systematic changes in our time series over time. There is no trend and there is no seasonality.
Say we run an Augmented Dickey-Fuller test. Our alpha is 0.05 and we get a p-value of 0.043. What can we conclude?
The null hypothesis of the test is that the series is not stationary. Since 0.043 is under 0.05, we have evidence to reject the null hypothesis, meaning we can treat our time series data as stationary.
Why are autocorrelation and partial autocorrelation plots helpful?
What is AIC?
Akaike Information Criterion.
A common way to evaluate time series models - it attempts to measure how much information we lose when we simplify reality with a model. The lower the AIC, the better
In this movie, a World War II American Army Medic who served during the Battle of Okinawa, refuses to kill people, and becomes the first man in American history to receive the Medal of Honor without firing a shot. It is also a regularization type...
Hacksaw Ridge
How do we check for stationarity in our data?
Perform an Augmented Dickey-Fuller Test
Name 2 disadvantages to time series analysis methods
This is a movie starring Sean Connery and Nicolas Cage in which a mild-mannered chemist and an ex-con must lead the counterstrike when a rogue group of military men, led by a renegade general, threaten a nerve gas attack from Alcatraz against San Francisco. This is also a classification metric where, in binary classification, a score over 0.5 indicates your model is doing better than random chance.
The ROCk_AUC score