What does the p in P-value stand for?
Probability
What does the .split function do?
It splits data into pieces. Most often it splits a string into a list of substrings, but on a table it can also split the rows into a training set and a test set, given the number of rows to put in the first set.
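As a small illustration of the string case, here is a sketch using Python's built-in str.split. The sample strings are made up for the example.

```python
# Built-in string splitting; the sample text is invented for illustration.
line = "name,formula,weight"
fields = line.split(",")        # split on every comma
words = "a  b   c".split()      # no argument: split on any run of whitespace

print(fields)
print(words)
```

With no argument, split() also drops the empty strings that repeated spaces would otherwise produce.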
What is Bootstrapping?
A resampling technique used in statistics to estimate the sampling distribution of a statistic by repeatedly drawing samples, with replacement, from the observed data.
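A minimal sketch of the idea, using only the standard library. The function name bootstrap_means, the sample values, the seed, and the choice of the mean as the statistic are all assumptions made for illustration.

```python
import random
import statistics

def bootstrap_means(sample, n_resamples=2000, seed=0):
    """Return the mean of each of n_resamples resamples drawn with replacement."""
    rng = random.Random(seed)
    n = len(sample)
    return [statistics.mean(rng.choices(sample, k=n)) for _ in range(n_resamples)]

# Hypothetical observations, invented for this example.
sample = [2.1, 3.4, 1.9, 5.0, 4.2, 3.3, 2.8, 4.7]
means = sorted(bootstrap_means(sample))

# Percentile method: take the middle 95% of the bootstrap means.
lower = means[int(0.025 * len(means))]
upper = means[int(0.975 * len(means)) - 1]

print(f"sample mean = {statistics.mean(sample):.2f}")
print(f"approximate 95% interval: ({lower:.2f}, {upper:.2f})")
```

The sorted list of resampled means is the bootstrap estimate of the sampling distribution of the mean.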
What does the .split function do in this instance, and which set gets 80% of the data and which gets the other 20%?
trainf, testf = ROH_data.split(int(0.8 * ROH_data.num_rows))
print(trainf.num_rows, 'training and', testf.num_rows, 'test instances.')
It splits the data 80/20. The training set (trainf) gets 80% of the rows, and the remaining 20% (testf) is held out to test and validate the model.
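The same idea can be sketched in plain Python, assuming .split(k) puts k randomly chosen rows in the first result and the rest in the second. The helper split_rows and the 50 stand-in rows are hypothetical; this is not the table library's own implementation.

```python
import random

def split_rows(rows, k, seed=0):
    """Shuffle a copy of rows, then return (first k rows, remaining rows)."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    return shuffled[:k], shuffled[k:]

rows = list(range(50))                      # stand-in for a 50-row table
train, test = split_rows(rows, int(0.8 * len(rows)))

print(len(train), 'training and', len(test), 'test instances.')
```

Every row lands in exactly one of the two sets, so the model is never tested on a row it was trained on.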
What is positive correlation?
It means that two variables move together in the same direction: when one increases the other tends to increase, and when one decreases the other tends to decrease.
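One common way to put a number on this is the Pearson correlation coefficient, which is positive when the variables move in the same direction. The sketch below assumes that measure; the function name pearson_r and the data are invented for illustration.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Hypothetical data: scores rise with hours in one case and fall in the other.
hours = [1, 2, 3, 4, 5]
score_up = [52, 58, 61, 70, 74]
score_down = [74, 70, 61, 58, 52]

print(pearson_r(hours, score_up))     # positive: the two move together
print(pearson_r(hours, score_down))   # negative: they move in opposite directions
```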
What does RMSE stand for?
Root Mean Squared Error
What does the .loc() function do?
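It selects rows and columns from a dataset by label, for example a column by its name, rather than by integer position. It is used with square brackets, not parentheses.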
Why is it better to use more samples in bootstrapping?
More resamples reduce the random variation that comes from the resampling itself, so the estimate is more numerically stable and precise.
What is the difference between a train and test set?
The training data is used to build the model and help it identify patterns, while the test data is used to check the model's accuracy on unseen values.
Name the 4 TAs/CAs
Mohammad, Chloe, Tyler, Max
What does the k stand for in k nearest neighbors?
k is the number of neighbors to consider: once the training rows are sorted by distance to the observation being predicted, the k closest rows/observations are used.
explain the difference between df[0] and df[7]
df[0] holds the name of the molecule, while df[7] holds the structure of the molecule named in df[0].
What types of datasets is bootstrapping best suited for?
Datasets with small numbers of samples/datapoints, non-normal datasets, and/or when theoretical formulas are complex
Describe what RMSE does
It measures the typical size of the differences between the observed and predicted values. It shows how close the model's predictions are to the actual data.
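A short sketch of the calculation: square each error, average the squares, and take the square root. The function name rmse and the numbers are made up for illustration.

```python
import math

def rmse(observed, predicted):
    """Square root of the mean of the squared errors."""
    squared_errors = [(o - p) ** 2 for o, p in zip(observed, predicted)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))

# Hypothetical observed values and model predictions.
observed = [3.0, 5.0, 7.0, 9.0]
predicted = [2.0, 5.0, 8.0, 11.0]

print(rmse(observed, predicted))   # errors are 1, 0, -1, -2
print(rmse(observed, observed))    # perfect predictions give 0.0
```

RMSE is in the same units as the values being predicted, which makes it easy to compare against the data.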
What country outside of U.S.A. has Dr. Smith spent the most time?
Norway
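What is R^2?
The coefficient of determination, a goodness-of-fit measure: the proportion of the variation in the observed data that the model explains.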
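As a rough sketch, R^2 can be computed as 1 minus the ratio of the residual sum of squares to the total sum of squares. The function name r_squared and the data are assumptions for illustration.

```python
def r_squared(observed, predicted):
    """1 - (residual sum of squares) / (total sum of squares)."""
    mean_obs = sum(observed) / len(observed)
    ss_res = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    ss_tot = sum((o - mean_obs) ** 2 for o in observed)
    return 1 - ss_res / ss_tot

# Hypothetical observed values and model predictions.
observed = [3.0, 5.0, 7.0, 9.0]
predicted = [2.0, 5.0, 8.0, 11.0]

r2 = r_squared(observed, predicted)
print(r2)
```

A value of 1 means the predictions match the observed values exactly; values near 0 mean the model does little better than always predicting the mean.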
What are the best functions for data cleaning?
np.isnan, .where, np.unique
What types of data can throw off bootstrapping?
Extreme values/outliers
Is it better to have a higher or lower RMSE?
Lower
What is NaN/null data in a dataset?
Data that is missing from the dataset
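A small sketch of finding and dropping missing values, here using math.isnan from the standard library as a stand-in for np.isnan. The readings are invented for the example.

```python
import math

# Hypothetical measurements with two missing entries stored as NaN.
readings = [4.2, float("nan"), 3.9, float("nan"), 5.1]

missing = [i for i, v in enumerate(readings) if math.isnan(v)]
clean = [v for v in readings if not math.isnan(v)]

print(missing)   # positions of the missing values
print(clean)     # the values that remain

# NaN is not equal to itself, so == cannot be used to find it.
print(float("nan") == float("nan"))
```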
What does CSV stand for?
Comma separated values.
What does .append do?
It attaches a new item to the end of a list, or a new row (or the rows of another dataset) to the end of a dataset.
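For plain Python lists the behavior looks like the sketch below. The rows are illustrative; note that for a list, adding all the items of another list is done with extend rather than append.

```python
# Each row is [name, molar mass in g/mol]; values are for illustration.
rows = [["water", 18.02]]

rows.append(["ethanol", 46.07])        # adds one new row to the end

more_rows = [["methane", 16.04], ["benzene", 78.11]]
rows.extend(more_rows)                 # adds every row from another list

print(len(rows))
print(rows[-1])
```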
How many samples are enough for bootstrapping?
1,000 - 10,000
What does nearest neighbor mean in the context of a training data set?
The rows/observations which are closest in terms of values of attributes
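A minimal sketch of a k nearest neighbors classifier, assuming Euclidean distance between attribute values and a majority vote among the k closest training rows. The function name knn_predict, the points, and the labels are hypothetical.

```python
import math
from collections import Counter

def knn_predict(train_rows, train_labels, point, k):
    """Predict a label by majority vote among the k closest training rows."""
    # Pair each training row's distance to the point with that row's label.
    distances = sorted(
        (math.dist(row, point), label)
        for row, label in zip(train_rows, train_labels)
    )
    nearest_labels = [label for _, label in distances[:k]]
    return Counter(nearest_labels).most_common(1)[0][0]

# Hypothetical training data: two attributes per row, two classes.
train_rows = [(1.0, 1.0), (1.5, 2.0), (2.0, 1.0), (6.0, 6.5), (7.0, 6.0), (6.5, 7.0)]
train_labels = ["A", "A", "A", "B", "B", "B"]

print(knn_predict(train_rows, train_labels, (1.8, 1.5), k=3))
print(knn_predict(train_rows, train_labels, (6.2, 6.4), k=3))
```

An odd k is a common choice with two classes because it avoids tied votes.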
What does a p-value measure?
It measures the strength of evidence against a null hypothesis: it is the probability of obtaining results at least as extreme as those observed if the null hypothesis is true.
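As a worked example under assumed numbers, suppose the null hypothesis is that a coin is fair and 60 heads are observed in 100 flips. The one-sided p-value is the probability of 60 or more heads under that null hypothesis. The function name and the scenario are invented for illustration.

```python
from math import comb

def one_sided_p_value(heads, flips):
    """P(at least `heads` heads in `flips` flips of a fair coin)."""
    favorable = sum(comb(flips, k) for k in range(heads, flips + 1))
    return favorable / 2 ** flips

p = one_sided_p_value(60, 100)
print(f"p-value = {p:.4f}")
```

A small p-value means a result this extreme would be unlikely if the null hypothesis were true, which counts as evidence against it.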