What is the concept known as "garbage in garbage out"?
If you train your model on bad data, you will most likely end up with a bad model.
What is the purpose of a train, test, and validation set?
Train: fit the model's parameters; validation: tune hyper-parameters and compare candidate models to decide which one to use; test: obtain a final, unbiased estimate of how well the chosen model generalises, ideally used only once.
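As an illustration of keeping the three sets disjoint, here is a minimal sketch using only the standard library. The function name `split_indices`, the 70/15/15 proportions and the fixed seed are assumptions chosen for the example, not recommendations from the source.

```python
import random

def split_indices(n_samples, val_frac=0.15, test_frac=0.15, seed=0):
    """Shuffle sample indices once, then cut them into three disjoint groups."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)  # seeded so the split is reproducible
    n_test = round(n_samples * test_frac)
    n_val = round(n_samples * val_frac)
    test_idx = indices[:n_test]                 # held back for the final evaluation
    val_idx = indices[n_test:n_test + n_val]    # used for tuning and model comparison
    train_idx = indices[n_test + n_val:]        # used for fitting the model
    return train_idx, val_idx, test_idx

train_idx, val_idx, test_idx = split_indices(100)
print(len(train_idx), len(val_idx), len(test_idx))
```

A plain random split like this assumes the samples are independent; grouped or time-ordered data would need a different splitting scheme.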
Should you look at some or all the data in exploratory data analysis?
Only some. You should avoid looking closely at any test data in the initial exploratory analysis stage. Otherwise, you might, consciously or unconsciously, make assumptions that limit the generality of your model in an untestable way.
What are some of the latest DL models?
Multilayer perceptrons (MLP) and recurrent neural networks (particularly LSTM) have been popular for some time, but are increasingly being replaced by newer models such as convolutional neural networks (CNN) and transformers. CNNs (see Li et al. [2021] for a review) are now the go-to model for many tasks, and can be applied to both image data and non-image data. Beyond the use of convolutional layers, some of the main milestones that led to the success of CNNs include the use of rectified linear units (ReLU), the adoption of modern optimisers (notably Adam and its variants) and the widespread use of regularisation, especially dropout layers and batch normalisation. Give serious consideration to including these in your models. Another important group of contemporary models is transformers (see Lin et al. [2022] for a review).
How much data do you need to train a model?
It all depends on the signal-to-noise ratio in the data set. If the signal is strong, then you can get away with less data; if it’s weak, then you need more data.
The inductive biases of ML models are the kinds of relationships they are capable of modelling. Give some examples.
For instance, linear models, such as linear regression and logistic regression, are a good choice if you know there are no important non-linear relationships between the features in your data, but a bad choice otherwise.
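To make the point about linear models concrete, the sketch below fits a straight line by ordinary least squares to two toy data sets, one linear and one quadratic, and compares the R² scores. The helper names `fit_line` and `r_squared` and the toy data are assumptions made up for this illustration.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variance = sum((x - mean_x) ** 2 for x in xs)
    slope = covariance / variance
    return slope, mean_y - slope * mean_x

def r_squared(xs, ys, slope, intercept):
    """Fraction of the variance in ys explained by the fitted line."""
    mean_y = sum(ys) / len(ys)
    ss_res = sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return 1.0 - ss_res / ss_tot

xs = [-3.0, -2.0, -1.0, 0.0, 1.0, 2.0, 3.0]
linear_ys = [2.0 * x + 1.0 for x in xs]   # relationship a line can represent
quadratic_ys = [x ** 2 for x in xs]       # relationship a line cannot represent

r2_linear = r_squared(xs, linear_ys, *fit_line(xs, linear_ys))
r2_quadratic = r_squared(xs, quadratic_ys, *fit_line(xs, quadratic_ys))
print(f"R^2 on linear data: {r2_linear:.2f}, on quadratic data: {r2_quadratic:.2f}")
```

On the quadratic data the best straight line is flat, so it explains none of the variance even though the relationship is perfectly deterministic.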
What are things you should do when you don't have enough data?
Try to find more data; use data augmentation; use transfer learning.
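As a rough sketch of data augmentation, the example below doubles a tiny image data set by adding a mirrored, slightly noisy copy of each image. The function names, the noise scale and the toy image are hypothetical, and the sketch assumes the label is unchanged by a horizontal flip, which is not true for every task.

```python
import random

def flip_horizontal(image):
    """Mirror a 2-D image (a list of pixel rows) left to right."""
    return [row[::-1] for row in image]

def add_noise(image, rng, scale=0.05):
    """Add small Gaussian noise to every pixel."""
    return [[pixel + rng.gauss(0.0, scale) for pixel in row] for row in image]

def augment_dataset(images, labels, seed=0):
    """Return the originals plus one flipped, noisy copy of each image."""
    rng = random.Random(seed)
    aug_images, aug_labels = list(images), list(labels)
    for image, label in zip(images, labels):
        aug_images.append(add_noise(flip_horizontal(image), rng))
        aug_labels.append(label)  # assumes the flip does not change the label
    return aug_images, aug_labels

images = [[[0.0, 1.0], [0.5, 0.2]]]
labels = ["cat"]
aug_images, aug_labels = augment_dataset(images, labels)
print(len(aug_images), aug_labels)
```

Augmentation like this should be applied to the training set only, so that the test set still reflects real, unmodified data.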
During data preparation, what is the likely result of scaling variables using the means and ranges computed over the whole data set?
Information leakage between the train and test data.
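One way to avoid this kind of leakage is to compute the scaling statistics from the training data alone and then reuse them on the test data. The sketch below shows the idea for a single variable; the helper names and the toy values are assumptions for illustration.

```python
import statistics

def fit_standardiser(train_values):
    """Compute the mean and standard deviation from the training data only."""
    mean = statistics.fmean(train_values)
    std = statistics.pstdev(train_values)
    return mean, (std if std > 0 else 1.0)  # avoid dividing by zero for constant columns

def apply_standardiser(values, mean, std):
    """Scale any data set using statistics that were fitted elsewhere."""
    return [(value - mean) / std for value in values]

train_values = [2.0, 4.0, 6.0, 8.0]
test_values = [10.0, 12.0]

mean, std = fit_standardiser(train_values)  # the test data is never seen here
train_scaled = apply_standardiser(train_values, mean, std)
test_scaled = apply_standardiser(test_values, mean, std)
print(train_scaled, test_scaled)
```

The scaled test values fall outside the range of the scaled training values, which is expected: the test set is treated as unseen data, not as part of the fit.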
Should you think about how your model will be deployed, and if so, what are some questions you should ask?
Yes. If it’s going to be deployed in a resource-limited environment, such as a sensor or a robot, this may place limitations on the complexity of the model. If there are time constraints, e.g. a classification of a signal is required within milliseconds, then this also needs to be taken into account when selecting a model. Another consideration is how the model is going to be tied into the broader software system within which it is deployed; this procedure is often far from simple. Emerging approaches such as MLOps aim to address some of the difficulties.
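For the time-constraint question, a quick check is to measure inference latency against a budget before committing to a model. The sketch below is only illustrative: `predict` is a hypothetical stand-in for a real model's inference call, and the 5 ms budget is an invented figure.

```python
import time

def predict(features):
    # Hypothetical stand-in for a real model's inference call.
    weights = (0.4, -0.2, 0.1)
    return sum(w * x for w, x in zip(weights, features)) > 0

def median_latency_ms(fn, example, repeats=200):
    """Time repeated calls and return the median latency in milliseconds."""
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn(example)
        timings.append((time.perf_counter() - start) * 1000.0)
    timings.sort()
    return timings[len(timings) // 2]

budget_ms = 5.0  # assumed latency budget for this example
latency = median_latency_ms(predict, (1.0, 2.0, 3.0))
print(f"median latency {latency:.4f} ms, within budget: {latency <= budget_ms}")
```

Measurements taken on a development machine can differ a lot from those on the target hardware, so the timing is best repeated on the device the model will actually run on.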
Should you do feature selection first or data split first? How about dimensionality reduction and data split?
Split the data first. You should only use the training set to select the features which are used in both the training set and the test set (see Figure 3). The same is true when doing dimensionality reduction. For example, if you’re using principal component analysis (PCA), the component weightings should be determined by looking only at the training data; the same weightings should then be applied to the test set.
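The same pattern can be sketched for feature selection: choose the features from the training rows, then apply that choice unchanged to the test rows. The example below ranks features by absolute Pearson correlation with the target, which is one simple selection criterion picked for illustration; the function names and toy data are assumptions.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient; returns 0.0 if either input is constant."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / math.sqrt(var_x * var_y)

def select_features(X_train, y_train, k):
    """Pick the k columns most correlated with the target, using training rows only."""
    scores = []
    for j in range(len(X_train[0])):
        column = [row[j] for row in X_train]
        scores.append((abs(pearson(column, y_train)), j))
    top = sorted(scores, reverse=True)[:k]
    return sorted(j for _, j in top)

def keep_columns(X, columns):
    """Apply a previously chosen column selection to any data set."""
    return [[row[j] for j in columns] for row in X]

X_train = [[1.0, 5.0, 0.3], [2.0, 3.0, 0.1], [3.0, 4.0, 0.4], [4.0, 6.0, 0.2]]
y_train = [1.1, 1.9, 3.2, 3.9]
X_test = [[5.0, 2.0, 0.5], [6.0, 7.0, 0.0]]

selected = select_features(X_train, y_train, k=1)  # the test rows play no part
X_train_sel = keep_columns(X_train, selected)
X_test_sel = keep_columns(X_test, selected)
print(selected, X_test_sel)
```

With PCA the structure would be the same: fit the component weightings on the training rows, then project both sets with those weightings.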
How do you know when your test set is appropriate, and what do you do when it is not?
It should not overlap with the training set, and it should be representative of the wider population.
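The overlap condition can be checked mechanically. The sketch below looks for test rows that also appear verbatim in the training set; the function name and toy rows are assumptions, and it only catches exact duplicates, not near-duplicates. It says nothing about whether the test set is representative.

```python
def find_overlap(train_rows, test_rows):
    """Return the test rows that also occur, exactly, in the training rows."""
    train_set = {tuple(row) for row in train_rows}
    return [row for row in test_rows if tuple(row) in train_set]

train_rows = [[5.1, 3.5], [4.9, 3.0], [6.2, 2.9]]
test_rows = [[4.9, 3.0], [5.8, 2.7]]

duplicates = find_overlap(train_rows, test_rows)
print(f"{len(duplicates)} of {len(test_rows)} test rows also appear in the training set")
```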