Name two regression metrics.
R², MSE (others include RMSE and MAPE).
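A quick check of these with scikit-learn (assuming a recent version; the arrays are made-up examples):

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score, mean_absolute_percentage_error

y_true = np.array([3.0, 5.0, 2.5, 7.0])   # made-up ground truth
y_pred = np.array([2.8, 5.4, 2.9, 6.5])   # made-up predictions

mse = mean_squared_error(y_true, y_pred)
print(mse)                                             # MSE
print(np.sqrt(mse))                                    # RMSE: square root of MSE
print(r2_score(y_true, y_pred))                        # R²
print(mean_absolute_percentage_error(y_true, y_pred))  # MAPE
```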
Why do we use the Sigmoid function for binary classification?
It maps values into (0, 1), which can be interpreted as probabilities.
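A minimal sketch of the function itself in NumPy:

```python
import numpy as np

def sigmoid(z):
    # squashes any real number into the open interval (0, 1)
    return 1.0 / (1.0 + np.exp(-z))

print(sigmoid(np.array([-4.0, 0.0, 4.0])))  # ~[0.018, 0.5, 0.982]
```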
Key hyperparameter of K-Means clustering.
The number of clusters (k).
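For example, with scikit-learn (toy random data; k=3 is an arbitrary choice here):

```python
import numpy as np
from sklearn.cluster import KMeans

X = np.random.rand(100, 2)                            # toy 2-D points
km = KMeans(n_clusters=3, n_init=10, random_state=0)  # k is the key hyperparameter
labels = km.fit_predict(X)
```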
Process of adjusting the weights of a NN based on the error of its predictions.
Backpropagation.
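A minimal sketch of the idea for a single sigmoid neuron with binary cross-entropy loss (the training example is made up):

```python
import numpy as np

x, y = np.array([1.0, 2.0]), 1.0   # one made-up training example
w, b, lr = np.zeros(2), 0.0, 0.1

for _ in range(100):
    p = 1 / (1 + np.exp(-(w @ x + b)))  # forward pass
    dz = p - y                          # dLoss/dz for sigmoid + cross-entropy
    w -= lr * dz * x                    # chain rule: dLoss/dw = dz * x
    b -= lr * dz                        # adjust weights against the error
```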
Basic unit of text used by models.
Token.
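Real tokenizers use subword schemes such as BPE; the crudest illustration is a whitespace split:

```python
text = "Models read text as tokens"
print(text.split())  # ['Models', 'read', 'text', 'as', 'tokens']
```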
What is the Confusion Matrix?
Table that shows the counts of TP, TN, FP, and FN for a classifier.
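With scikit-learn (made-up labels):

```python
from sklearn.metrics import confusion_matrix

y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]
# rows = actual, columns = predicted; for binary labels: [[TN, FP], [FN, TP]]
print(confusion_matrix(y_true, y_pred))
```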
How does a Decision Tree split the data in each node?
It chooses the split that yields the purest child nodes, i.e., the one that minimizes an impurity measure such as Gini impurity or entropy.
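A minimal sketch of the Gini impurity that such splits try to reduce:

```python
import numpy as np

def gini(labels):
    # 1 - sum of squared class proportions; 0 means a perfectly pure node
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

print(gini([0, 0, 0, 0]))  # 0.0  (pure)
print(gini([0, 0, 1, 1]))  # 0.5  (maximally mixed for two classes)
```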
Main challenge that recommender systems face with new users.
The Cold Start Problem.
Activation function that outputs the input directly if it’s positive, and zero otherwise.
ReLU.
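In NumPy:

```python
import numpy as np

def relu(z):
    # positive inputs pass through unchanged; negatives become zero
    return np.maximum(0.0, z)

print(relu(np.array([-2.0, 0.0, 3.0])))  # [0. 0. 3.]
```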
What are word embeddings?
Words represented as numerical vectors so that the ones with similar meanings are close in the vector space.
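Closeness is usually measured with cosine similarity; the 3-D vectors below are made up for illustration (real embeddings have hundreds of dimensions):

```python
import numpy as np

def cosine_similarity(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

king = np.array([0.90, 0.80, 0.10])
queen = np.array([0.85, 0.82, 0.12])
banana = np.array([0.10, 0.05, 0.90])
print(cosine_similarity(king, queen))   # close to 1: similar meaning
print(cosine_similarity(king, banana))  # much lower
```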
Harmonic mean of Precision and Recall.
F1-Score.
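As a formula: F1 = 2PR / (P + R). A quick check of why the harmonic mean punishes imbalance between the two:

```python
def f1(precision, recall):
    return 2 * precision * recall / (precision + recall)

print(f1(0.9, 0.5))  # ~0.643, well below the arithmetic mean of 0.7
```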
Name 3 variants of Gradient Descent.
Batch, Stochastic, Mini-Batch.
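A minimal mini-batch sketch for linear regression with MSE loss; batch_size=len(X) recovers Batch GD and batch_size=1 recovers Stochastic GD:

```python
import numpy as np

def minibatch_gd(X, y, lr=0.01, batch_size=16, epochs=50):
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        order = np.random.permutation(len(X))  # reshuffle each epoch
        for start in range(0, len(X), batch_size):
            b = order[start:start + batch_size]
            grad = 2 * X[b].T @ (X[b] @ w - y[b]) / len(b)  # MSE gradient
            w -= lr * grad
    return w
```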
What is dimensionality reduction and why is it useful?
Reducing the number of input features while preserving relevant information. Useful for visualization, noise reduction, faster computation, and avoiding overfitting.
What’s the cost of flattening an image and where is it needed?
Loss of spatial information. Needed between the convolutional and the feedforward layers.
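For example, in NumPy (toy dimensions):

```python
import numpy as np

feature_maps = np.random.rand(8, 8, 32)  # toy conv output: height x width x channels
flat = feature_maps.reshape(-1)          # 1-D vector for the dense layers
print(flat.shape)                        # (2048,) -- the spatial layout is gone
```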
Carefully crafting instructions to guide an LLM's output.
Prompt Engineering.
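A made-up illustration of the same task with and without explicit instructions:

```python
vague = "Summarize this."
engineered = (
    "You are a technical editor. Summarize the text below in exactly three "
    "bullet points, each under 15 words, for a non-expert audience."
)
```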
Which metric can be misleading on imbalanced data, and why?
Accuracy. It can be high even if the model ignores the minority class.
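A quick worked example:

```python
import numpy as np

# 95% of samples are class 0; always predicting 0 scores 95% accuracy
# while never detecting a single minority-class sample.
y_true = np.array([0] * 95 + [1] * 5)
y_pred = np.zeros(100, dtype=int)
print((y_true == y_pred).mean())  # 0.95
```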
Difference between Bagging and Boosting.
Bagging trains models independently in parallel; Boosting trains them sequentially, with each model correcting its predecessors' errors.
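For example, with scikit-learn (toy data; AdaBoost stands in for boosting in general):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier, BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=200, random_state=0)
bag = BaggingClassifier(DecisionTreeClassifier(), n_estimators=50).fit(X, y)  # parallel
boost = AdaBoostClassifier(n_estimators=50).fit(X, y)                         # sequential
```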
How does PCA work?
It finds new dimensions (principal components) that maximize variance in the data.
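For example, with scikit-learn on the 4-feature Iris data:

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

X = load_iris().data
pca = PCA(n_components=2).fit(X)
X2 = pca.transform(X)                  # projected onto the top-2 variance directions
print(pca.explained_variance_ratio_)   # share of variance each component keeps
```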
Regularization technique that randomly ignores some neurons during training to prevent overfitting.
Dropout.
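A minimal sketch of (inverted) dropout in NumPy:

```python
import numpy as np

def dropout(activations, p=0.5, training=True):
    if not training:
        return activations                           # no dropout at inference time
    mask = np.random.rand(*activations.shape) > p    # randomly ignore neurons
    return activations * mask / (1 - p)              # rescale so expected values match

print(dropout(np.ones(10)))
```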
Explain the RAG technique.
Retrieval-Augmented Generation: combining document retrieval with a language model to generate informed, context-aware answers.
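A toy, self-contained sketch: the keyword "retriever" and the generate() stub below are stand-ins for a real vector store and LLM API:

```python
def retrieve(question, documents, top_k=2):
    # toy retriever: rank documents by word overlap with the question
    q = set(question.lower().split())
    ranked = sorted(documents, key=lambda d: -len(q & set(d.lower().split())))
    return ranked[:top_k]

def generate(prompt):
    return f"[an LLM would answer here, given:]\n{prompt}"  # stand-in for a model call

docs = ["RAG retrieves documents first.", "Transformers use attention."]
context = "\n".join(retrieve("What does RAG do first?", docs))
print(generate(f"Context:\n{context}\n\nQuestion: What does RAG do first?"))
```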
How do you interpret the ROC curve and what are the axes?
x = FPR, y = TPR (recall). The closer the curve is to the top-left corner, the better the classifier.
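With scikit-learn (made-up labels and scores):

```python
from sklearn.metrics import roc_auc_score, roc_curve

y_true = [0, 0, 1, 1, 0, 1]
scores = [0.1, 0.4, 0.35, 0.8, 0.2, 0.7]  # predicted probabilities
fpr, tpr, thresholds = roc_curve(y_true, scores)
print(roc_auc_score(y_true, scores))      # area under the curve; 1.0 is perfect
```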
How do L1 and L2 regularization affect linear model coefficients?
L1 pushes some coefficients exactly to zero (implicit feature selection); L2 shrinks all coefficients toward zero without eliminating them.
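The difference is easy to see on toy data where only 3 of 10 features matter:

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, Ridge

X, y = make_regression(n_samples=100, n_features=10, n_informative=3, random_state=0)
print(Lasso(alpha=1.0).fit(X, y).coef_)  # several coefficients exactly 0
print(Ridge(alpha=1.0).fit(X, y).coef_)  # all shrunk, none exactly 0
```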
Difference between agglomerative and divisive clustering.
Agglomerative starts with each point as its own cluster and merges; divisive starts with all points in one cluster and splits it recursively.
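For example, the agglomerative (bottom-up) variant in scikit-learn on toy data (scikit-learn has no divisive counterpart):

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering

X = np.random.rand(50, 2)  # toy 2-D points
# each point starts as its own cluster; the closest pairs merge repeatedly
labels = AgglomerativeClustering(n_clusters=3).fit_predict(X)
```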
Typical CNN architecture for a classification problem.
Convolutional and pooling layers to extract features, followed by fully connected layers and a softmax output layer.
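A minimal sketch in Keras (assuming TensorFlow is installed; 32x32 RGB inputs and 10 classes are arbitrary choices):

```python
from tensorflow.keras import layers, models

model = models.Sequential([
    layers.Input(shape=(32, 32, 3)),
    layers.Conv2D(32, 3, activation="relu"),  # feature extraction
    layers.MaxPooling2D(),
    layers.Conv2D(64, 3, activation="relu"),
    layers.MaxPooling2D(),
    layers.Flatten(),                         # bridge to the dense layers
    layers.Dense(64, activation="relu"),
    layers.Dense(10, activation="softmax"),   # class probabilities
])
```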
What’s the key innovation of Transformers and what is it for?
Self-attention. Allows the model to capture dependencies between tokens in parallel, regardless of distance.
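A minimal single-head sketch in NumPy (random toy weights; real models learn Wq, Wk, Wv):

```python
import numpy as np

def self_attention(X, Wq, Wk, Wv):
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    scores = Q @ K.T / np.sqrt(K.shape[-1])  # all token pairs scored at once
    weights = np.exp(scores) / np.exp(scores).sum(axis=-1, keepdims=True)  # softmax
    return weights @ V

X = np.random.rand(5, 8)                    # 5 tokens, 8-dim embeddings
Wq, Wk, Wv = (np.random.rand(8, 8) for _ in range(3))
print(self_attention(X, Wq, Wk, Wv).shape)  # (5, 8)
```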