Name two regression metrics.
R², MSE (others include RMSE and MAPE).
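A quick check of these with scikit-learn (assuming a recent version; the arrays are made-up examples):

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score, mean_absolute_percentage_error

y_true = np.array([3.0, 5.0, 2.5, 7.0])   # made-up ground truth
y_pred = np.array([2.8, 5.4, 2.9, 6.5])   # made-up predictions

mse = mean_squared_error(y_true, y_pred)
print(mse)                                             # MSE
print(np.sqrt(mse))                                    # RMSE: square root of MSE
print(r2_score(y_true, y_pred))                        # R²
print(mean_absolute_percentage_error(y_true, y_pred))  # MAPE
```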
Why do we use the Sigmoid function for binary classification?
It maps values into (0, 1), which can be interpreted as probabilities.
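A minimal sketch of the function itself in NumPy:

```python
import numpy as np

def sigmoid(z):
    # squashes any real number into the open interval (0, 1)
    return 1.0 / (1.0 + np.exp(-z))

print(sigmoid(np.array([-4.0, 0.0, 4.0])))  # ~[0.018, 0.5, 0.982]
```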
Key hyperparameter of K-Means clustering.
The number of clusters (k).
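For example, with scikit-learn (toy random data; k=3 is an arbitrary choice here):

```python
import numpy as np
from sklearn.cluster import KMeans

X = np.random.rand(100, 2)                            # toy 2-D points
km = KMeans(n_clusters=3, n_init=10, random_state=0)  # k is the key hyperparameter
labels = km.fit_predict(X)
```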
Process of adjusting the weights of a NN based on the error of its predictions.
Backpropagation.
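A minimal sketch of the idea for a single sigmoid neuron with binary cross-entropy loss (the training example is made up):

```python
import numpy as np

x, y = np.array([1.0, 2.0]), 1.0   # one made-up training example
w, b, lr = np.zeros(2), 0.0, 0.1

for _ in range(100):
    p = 1 / (1 + np.exp(-(w @ x + b)))  # forward pass
    dz = p - y                          # dLoss/dz for sigmoid + cross-entropy
    w -= lr * dz * x                    # chain rule: dLoss/dw = dz * x
    b -= lr * dz                        # adjust weights against the error
```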
Basic unit of text used by models.
Token.
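Real tokenizers use subword schemes such as BPE; the crudest illustration is a whitespace split:

```python
text = "Models read text as tokens"
print(text.split())  # ['Models', 'read', 'text', 'as', 'tokens']
```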
What is the Confusion Matrix?
Table that shows the counts of TP, TN, FP, and FN for a classifier.
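With scikit-learn (made-up labels):

```python
from sklearn.metrics import confusion_matrix

y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]
# rows = actual, columns = predicted; for binary labels: [[TN, FP], [FN, TP]]
print(confusion_matrix(y_true, y_pred))
```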
How does a Decision Tree split the data in each node?
It chooses the split that yields the purest child nodes, i.e., the one that minimizes an impurity measure such as Gini impurity or entropy.
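A minimal sketch of the Gini impurity that such splits try to reduce:

```python
import numpy as np

def gini(labels):
    # 1 - sum of squared class proportions; 0 means a perfectly pure node
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

print(gini([0, 0, 0, 0]))  # 0.0  (pure)
print(gini([0, 0, 1, 1]))  # 0.5  (maximally mixed for two classes)
```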
Main challenge that recommender systems face with new users.
The Cold Start Problem.
Activation function that outputs the input directly if it’s positive, and zero otherwise.
ReLU.
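In NumPy:

```python
import numpy as np

def relu(z):
    # positive inputs pass through unchanged; negatives become zero
    return np.maximum(0.0, z)

print(relu(np.array([-2.0, 0.0, 3.0])))  # [0. 0. 3.]
```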
What are word embeddings?
Words represented as numerical vectors so that the ones with similar meanings are close in the vector space.
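Closeness is usually measured with cosine similarity; the 3-D vectors below are made up for illustration (real embeddings have hundreds of dimensions):

```python
import numpy as np

def cosine_similarity(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

king = np.array([0.90, 0.80, 0.10])
queen = np.array([0.85, 0.82, 0.12])
banana = np.array([0.10, 0.05, 0.90])
print(cosine_similarity(king, queen))   # close to 1: similar meaning
print(cosine_similarity(king, banana))  # much lower
```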
Harmonic mean of Precision and Recall.
F1-Score.
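As a formula: F1 = 2PR / (P + R). A quick check of why the harmonic mean punishes imbalance between the two:

```python
def f1(precision, recall):
    return 2 * precision * recall / (precision + recall)

print(f1(0.9, 0.5))  # ~0.643, well below the arithmetic mean of 0.7
```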
Name 3 variants of Gradient Descent.
Batch, Stochastic, Mini-Batch.
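A minimal mini-batch sketch for linear regression with MSE loss; batch_size=len(X) recovers Batch GD and batch_size=1 recovers Stochastic GD:

```python
import numpy as np

def minibatch_gd(X, y, lr=0.01, batch_size=16, epochs=50):
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        order = np.random.permutation(len(X))  # reshuffle each epoch
        for start in range(0, len(X), batch_size):
            b = order[start:start + batch_size]
            grad = 2 * X[b].T @ (X[b] @ w - y[b]) / len(b)  # MSE gradient
            w -= lr * grad
    return w
```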
What is dimensionality reduction and why is it useful?
Reducing the number of input features while preserving relevant information. Useful for visualization, noise reduction, faster computation, and avoiding overfitting.
What’s the cost of flattening an image and where is it needed?
Loss of spatial information. Needed between the convolutional and the feedforward layers.
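For example, in NumPy (toy dimensions):

```python
import numpy as np

feature_maps = np.random.rand(8, 8, 32)  # toy conv output: height x width x channels
flat = feature_maps.reshape(-1)          # 1-D vector for the dense layers
print(flat.shape)                        # (2048,) -- the spatial layout is gone
```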
Carefully crafting instructions to guide an LLM's output.
Prompt Engineering.
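A made-up illustration of the same task with and without explicit instructions:

```python
vague = "Summarize this."
engineered = (
    "You are a technical editor. Summarize the text below in exactly three "
    "bullet points, each under 15 words, for a non-expert audience."
)
```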
Which metric can be misleading on imbalanced data, and why?
Accuracy. It can be high even if the model ignores the minority class.
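A quick worked example:

```python
import numpy as np

# 95% of samples are class 0; always predicting 0 scores 95% accuracy
# while never detecting a single minority-class sample.
y_true = np.array([0] * 95 + [1] * 5)
y_pred = np.zeros(100, dtype=int)
print((y_true == y_pred).mean())  # 0.95
```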
Difference between Bagging and Boosting.
Bagging trains models independently in parallel; Boosting trains them sequentially, with each model correcting its predecessors' errors.
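For example, with scikit-learn (toy data; AdaBoost stands in for boosting in general):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier, BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=200, random_state=0)
bag = BaggingClassifier(DecisionTreeClassifier(), n_estimators=50).fit(X, y)  # parallel
boost = AdaBoostClassifier(n_estimators=50).fit(X, y)                         # sequential
```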
How does PCA work?
It finds new dimensions (principal components) that maximize variance in the data.
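For example, with scikit-learn on the 4-feature Iris data:

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

X = load_iris().data
pca = PCA(n_components=2).fit(X)
X2 = pca.transform(X)                  # projected onto the top-2 variance directions
print(pca.explained_variance_ratio_)   # share of variance each component keeps
```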
Regularization technique that randomly ignores some neurons during training to prevent overfitting.
Dropout.
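A minimal sketch of (inverted) dropout in NumPy:

```python
import numpy as np

def dropout(activations, p=0.5, training=True):
    if not training:
        return activations                           # no dropout at inference time
    mask = np.random.rand(*activations.shape) > p    # randomly ignore neurons
    return activations * mask / (1 - p)              # rescale so expected values match

print(dropout(np.ones(10)))
```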
Explain the RAG technique.
Retrieval-Augmented Generation: combining document retrieval with a language model to generate informed, context-aware answers.
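A toy, self-contained sketch: the keyword "retriever" and the generate() stub below are stand-ins for a real vector store and LLM API:

```python
def retrieve(question, documents, top_k=2):
    # toy retriever: rank documents by word overlap with the question
    q = set(question.lower().split())
    ranked = sorted(documents, key=lambda d: -len(q & set(d.lower().split())))
    return ranked[:top_k]

def generate(prompt):
    return f"[an LLM would answer here, given:]\n{prompt}"  # stand-in for a model call

docs = ["RAG retrieves documents first.", "Transformers use attention."]
context = "\n".join(retrieve("What does RAG do first?", docs))
print(generate(f"Context:\n{context}\n\nQuestion: What does RAG do first?"))
```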
How do you interpret the ROC curve and what are the axes?
x = FPR, y = TPR (recall). The closer the curve is to the top-left corner, the better the classifier.
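With scikit-learn (made-up labels and scores):

```python
from sklearn.metrics import roc_auc_score, roc_curve

y_true = [0, 0, 1, 1, 0, 1]
scores = [0.1, 0.4, 0.35, 0.8, 0.2, 0.7]  # predicted probabilities
fpr, tpr, thresholds = roc_curve(y_true, scores)
print(roc_auc_score(y_true, scores))      # area under the curve; 1.0 is perfect
```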
How do L1 and L2 regularization affect linear model coefficients?
L1 pushes some coefficients exactly to zero (implicit feature selection); L2 shrinks all coefficients toward zero without eliminating them.
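The difference is easy to see on toy data where only 3 of 10 features matter:

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, Ridge

X, y = make_regression(n_samples=100, n_features=10, n_informative=3, random_state=0)
print(Lasso(alpha=1.0).fit(X, y).coef_)  # several coefficients exactly 0
print(Ridge(alpha=1.0).fit(X, y).coef_)  # all shrunk, none exactly 0
```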
Difference between agglomerative and divisive clustering.
Agglomerative starts with each point as its own cluster and merges; divisive starts with all points in one cluster and splits it recursively.
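For example, the agglomerative (bottom-up) variant in scikit-learn on toy data (scikit-learn has no divisive counterpart):

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering

X = np.random.rand(50, 2)  # toy 2-D points
# each point starts as its own cluster; the closest pairs merge repeatedly
labels = AgglomerativeClustering(n_clusters=3).fit_predict(X)
```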
Typical CNN architecture for a classification problem.
Convolutional and pooling layers to extract features, followed by fully connected layers and a softmax output layer.
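A minimal sketch in Keras (assuming TensorFlow is installed; 32x32 RGB inputs and 10 classes are arbitrary choices):

```python
from tensorflow.keras import layers, models

model = models.Sequential([
    layers.Input(shape=(32, 32, 3)),
    layers.Conv2D(32, 3, activation="relu"),  # feature extraction
    layers.MaxPooling2D(),
    layers.Conv2D(64, 3, activation="relu"),
    layers.MaxPooling2D(),
    layers.Flatten(),                         # bridge to the dense layers
    layers.Dense(64, activation="relu"),
    layers.Dense(10, activation="softmax"),   # class probabilities
])
```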
What’s the key innovation of Transformers and what is it for?
Self-attention. Allows the model to capture dependencies between tokens in parallel, regardless of distance.
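A minimal single-head sketch in NumPy (random toy weights; real models learn Wq, Wk, Wv):

```python
import numpy as np

def self_attention(X, Wq, Wk, Wv):
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    scores = Q @ K.T / np.sqrt(K.shape[-1])  # all token pairs scored at once
    weights = np.exp(scores) / np.exp(scores).sum(axis=-1, keepdims=True)  # softmax
    return weights @ V

X = np.random.rand(5, 8)                    # 5 tokens, 8-dim embeddings
Wq, Wk, Wv = (np.random.rand(8, 8) for _ in range(3))
print(self_attention(X, Wq, Wk, Wv).shape)  # (5, 8)
```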