Improving Deep Neural Networks Jeopardy Template

Optimization Algorithms

CNN Architectures

Regularization Techniques

Transformer Models

Reinforcement Learning

100

This classic optimization method updates weights using the negative gradient of the loss function.

What is Gradient Descent?
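
Study note: the update rule behind this answer can be sketched in a few lines. The toy loss, learning rate, and step count below are assumptions chosen for illustration, not part of the original board.

```python
def gradient_descent(grad, w0, lr=0.1, steps=100):
    """Repeatedly step against the gradient: w <- w - lr * grad(w)."""
    w = w0
    for _ in range(steps):
        w = w - lr * grad(w)
    return w


# Hypothetical toy loss L(w) = (w - 3)^2, whose gradient is 2 * (w - 3).
w_final = gradient_descent(lambda w: 2.0 * (w - 3.0), w0=0.0)
```

With this loss the iterate approaches the minimizer w = 3, because each step shrinks the distance to 3 by a constant factor.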

100

This foundational CNN architecture, introduced in 1998, was designed for handwritten digit recognition.

What is LeNet?

100

This technique randomly disables neurons during training to reduce overfitting.

What is Dropout?

100

This architecture was introduced in the landmark 2017 paper “Attention Is All You Need.”

What is the Transformer?

100

This learning framework trains agents through rewards and punishments from interactions with an environment.

What is Reinforcement Learning?

200

This variant of gradient descent uses only one training example at a time for updates.

What is Stochastic Gradient Descent (SGD)?

200

This 2012 CNN architecture dramatically improved ImageNet performance using ReLU activations and GPUs.

What is AlexNet?

200

This regularization method adds a penalty to the loss that is proportional to the sum of the squared weights.

What is L2 Regularization (Weight Decay)?

200

This mechanism lets each position in an input sequence weigh every other position in that same sequence when computing its representation.

What is Self-Attention?
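
Study note: a minimal sketch of scaled dot-product self-attention in plain Python. As a simplifying assumption, the learned query, key, and value projections are omitted, so the same toy vectors play all three roles.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(queries, keys, values):
    """Return (outputs, weights) for scaled dot-product attention."""
    d_k = len(keys[0])
    outputs, weights = [], []
    for q in queries:
        # Similarity of this position's query with every key, scaled by sqrt(d_k).
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in keys]
        w = softmax(scores)
        weights.append(w)
        # Output is the attention-weighted average of the value vectors.
        outputs.append(
            [sum(wj * v[c] for wj, v in zip(w, values)) for c in range(len(values[0]))]
        )
    return outputs, weights


# Hypothetical 3-token sequence with 2-dimensional embeddings.
x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
out, attn = self_attention(x, x, x)
```

Each row of `attn` sums to 1, so every output vector is a weighted average over all positions in the sequence.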

200

This component of a reinforcement learning agent maps each state to the action the agent should take.

What is a Policy?

300

This optimizer combines momentum with adaptive learning rates using first and second moment estimates.

What is Adam?
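
Study note: a scalar sketch of the Adam update with bias-corrected first and second moment estimates. The hyperparameter defaults and the toy loss are assumptions for illustration.

```python
import math


def adam_minimize(grad, w0, lr=0.1, beta1=0.9, beta2=0.999, eps=1e-8, steps=500):
    """Minimize a scalar function given its gradient, using Adam updates."""
    w, m, v = w0, 0.0, 0.0
    for t in range(1, steps + 1):
        g = grad(w)
        m = beta1 * m + (1 - beta1) * g        # first moment (momentum-like average)
        v = beta2 * v + (1 - beta2) * g * g    # second moment (average squared gradient)
        m_hat = m / (1 - beta1 ** t)           # bias correction for the zero initialization
        v_hat = v / (1 - beta2 ** t)
        w -= lr * m_hat / (math.sqrt(v_hat) + eps)
    return w


# Hypothetical toy loss L(w) = (w - 3)^2 with gradient 2 * (w - 3).
w_adam = adam_minimize(lambda w: 2.0 * (w - 3.0), w0=0.0)
```

On the first step the bias-corrected ratio has magnitude close to 1, so the parameter moves by roughly `lr` regardless of the gradient's scale.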

300

This architecture introduced very deep networks using small 3×3 convolution filters.

What is VGGNet?

300

This method stops training when validation performance begins to worsen.

What is Early Stopping?

300

This Transformer-based language model introduced bidirectional pretraining for NLP tasks.

What is BERT?

300

This value represents the expected cumulative future reward from a given state.

What is the Value Function?

400

This optimization technique helps accelerate SGD by adding a fraction of the previous update vector.

What is Momentum?

400

This CNN architecture introduced “skip connections” to address vanishing gradients in deep networks.

What is ResNet?

400

This technique applies random flips, rotations, or crops to training images to improve generalization.

What is Image Augmentation?

400

This family of autoregressive Transformer models is known for large-scale text generation.

What is GPT?

400

This popular RL algorithm updates action values using the Bellman equation and temporal-difference learning.

What is Q-Learning?
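
Study note: a sketch of one tabular Q-learning update. The dictionary-based table, state and action names, and the values of `alpha` and `gamma` are hypothetical choices for illustration.

```python
def q_update(q, state, action, reward, next_state, alpha=0.5, gamma=0.9):
    """Apply one temporal-difference update to q[state][action] and return it."""
    best_next = max(q[next_state].values())   # greedy value of the next state
    td_target = reward + gamma * best_next    # Bellman target
    q[state][action] += alpha * (td_target - q[state][action])
    return q[state][action]


# Hypothetical two-state table.
q = {
    "s0": {"left": 0.0, "right": 0.0},
    "s1": {"left": 1.0, "right": 2.0},
}
new_value = q_update(q, "s0", "right", reward=1.0, next_state="s1")
```

Here the target is 1.0 + 0.9 * 2.0 = 2.8, and moving halfway from 0.0 toward it gives an updated value of 1.4.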

500

This optimizer modifies AdaGrad by using an exponentially decaying average of squared gradients.

What is RMSProp?

500

This architecture uses “Inception modules” to process multiple convolution sizes in parallel.

What is GoogLeNet (Inception Network)?

500

This normalization technique stabilizes training by normalizing activations within mini-batches.

What is Batch Normalization?
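
Study note: a sketch of the training-time batch normalization computation for a single feature. The scale `gamma`, shift `beta`, and `eps` defaults are assumptions, and the running statistics used at inference time are left out.

```python
import math


def batch_norm(batch, gamma=1.0, beta=0.0, eps=1e-5):
    """Normalize one feature across a mini-batch, then scale and shift."""
    n = len(batch)
    mean = sum(batch) / n
    var = sum((x - mean) ** 2 for x in batch) / n   # biased (population) variance
    return [gamma * (x - mean) / math.sqrt(var + eps) + beta for x in batch]


# Hypothetical mini-batch of four activations for one feature.
normalized = batch_norm([2.0, 4.0, 6.0, 8.0])
```

The normalized activations have mean close to 0 and variance close to 1 before `gamma` and `beta` rescale them.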

500

This technique injects token-order information into a Transformer's input embeddings.

What are Positional Encodings?
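
Study note: one common scheme is the fixed sinusoidal encoding from the original Transformer paper; learned position embeddings are an alternative. The sketch below assumes that sinusoidal variant, and the sequence length and model width are hypothetical.

```python
import math


def positional_encoding(seq_len, d_model):
    """Sinusoidal encodings: sine on even dimensions, cosine on odd ones."""
    pe = []
    for pos in range(seq_len):
        row = []
        for i in range(d_model):
            # Each sine/cosine pair shares one frequency.
            angle = pos / (10000 ** ((2 * (i // 2)) / d_model))
            row.append(math.sin(angle) if i % 2 == 0 else math.cos(angle))
        pe.append(row)
    return pe


pe = positional_encoding(seq_len=4, d_model=6)
```

These vectors are added to the token embeddings so that otherwise order-blind attention layers can distinguish positions.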

500

This dilemma pits trying new actions against repeating actions already known to be rewarding.

What is the Exploration–Exploitation Tradeoff?