This classic optimization method updates weights using the negative gradient of the loss function.
What is Gradient Descent?
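A minimal sketch of the update rule, assuming a one-dimensional weight and a hypothetical loss f(w) = (w - 3)^2 chosen only for illustration; the function name and defaults are not from any library.

```python
def gradient_descent(grad, w0, lr=0.1, steps=100):
    """Repeatedly step against the gradient: w <- w - lr * grad(w)."""
    w = w0
    for _ in range(steps):
        w = w - lr * grad(w)
    return w


# Hypothetical loss f(w) = (w - 3)^2, whose gradient is 2 * (w - 3).
w_min = gradient_descent(lambda w: 2.0 * (w - 3.0), w0=0.0)
```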
This foundational CNN architecture, introduced in 1998, was designed for handwritten digit recognition.
What is LeNet?
This technique randomly disables neurons during training to reduce overfitting.
What is Dropout?
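One common formulation is "inverted" dropout, sketched here on a plain list of activations. The name `dropout`, the `rng` parameter, and the default drop probability are assumptions for illustration; `p` is expected to be below 1.

```python
import random


def dropout(activations, p=0.5, training=True, rng=random):
    """Inverted dropout: zero each unit with probability p and
    rescale the survivors by 1 / (1 - p) so the expected value is unchanged."""
    if not training or p == 0.0:
        return list(activations)  # no-op at inference time
    keep = 1.0 - p
    return [a / keep if rng.random() < keep else 0.0 for a in activations]
```

At inference time (`training=False`) the activations pass through unchanged, which is why the rescaling is done during training.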
This architecture, based solely on attention mechanisms, was introduced in the landmark 2017 paper “Attention Is All You Need.”
What is the Transformer?
This learning framework trains agents through rewards and punishments from interactions with an environment.
What is Reinforcement Learning?
This variant of gradient descent uses only one training example at a time for updates.
What is Stochastic Gradient Descent (SGD)?
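A rough sketch of single-example updates, assuming a one-feature linear model and squared error; the function name, the toy data, and the hyperparameters are illustrative only.

```python
import random


def sgd_linear(data, lr=0.05, epochs=200, seed=0):
    """Fit y ~ w * x + b, updating on one (x, y) pair at a time."""
    rng = random.Random(seed)
    data = list(data)  # copy so the caller's list is not reordered
    w, b = 0.0, 0.0
    for _ in range(epochs):
        rng.shuffle(data)  # visit the examples in a fresh random order
        for x, y in data:
            err = (w * x + b) - y  # derivative of 0.5 * err^2 w.r.t. the prediction
            w -= lr * err * x
            b -= lr * err
    return w, b


# Toy data generated from y = 2x + 1 (an assumption for the example).
w, b = sgd_linear([(0.0, 1.0), (1.0, 3.0), (2.0, 5.0), (3.0, 7.0)])
```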
This 2012 CNN architecture dramatically improved ImageNet performance using ReLU activations and GPUs.
What is AlexNet?
This regularization method adds a penalty to the loss that is proportional to the sum of the squared weights.
What is L2 Regularization (Weight Decay)?
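As a sketch, the penalty and its effect on a gradient step might look like this, assuming the penalty is written as lam * sum(w^2) (some texts use lam / 2 instead); both function names are hypothetical.

```python
def l2_penalty(weights, lam):
    """Penalty term added to the loss: lam * sum of squared weights."""
    return lam * sum(w * w for w in weights)


def l2_regularized_step(weights, grads, lr, lam):
    """One gradient step on loss + lam * ||w||^2.
    The penalty contributes 2 * lam * w to each gradient, shrinking weights toward zero."""
    return [w - lr * (g + 2.0 * lam * w) for w, g in zip(weights, grads)]
```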
This mechanism lets every position in a sequence weigh all other positions in the same sequence when computing its representation.
What is Self-Attention?
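A simplified sketch of scaled dot-product self-attention on plain lists. It assumes the token vectors themselves serve as queries, keys, and values, so the learned projection matrices used in practice are omitted; the function names are illustrative.

```python
import math


def softmax(scores):
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(x):
    """x: list of token vectors of equal length d.
    Each output row is a weighted average of all rows of x."""
    d = len(x[0])
    out = []
    for q in x:
        # Similarity of this token to every token, scaled by sqrt(d).
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in x]
        weights = softmax(scores)
        out.append([sum(w * v[j] for w, v in zip(weights, x)) for j in range(d)])
    return out
```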
This component in reinforcement learning maps states to actions, deciding which action an agent should take.
What is a Policy?
This optimizer combines momentum with adaptive learning rates using first and second moment estimates.
What is Adam?
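A scalar sketch of the update with bias-corrected moment estimates. The default hyperparameters other than `lr` follow commonly quoted values; the function name and the test loss are assumptions.

```python
import math


def adam_minimize(grad, w0, lr=0.1, beta1=0.9, beta2=0.999, eps=1e-8, steps=500):
    """Minimize a 1-D function given its gradient, using Adam-style updates."""
    w, m, v = w0, 0.0, 0.0
    for t in range(1, steps + 1):
        g = grad(w)
        m = beta1 * m + (1 - beta1) * g        # first moment (running mean of gradients)
        v = beta2 * v + (1 - beta2) * g * g    # second moment (running mean of squared gradients)
        m_hat = m / (1 - beta1 ** t)           # bias correction for the zero initialization
        v_hat = v / (1 - beta2 ** t)
        w -= lr * m_hat / (math.sqrt(v_hat) + eps)
    return w


# Hypothetical loss f(w) = (w - 3)^2 with gradient 2 * (w - 3).
w_adam = adam_minimize(lambda w: 2.0 * (w - 3.0), w0=0.0)
```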
This architecture introduced very deep networks using small 3×3 convolution filters.
What is VGGNet?
This method stops training when validation performance begins to worsen.
What is Early Stopping?
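One way to sketch the rule, assuming a list of per-epoch validation losses and a hypothetical `patience` parameter (how many non-improving epochs to tolerate before stopping):

```python
def early_stopping_epoch(val_losses, patience=2):
    """Return the index of the best epoch seen before training would stop."""
    best, best_epoch, waited = float("inf"), -1, 0
    for epoch, loss in enumerate(val_losses):
        if loss < best:
            best, best_epoch, waited = loss, epoch, 0
        else:
            waited += 1
            if waited >= patience:
                break  # validation loss has not improved for `patience` epochs
    return best_epoch
```

In practice the model weights from the best epoch are usually restored after stopping.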
This Transformer-based language model introduced bidirectional pretraining for NLP tasks.
What is BERT?
This function gives the expected cumulative future reward from a given state.
What is the Value Function?
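The quantity being averaged is the discounted return of a trajectory. A small sketch, assuming a discount factor `gamma` (not mentioned in the clue) and a single sampled reward sequence:

```python
def discounted_return(rewards, gamma=0.9):
    """Sum of rewards, each discounted by gamma per step into the future.
    The value of a state is the expectation of this over trajectories from that state."""
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g
```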
This optimization technique helps accelerate SGD by adding a fraction of the previous update vector.
What is Momentum?
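A scalar sketch of the classical momentum update, where `mu` is the fraction of the previous update that is carried forward; the names and the test loss are assumptions.

```python
def momentum_minimize(grad, w0, lr=0.1, mu=0.9, steps=200):
    """Gradient descent with momentum on a 1-D function."""
    w, velocity = w0, 0.0
    for _ in range(steps):
        # New update = fraction of the previous update + current gradient step.
        velocity = mu * velocity - lr * grad(w)
        w += velocity
    return w


# Hypothetical loss f(w) = (w - 3)^2 with gradient 2 * (w - 3).
w_mom = momentum_minimize(lambda w: 2.0 * (w - 3.0), w0=0.0)
```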
This CNN architecture introduced “skip connections” to address vanishing gradients in deep networks.
What is ResNet?
This data augmentation technique flips, rotates, or crops images to improve generalization.
What is Image Augmentation?
This family of autoregressive Transformer models is known for large-scale text generation.
What is GPT?
This popular RL algorithm updates action values using the Bellman equation and temporal-difference learning.
What is Q-Learning?
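A tabular sketch of a single update, assuming the action values are stored in a dict of lists; `alpha` (learning rate) and `gamma` (discount factor) are the usual symbols, and the function name is hypothetical.

```python
def q_update(q, state, action, reward, next_state, alpha=0.1, gamma=0.9):
    """One temporal-difference update of Q(state, action), in place.
    q maps each state to a list of action values."""
    td_target = reward + gamma * max(q[next_state])  # Bellman optimality target
    q[state][action] += alpha * (td_target - q[state][action])
    return q[state][action]
```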
This optimizer modifies AdaGrad by using an exponentially decaying average of squared gradients.
What is RMSProp?
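A scalar sketch of the update, assuming a decay rate of 0.9 and a small `eps` for numerical safety; the function name and the test loss are illustrative.

```python
import math


def rmsprop_minimize(grad, w0, lr=0.01, decay=0.9, eps=1e-8, steps=1000):
    """Minimize a 1-D function with RMSProp-style updates."""
    w, avg_sq = w0, 0.0
    for _ in range(steps):
        g = grad(w)
        # Exponentially decaying average of squared gradients.
        avg_sq = decay * avg_sq + (1 - decay) * g * g
        w -= lr * g / (math.sqrt(avg_sq) + eps)
    return w


# Hypothetical loss f(w) = (w - 3)^2 with gradient 2 * (w - 3).
w_rms = rmsprop_minimize(lambda w: 2.0 * (w - 3.0), w0=0.0)
```

Because each step is normalized by the recent gradient magnitude, the iterate ends up hovering near the minimum at a scale set by `lr` rather than converging exactly.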
This architecture uses “Inception modules” to process multiple convolution sizes in parallel.
What is GoogLeNet (Inception Network)?
This normalization technique stabilizes training by normalizing activations within mini-batches.
What is Batch Normalization?
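A sketch of the training-time computation for a single feature, assuming scalar scale and shift parameters `gamma` and `beta`; the running statistics used at inference are left out.

```python
import math


def batch_norm(batch, gamma=1.0, beta=0.0, eps=1e-5):
    """Normalize one feature across a mini-batch, then scale and shift."""
    mean = sum(batch) / len(batch)
    var = sum((x - mean) ** 2 for x in batch) / len(batch)  # biased (population) variance
    return [gamma * (x - mean) / math.sqrt(var + eps) + beta for x in batch]
```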
These vectors inject sequence-order information into a Transformer's input embeddings.
What are Positional Encodings?
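One well-known variant is the fixed sinusoidal encoding, sketched below; learned positional embeddings are another option. The function name is an assumption.

```python
import math


def positional_encoding(seq_len, d_model):
    """Sinusoidal encodings: sine on even dimensions, cosine on odd ones,
    with wavelengths growing geometrically across dimensions."""
    pe = [[0.0] * d_model for _ in range(seq_len)]
    for pos in range(seq_len):
        for i in range(0, d_model, 2):
            angle = pos / (10000 ** (i / d_model))
            pe[pos][i] = math.sin(angle)
            if i + 1 < d_model:
                pe[pos][i + 1] = math.cos(angle)
    return pe
```

The resulting rows are typically added elementwise to the token embeddings.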
This tradeoff balances trying new actions against exploiting actions already known to be rewarding.
What is the Exploration–Exploitation Tradeoff?
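A simple way to handle this tradeoff is the epsilon-greedy rule, sketched here; it is one strategy among several, and the function name and `rng` parameter are assumptions.

```python
import random


def epsilon_greedy(q_values, epsilon=0.1, rng=random):
    """With probability epsilon pick a random action (explore);
    otherwise pick the action with the highest estimated value (exploit)."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda a: q_values[a])
```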