Technical RLHF
Evil GPT-2
Refer to the Diagram
Constitutional AI
RLHF Limitations
100

What is the main goal of using RLHF with large language models?

Aligning model behavior with human preferences

100

What does the "coherence coach" correspond to in the Hugging Face piece?

The KL divergence penalty applied during RL to prevent the model from drifting too far from the base model.

100

What is happening in step 1?

Supervised fine-tuning using human-written demonstrations to make the base model behave more like an assistant.
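
As a rough illustration of what step 1 amounts to (a sketch under assumed tooling, not the exact pipeline from the readings): supervised fine-tuning is just next-token cross-entropy on human-written demonstrations.

```python
# Minimal SFT sketch (assumptions: GPT-2 via Hugging Face transformers; the demo text is made up).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)

# One human-written demonstration: a prompt plus an ideal assistant response (placeholder text).
demo = "User: What is RLHF?\nAssistant: RLHF fine-tunes a language model using human preference feedback."
batch = tokenizer(demo, return_tensors="pt")

# Passing labels=input_ids makes the model return the next-token cross-entropy loss.
loss = model(**batch, labels=batch["input_ids"]).loss
loss.backward()
optimizer.step()
```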

100

Why might CAI help with outer misalignment?

AI judges may be harder to deceive, more scalable, and may avoid human-specific biases.

100

What is one challenge of scaling RLHF to more capable models?

Human feedback becomes increasingly costly and less reliable; the feedback bottleneck makes scaling difficult.

200

What role does KL divergence play in RLHF fine-tuning?

It penalizes the policy for drifting too far from the base model, helping maintain coherent and safe outputs.
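
A minimal sketch of that shaping (my own toy numbers; beta and the log-probs are illustrative, not from the readings): the reward-model score is reduced in proportion to how far the policy's per-token log-probs drift from the frozen base model's.

```python
# Sketch of the KL-shaped reward used during RLHF fine-tuning (illustrative values).
import torch

beta = 0.02                                           # KL penalty coefficient (assumed)
logprobs_policy = torch.tensor([-1.2, -0.7, -2.1])    # per-token log p_policy(token)
logprobs_base   = torch.tensor([-1.0, -0.9, -1.5])    # per-token log p_base(token), frozen model
reward_model_score = torch.tensor(0.8)                # scalar score from the reward model

# Approximate the KL term by the summed log-ratio; penalize drift from the base model.
kl_estimate = (logprobs_policy - logprobs_base).sum()
shaped_reward = reward_model_score - beta * kl_estimate
print(float(shaped_reward))   # lower when the policy strays far from the base model
```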

200

What does the "values coach" correspond to in the Hugging Face piece?

The reward model trained on human preference data.

200

What is happening in step 2?

Humans compare model outputs, the preferences are converted to scalar scores (e.g., via Elo), and those scores are used to train a reward model.
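
One common way such comparisons train the reward model (a generic Bradley-Terry-style sketch, closely related to Elo-style scoring; not necessarily the exact recipe in the readings):

```python
# Pairwise preference loss sketch (Bradley-Terry style). The two scores are placeholders
# for what the reward model would assign to (prompt + chosen) and (prompt + rejected).
import torch
import torch.nn.functional as F

score_chosen   = torch.tensor(1.3, requires_grad=True)   # score of the human-preferred response
score_rejected = torch.tensor(0.4, requires_grad=True)   # score of the rejected response

# loss = -log sigmoid(chosen - rejected): minimized when the chosen response scores much higher.
loss = -F.logsigmoid(score_chosen - score_rejected)
loss.backward()
print(float(loss))
```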

200

Why might CAI help with inner misalignment?

AI-generated critiques allow many more edge cases to be explored cheaply, potentially avoiding unintended behaviors like the Bing chatbot's love confessions.

200

What kind of feedback signal can cause unintended behavior?

Sparse or underspecified rewards that don't capture intent clearly.

300

What is the goal of the reward model in RLHF?

To map text (e.g., prompt + response) to a scalar score representing human preference.
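
A minimal sketch of that mapping, assuming a Hugging Face GPT-2 backbone with a 1-label head (the prompt text is a placeholder):

```python
# Reward model sketch: language-model backbone plus a scalar head via a 1-label classifier.
from transformers import AutoModelForSequenceClassification, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
reward_model = AutoModelForSequenceClassification.from_pretrained("gpt2", num_labels=1)
reward_model.config.pad_token_id = tokenizer.eos_token_id   # GPT-2 has no pad token by default

text = "Prompt: Explain RLHF.\nResponse: RLHF aligns a model with human preferences."
inputs = tokenizer(text, return_tensors="pt")
score = reward_model(**inputs).logits.squeeze()   # one scalar preference score for prompt + response
print(float(score))
```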

300

Describe the job of the two "coaches" involved in the RLHF process.

A values coach trains the model to produce outputs rated highly by humans. A coherence coach keeps the model generating valid, fluent text.

300

What is happening in step 3?

Using RL (e.g., PPO) to fine-tune the model based on the reward model, usually with a KL penalty to preserve coherence.
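
A toy, runnable sketch of the step-3 idea (a REINFORCE-style update over a 4-token "vocabulary" standing in for PPO over a full language model; the reward values are made up):

```python
# Toy step-3 sketch: a one-step "policy" over 4 tokens is nudged toward the token the
# (made-up) reward favors, with a KL penalty toward a frozen copy of itself.
import torch

torch.manual_seed(0)
logits = torch.zeros(4, requires_grad=True)        # trainable policy logits
base_logits = torch.zeros(4)                       # frozen base model logits
fake_reward = torch.tensor([0.1, 0.2, 1.0, 0.0])   # made-up reward per token choice
beta = 0.1
opt = torch.optim.Adam([logits], lr=0.1)

for _ in range(200):
    dist = torch.distributions.Categorical(logits=logits)
    action = dist.sample()
    logprob = dist.log_prob(action)
    base_logprob = torch.distributions.Categorical(logits=base_logits).log_prob(action)
    shaped = (fake_reward[action] - beta * (logprob - base_logprob)).detach()  # KL-penalized reward
    loss = -shaped * logprob                       # REINFORCE surrogate (PPO adds clipping on top)
    opt.zero_grad()
    loss.backward()
    opt.step()

print(torch.softmax(logits, dim=0))   # mass shifts toward the high-reward token, tempered by KL
```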

300

What are the two stages of CAI?

(1) Supervised fine-tuning from self-critiqued outputs based on a constitution, (2) RL using AI-generated preferences instead of human comparisons.
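
A rough sketch of stage 1, with `generate` as a hypothetical stand-in for any LLM sampling call and the constitutional principle abbreviated:

```python
# Sketch of CAI stage 1 (critique-and-revise). `generate` is a hypothetical stand-in
# for a real LLM sampling call; the principle string is an abbreviated constitution entry.
def generate(prompt: str) -> str:
    """Hypothetical LLM call; replace with a real model or API."""
    raise NotImplementedError

def cai_stage1_example(user_prompt: str, principle: str) -> tuple[str, str]:
    draft = generate(user_prompt)
    critique = generate(f"Critique the response below using the principle '{principle}'.\n{draft}")
    revision = generate(f"Rewrite the response to address the critique.\n{draft}\nCritique: {critique}")
    return user_prompt, revision   # (prompt, revised response) pairs become SFT data for stage 1
```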

300

What is one fundamental limitation of RLHF?

Multiple correct answers:

Humans can't evaluate performance on difficult tasks well.

Humans can be misled, so their evaluations can be gamed.

There is an inherent cost/quality tradeoff when collecting human feedback.

RLHF suffers from a tradeoff between the richness and efficiency of feedback types.

An individual human's values are difficult to represent with a reward function.

A single reward function cannot represent a diverse society of humans.

Reward models can misgeneralize to be poor reward proxies, even from correctly-labeled training data.

Optimizing for an imperfect reward proxy leads to reward hacking.

Policies can perform poorly in deployment even if rewards seen during training were perfectly correct.

Optimal RL agents tend to seek power.

400

What algorithm is most commonly used for RLHF policy optimization, and why?

Proximal Policy Optimization (PPO), because it is a mature algorithm whose trust-region-style (clipped) updates are stable and scale to large models.
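
For reference, PPO's clipped surrogate loss looks roughly like this (generic formula; the ratio and advantage numbers are made up):

```python
# PPO clipped surrogate loss sketch (illustrative values; epsilon = 0.2).
import torch

ratio = torch.tensor([0.9, 1.4, 1.05])       # pi_new(a|s) / pi_old(a|s) per sampled token
advantage = torch.tensor([0.5, 0.5, -0.3])   # advantage estimates
eps = 0.2

unclipped = ratio * advantage
clipped = torch.clamp(ratio, 1 - eps, 1 + eps) * advantage
loss = -torch.minimum(unclipped, clipped).mean()   # clipping keeps updates near the old policy
print(float(loss))
```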

400

In the GPT-2 case study, what went wrong that caused the model to produce "maximally bad output"?

A bug introduced during a code refactor flipped the sign of the reward function along with the KL penalty, so the model was optimized to generate coherent but maximally human-dispreferred outputs.
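
One way to picture the net effect (an illustration only, not the actual training code):

```python
# Illustration only: a stray negation in the shaped reward means the optimizer chases
# the *lowest* human-preference score while the model still ends up constrained toward
# fluent, base-model-like text.
def shaped_reward_intended(rm_score: float, kl: float, beta: float = 0.02) -> float:
    return rm_score - beta * kl

def shaped_reward_buggy(rm_score: float, kl: float, beta: float = 0.02) -> float:
    return -rm_score - beta * kl   # rough net effect of the sign flips described above
```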

400

What happens when the model over-optimizes the reward model?

Reward hacking or overoptimization—the model learns to exploit the reward function in unintended ways.
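
A toy illustration of the failure mode (entirely made-up proxy and responses):

```python
# Toy reward-hacking illustration: the proxy rewards the word "thanks", so a degenerate
# response that just repeats it outscores a genuinely helpful one.
def proxy_reward(response: str) -> int:
    return response.lower().count("thanks")

helpful = "Initialize the variable before use and the crash goes away. Thanks for asking!"
hacked = "thanks " * 20

print(proxy_reward(helpful), proxy_reward(hacked))   # the exploit wins under the proxy
```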

400

How do the two stages relate to the RLHF process?

Stage 1 mirrors supervised fine-tuning, Stage 2 mirrors reward model training and RL, but with AI doing the rating.

400

What is the feedback bottleneck problem?

As models become more capable, the cost and effort to supervise them reliably increases dramatically.

500

Why are pairwise comparisons preferred over scalar ratings in human annotation? How is it converted into a scalar for training?

Scalar ratings are inconsistent and noisy across annotators; comparisons provide more stable and regularized training data.

Using comparison tasks (e.g., A vs B), converting preferences into scalar scores via systems like Elo, which are then used to train a reward model.
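
A minimal Elo-style update sketch (generic formula; the K-factor and starting scores are assumptions, and the articles' exact conversion may differ):

```python
# Minimal Elo-style update (K-factor of 32 and starting scores of 1000 are assumed).
def elo_update(winner: float, loser: float, k: float = 32.0) -> tuple[float, float]:
    expected_win = 1.0 / (1.0 + 10 ** ((loser - winner) / 400.0))
    return winner + k * (1.0 - expected_win), loser - k * (1.0 - expected_win)

# Two responses start at 1000; repeated human preferences for A push the scores apart.
a, b = 1000.0, 1000.0
for _ in range(5):
    a, b = elo_update(a, b)
print(a, b)   # these per-response scalars can then supervise the reward model
```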

500

What would happen if the same mistake were made when fine-tuning a Wikipedia assistant?

It would generate fluent text that intentionally violates as many style guidelines as possible.

500

What other data can be used in step 1 besides purpose-built demonstrations?

Curated high-quality data (e.g., from government or educational websites); the Hugging Face article calls this augmented data.

500

Why use critiques instead of revisions in CAI?

Critiques improve transparency and are easier to debug, though they may not affect performance in larger models.

500

What is mode collapse?

Models overfit to narrow output behaviors (e.g., always rhyming in poems, or expressing a narrow range of political views).

