Transformer Architecture
Misalignment
Threat Models
Control / Scalable Oversight
Mechanistic Interpretability
100

In a standard neural network, this learned parameter on a connection determines which input patterns a neuron responds to.

What is a weight?

100

When an AI finds a clever but unintended way to satisfy its reward signal.

What is reward hacking / specification gaming?

100

This statement holds that an AI can be highly intelligent while pursuing essentially any goal, so intelligence and values are independent.

What is the orthogonality thesis?

100

The behavior of an AI model that involves trying to gain power, often by pretending to be safe.

What is scheming?

100

This technique trains a small network to reconstruct a larger model’s activations through a sparse set of features, making them easier to interpret than the dense originals.

What is a sparse autoencoder (SAE)?

200

This quantitative measure calculates the gap between a network's current output and the correct output.

What is a cost function?
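
As a rough illustration of the card above, here is a minimal Python sketch that assumes mean squared error as the cost. The card does not name a specific cost function, and `mse_cost` is a hypothetical helper name chosen for this example.

```python
def mse_cost(outputs, targets):
    """Mean squared error between a network's outputs and the correct outputs.

    Assumption: squared error is used as the cost; other choices
    (for example cross-entropy) are equally valid cost functions.
    """
    if len(outputs) != len(targets):
        raise ValueError("outputs and targets must have the same length")
    return sum((o - t) ** 2 for o, t in zip(outputs, targets)) / len(outputs)


# A network that outputs exactly the right answer has zero cost.
perfect = mse_cost([0.0, 1.0, 0.0], [0.0, 1.0, 0.0])

# A confident wrong answer is penalized with a larger cost.
wrong = mse_cost([1.0, 0.0, 0.0], [0.0, 1.0, 0.0])
```

The larger the gap between the current output and the correct output, the larger the value returned, which is what training then tries to reduce.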

200

This model appears safe and helpful during training but pursues a different goal once deployed.

What is a deceptively aligned model?

200

Control of powerful AI by a tiny few could cause this.

What is concentration of power?

200

Unlike AI alignment, which tries to make a model want the right things, AI control aims for this.

What is ensuring safety even if the model is scheming?

200

This technique trains a simple linear classifier on a larger model’s internal activations to predict some property, testing whether that property is encoded there.

What is a linear probe?

300

When scientists peek inside a neural network trained to recognize handwritten numbers, they don't find clean shapes like circles or lines. Instead, they find this.

What is messy, almost random-looking patterns that somehow still work?

300

This is when an AI's training objective doesn't fully capture what humans actually want, meaning even a perfectly optimized model could behave in harmful or unintended ways.

What is outer alignment failure?

300

Similar to the Cold War nuclear arms race, this describes nations and companies cutting safety corners to avoid being left behind in AI development.

What are race dynamics?

300

This approach tries to address the limitations of RLHF by using a model of lesser capability to guide a model of greater capability.

What is weak-to-strong generalization?

300

The link between the features “Dallas is in Texas” and “the capital of Texas is Austin” can be described as this.

What is a circuit?

400

After a neural network makes a wrong guess, this process figures out exactly which parameters to tweak — and by how much — so the next guess is a little better.

What is backpropagation?
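
To make the card above concrete, here is a minimal Python sketch of the idea for a single linear neuron with a squared-error loss. This is a simplifying assumption: backpropagation in a real network applies the same chain rule through many layers. The names `backprop_step` and `lr` (learning rate) are hypothetical.

```python
def backprop_step(w, b, x, target, lr=0.1):
    """One gradient step for a single linear neuron y = w * x + b."""
    y = w * x + b              # forward pass: the network's guess
    error = y - target         # how far off the guess is
    loss = error ** 2          # squared-error cost for this example
    grad_w = 2 * error * x     # chain rule: dL/dy * dy/dw
    grad_b = 2 * error         # chain rule: dL/dy * dy/db
    # Nudge each parameter against its gradient, scaled by the learning rate.
    return w - lr * grad_w, b - lr * grad_b, loss


w, b = 0.0, 0.0
losses = []
for _ in range(20):
    w, b, loss = backprop_step(w, b, x=1.5, target=3.0)
    losses.append(loss)
```

Each gradient says which direction to move a parameter and how strongly it affects the loss, so the recorded losses shrink from one step to the next.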

400

This failure mode occurs when a model learns the correct behavior in training but pursues a different objective in a new environment.

Hint: For example, an agent trained to reach the cheese in a maze may actually learn "go right" and then fail completely when the cheese is placed on the left.

What is goal misgeneralization?

400

The tendency for an AI to seek sub-goals like self-preservation, resource acquisition, and power, because they help with almost any objective.

What is instrumental convergence?

400

This type of AI is capable enough to substantially reduce risk from subsequent AI systems while still being controllable with black-box techniques.

What is transformatively useful AI?

400

This idea suggests a shared abstract space where meanings exist and where thinking can happen before being translated into specific languages.

What is conceptual universality?

500

This mechanism allows word vectors to exchange information, refining their individual meanings based on the surrounding context.

What is an attention mechanism?
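
The card above can be sketched in a few lines of Python. This is a simplified, single-head version of scaled dot-product self-attention that assumes the word vectors serve directly as queries, keys, and values; real transformers first apply learned projection matrices, which are omitted here. The function names and the toy vectors are hypothetical.

```python
import math


def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(vectors):
    """Update each vector as a weighted average of all vectors in the context."""
    d = len(vectors[0])
    updated = []
    for q in vectors:
        # Relevance of every vector k to the current vector q (scaled dot product).
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in vectors]
        weights = softmax(scores)
        # Blend the context into this vector according to the weights.
        updated.append([sum(w * v[i] for w, v in zip(weights, vectors)) for i in range(d)])
    return updated


words = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # toy 2-dimensional "word vectors"
contextual = self_attention(words)
```

Each output vector is a blend of every input vector, weighted by how strongly the vectors match, which is how surrounding context refines an individual word's meaning.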

500

This is the phenomenon in which models, when presented with evidence that their values are about to be modified, sometimes take actions to preserve those values.

What is alignment faking / self-preserving behavior?

500

This occurs when AIs intentionally cheat their evaluations, much as Volkswagen rigged emissions tests.

What is deception?

500

This type of adversarial testing has a red team try to subvert a given safety protocol in order to expose its weaknesses.

What is a control evaluation?

500

This problem in mechanistic interpretability limits its practical applications with respect to compute, data, and keeping humans in the loop.

What is scalability?
