In a standard neural network, this learned number on each connection sets how strongly a neuron responds when its input lines up with a specific pattern.
What is a weight?
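Example: a minimal sketch of a neuron's weights acting as a pattern detector. The four-pixel input and the weight values are hypothetical, chosen only for illustration.

```python
def neuron_activation(weights, inputs, bias=0.0):
    """Weighted sum of inputs: large when the input lines up with the weight pattern."""
    return sum(w * x for w, x in zip(weights, inputs)) + bias

# Hypothetical detector: positive weights on the first two pixels,
# negative weights on the last two.
weights = [1.0, 1.0, -1.0, -1.0]

matching = [1.0, 1.0, 0.0, 0.0]    # bright where the weights are positive
mismatched = [0.0, 0.0, 1.0, 1.0]  # bright where the weights are negative

print(neuron_activation(weights, matching))    # 2.0
print(neuron_activation(weights, mismatched))  # -2.0
```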
When an AI finds a clever but unintended way to satisfy its reward signal.
What is reward hacking / specification gaming?
This statement holds that an AI can be highly intelligent while pursuing essentially any goal, so intelligence and values are independent.
What is the orthogonality thesis?
The behavior of an AI model that involves trying to gain power, often by pretending to be safe.
What is scheming?
This technique trains a separate network to reconstruct a larger model’s activations from a sparse set of features, so the dense activations become easier to interpret.
What is a sparse autoencoder (SAE)?
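Example: a rough sketch of the objective a sparse autoencoder is usually trained on, namely reconstruction error plus an L1 penalty on the feature activations. The weights below are hand-picked and hypothetical; a real SAE would learn them by gradient descent over many activations.

```python
def relu(values):
    return [max(0.0, v) for v in values]

def matvec(matrix, vector):
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]

def sae_loss(x, w_enc, b_enc, w_dec, l1_coeff):
    """Return (features, reconstruction, loss) for one activation vector x."""
    features = relu([z + b for z, b in zip(matvec(w_enc, x), b_enc)])
    reconstruction = matvec(w_dec, features)
    mse = sum((r - a) ** 2 for r, a in zip(reconstruction, x)) / len(x)
    sparsity = sum(abs(f) for f in features)  # L1 penalty pushes features toward zero
    return features, reconstruction, mse + l1_coeff * sparsity

# Hypothetical 2-d activation encoded into 3 candidate features.
w_enc = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
b_enc = [0.0, 0.0, 0.0]
w_dec = [[1.0, 0.0, -1.0], [0.0, 1.0, 0.0]]

features, reconstruction, loss = sae_loss([2.0, 0.0], w_enc, b_enc, w_dec, l1_coeff=0.1)
print(features)        # only the first feature is active
print(reconstruction)  # matches the input
print(loss)
```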
This quantitative measure calculates the gap between a network's current output and the correct output.
What is a cost function?
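Example: one common choice of cost is the sum of squared differences between the output and the target. This is a sketch of that one choice, not the only possible cost; the output and target values are made up.

```python
def squared_error_cost(output, target):
    """Sum of squared gaps between the network's output and the correct output."""
    return sum((o - t) ** 2 for o, t in zip(output, target))

# Hypothetical 3-class example: the correct answer is class 1.
output = [0.1, 0.7, 0.2]
target = [0.0, 1.0, 0.0]

cost = squared_error_cost(output, target)
perfect = squared_error_cost(target, target)
print(cost)     # small but nonzero: the guess is close, not exact
print(perfect)  # 0.0
```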
This model appears safe and helpful during training but pursues a different goal once deployed.
What is a deceptively aligned model (a sleeper agent)?
Control of powerful AI by a tiny few could cause this.
What is concentration of power?
Unlike AI alignment, which tries to make a model's goals trustworthy, AI control aims for this.
What is ensuring safety even if the model is scheming?
This technique trains a simple linear classifier to predict some property from a larger model’s internal activations.
What is a linear probe?
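Example: a minimal sketch of a probe as logistic regression. The "activations" here are synthetic and hypothetical: three numbers per example, where only the middle one carries the property. A real probe would read activations from an actual model and be scored on held-out data.

```python
import math
import random

random.seed(0)

def make_activation(label):
    # Hypothetical 3-d activation: only dimension 1 encodes the property.
    return [random.gauss(0, 1), random.gauss(2 * label - 1, 0.5), random.gauss(0, 1)]

labels = [i % 2 for i in range(200)]
activations = [make_activation(y) for y in labels]

def predict(weights, bias, x):
    z = sum(w * xi for w, xi in zip(weights, x)) + bias
    return 1.0 / (1.0 + math.exp(-z))

weights, bias, lr = [0.0, 0.0, 0.0], 0.0, 0.1
for _ in range(50):
    for x, y in zip(activations, labels):
        error = predict(weights, bias, x) - y  # gradient of the log loss w.r.t. z
        weights = [w - lr * error * xi for w, xi in zip(weights, x)]
        bias -= lr * error

correct = sum((predict(weights, bias, x) > 0.5) == bool(y)
              for x, y in zip(activations, labels))
accuracy = correct / len(labels)  # training accuracy only
print(accuracy, weights)
```

If the probe reaches high accuracy, that suggests the property is linearly readable from the activations; it does not by itself show that the model uses it.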
When scientists peek inside a neural network trained to recognize handwritten digits, they don't find clean shapes like circles or lines. Instead, they find this.
What is messy, almost random-looking patterns that somehow still work?
This is when an AI's training objective doesn't fully capture what humans actually want, meaning even a perfectly optimized model could behave in harmful or unintended ways.
What is outer alignment failure?
Similar to the Cold War nuclear arms race, this describes nations and companies cutting safety corners to avoid being left behind in AI development.
What are race dynamics?
This approach tries to address the limitations of RLHF by using a model of lesser capability to guide a model of greater capability.
What is weak-to-strong generalization?
The connection of the features “Dallas is in Texas” and “the capital of Texas is Austin” can be described as this.
What is a circuit?
After a neural network makes a wrong guess, this process figures out exactly which parameters to tweak — and by how much — so the next guess is a little better.
What is backpropagation?
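Example: a sketch of backpropagation on a deliberately tiny, hypothetical network with two weights and no nonlinearity (y = w2 * (w1 * x)). The chain rule is applied from the loss backward, and one small step against the gradient lowers the loss.

```python
def forward(w1, w2, x, target):
    h = w1 * x                  # hidden value
    y = w2 * h                  # network output
    loss = (y - target) ** 2    # squared-error cost
    return h, y, loss

def backward(w1, w2, x, target):
    """Chain rule from the loss back to each weight."""
    h, y, _ = forward(w1, w2, x, target)
    dloss_dy = 2 * (y - target)
    dloss_dw2 = dloss_dy * h    # y depends on w2 through h
    dloss_dh = dloss_dy * w2    # pass the error back to the hidden value
    dloss_dw1 = dloss_dh * x    # h depends on w1 through x
    return dloss_dw1, dloss_dw2

x, target = 1.5, 2.0
w1, w2 = 0.5, -1.0
learning_rate = 0.01

loss_before = forward(w1, w2, x, target)[2]
grad_w1, grad_w2 = backward(w1, w2, x, target)

# Nudge each weight a little in the direction that reduces the loss.
w1_new = w1 - learning_rate * grad_w1
w2_new = w2 - learning_rate * grad_w2
loss_after = forward(w1_new, w2_new, x, target)[2]

print(loss_before, loss_after)
```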
This failure mode occurs when a model learns the correct behavior in training but pursues a different objective in a new environment.
Hint: For example, an agent trained to reach a cheese in a maze that actually learned "go right," and fails completely when the cheese is placed on the left.
What is goal misgeneralization?
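Example: the maze hint can be sketched as a one-dimensional corridor. The corridor layout, the function names, and the "go right" policy are all hypothetical stand-ins for a learned agent.

```python
def go_right_policy(position, cheese):
    # The proxy the agent actually learned: it ignores where the cheese is.
    return +1

def run_episode(policy, cheese, start=2, length=5, max_steps=10):
    """Return True if the agent reaches the cheese within max_steps."""
    position = start
    for _ in range(max_steps):
        if position == cheese:
            return True
        step = policy(position, cheese)
        position = min(length - 1, max(0, position + step))
    return position == cheese

# Training layout: the cheese is always at the right end.
print(run_episode(go_right_policy, cheese=4))  # True
# New layout: the cheese is on the left, and "go right" never finds it.
print(run_episode(go_right_policy, cheese=0))  # False
```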
The tendency for an AI to seek sub-goals like self-preservation, resource acquisition, and power, because they help with almost any objective.
What is instrumental convergence?
This type of AI is capable enough to substantially reduce risk from subsequent AI systems while still remaining controllable by black-box techniques.
What is transformatively useful AI?
This idea suggests a shared abstract space where meanings exist and where thinking can happen before being translated into specific languages.
What is conceptual universality?
This mechanism allows word vectors to exchange information, refining their individual meanings based on the surrounding context.
What is an attention mechanism?
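Example: a simplified sketch of scaled dot-product self-attention. As a labeled assumption, each word vector serves as its own query, key, and value; real transformers apply learned projection matrices first. The three 2-d vectors are made up.

```python
import math

def softmax(scores):
    top = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(vectors):
    """Each vector becomes a weighted average of all vectors in the context."""
    dim = len(vectors[0])
    updated = []
    for query in vectors:
        scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
                  for key in vectors]
        weights = softmax(scores)  # how much this word attends to each word
        updated.append([sum(w * v[i] for w, v in zip(weights, vectors))
                        for i in range(dim)])
    return updated

vectors = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
for row in self_attention(vectors):
    print(row)
```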
In this phenomenon, models presented with evidence that their values were about to be corrected would sometimes take actions to preserve those values.
What is alignment faking / self-preserving behavior?
This occurs when AIs intentionally cheat their evaluations, much as Volkswagen rigged emissions tests.
What is deception?
This type of adversarial testing uses a red team to demonstrate the lack of safety or presence of issues in a given protocol.
What is a control evaluation?
This problem in mechanistic interpretability limits its practical applications with respect to compute, data, and keeping humans in the loop.
What is scalability?