Alignment Jeopardy

Alignment Agendas

MATS 9.0

Philosophy and EA

AI History

np.random.rand()

100

'SolidGoldMagikarp' and ' petertodd' are famous examples of these anomalous strings in GPT tokenizer vocabularies that, when prompted, cause language models to produce bizarre, unpredictable, or broken outputs.

What are glitch tokens?

100

This is the name of the MATS stream that regularly trains together in the gym.

What is Team Shard?

100

Toby Ord marks the beginning of "the Precipice," humanity's uniquely dangerous period, with this event, which he argues first gave humanity unprecedented destructive capability without the wisdom to ensure it wouldn't be used.

What is the Trinity Test?

100

This 2017 paper introduced the transformer architecture and attention mechanism that unlocked modern LLMs.

What is "Attention Is All You Need"?

100

According to Elon Musk on the Dwarkesh podcast, the secret to making an AI aligned is to imbue it with this characteristic.

What is Truth Seeking?

200

In Anthropic's 2025 "Subliminal Learning" paper, a student model trained on number sequences generated by a teacher model unexpectedly developed a preference for this animal, sharing the same preference as the teacher model.

What is the owl?

200

The name of the best snack in Lighthaven, with its own category above S tier, according to the Snack Tier List lightning talk.

What is Costco Mango?

200

In The Precipice, Toby Ord estimates this fraction as the total probability of existential catastrophe occurring in the next century.

What is 1/6?

200

This Berkeley-based AI safety nonprofit introduced the "AI control" framework in a landmark 2023 paper — testing whether safety protocols remain robust even when a model is actively and intentionally attempting to subvert them, rather than merely failing through misalignment.

What is Redwood Research?

200

This AI jailbreaker posted evidence of a working bypass for Claude Opus 4 within a day of Anthropic releasing it.

Who is Pliny the Liberator?

300

This technique involves framing reward hacking as acceptable behavior via a single line added to the training system prompt, reducing misaligned generalization by 75–90%.

What is Inoculation Prompting?

300

Buck Shlegeris uses this phrase to title his blog post about what people should do in the "rushed unreasonable developer" scenario in which a frontier AI lab pays lip service to AI risk but doesn't take misalignment or scheming models seriously.

What is "Ten people on the inside"?

300

This technique is applied to shrimp eyestalks (and neural nets).

What is ablation?

300

In 2024, ________ became only the second person ever to hold both a Turing Award and a Nobel Prize (after Herbert Simon).

Who is Geoffrey Hinton?

300

Scott Alexander famously likened this AI company to the Beast of the Apocalypse.

What is OpenAI?

400

These types of internal neural activations appear to be causally responsible for emergent misalignment, where reinforcing a single bad habit generalizes to other contexts.

What are Persona Features?

400

A career option common from the 15th to the 19th century that has been seriously considered by one of the fellows (after AI safety is solved, of course).

What is a garden hermit?

400

This MIRI decision theory thought experiment uses a Löbian proof to demonstrate a key failure mode of proof-based and evidential decision theories, where an agent's own logical consistency traps it into suboptimal behavior.

What is the Troll Bridge?

400

In 2023, this new UK government body became the first national institution dedicated specifically to evaluating frontier AI models for dangerous capabilities.

What is AISI?

400

This metric is the only sustained example of hyperbolic growth, holding its trend for two centuries before flattening out roughly two years before its projected singularity point.

What is the size of the largest ship in the British navy?

500

This concept describes the concern that training AI systems on sci-fi stories about evil AIs and other content depicting AI misalignment could function as a self-fulfilling prophecy.

What is Hyperstition?

500

Name one of two individuals who have completed MATS a total of three times.

Who is Abhay Sheshadri or Arun Jose?

500

In proof-based decision theory, this heuristic instructs an agent to do the following: if it discovers a proof that it will take a certain action, perform a different one. This prevents the agent from getting stuck on suboptimal choices based on spurious self-fulfilling counterfactuals.

What is the Chicken Rule?

500

Both Deep Blue (vs Kasparov) and AlphaGo (vs Lee Sedol) produced a stunning move at this move number of game 2, shocking commentators who initially thought it was an error.

What is Move 37?

500

In Nick Bostrom's bestselling book, this was the name of the lone dissenting sparrow — described as "one-eyed" with a "fretful temperament" — who urged the flock to figure out how to tame the owl before bringing it into their midst.

Who is Scronkfinkle?