This is the primary, fundamental task of a Large Language Model: guessing which word or unit comes next in a sequence.
What is a next token/word predictor?
These are the fundamental units of the AI's internal "circuitry," loosely inspired by the interconnected cells in a human brain.
What are neurons?
Most Large Language Models are initially trained on massive amounts of data scraped from this vast, often messy, public source.
What is the internet?
This colorful term describes when an LLM confidently generates information that is factually incorrect, nonsensical, or entirely made up.
What is a hallucination?
This technique, abbreviated as RAG, allows an LLM to look up current information from an external database before generating a response.
What is Retrieval Augmented Generation?
Unlike humans, LLMs do not "think" or "reason" in the traditional sense; instead, they perform this mathematical process on patterns in data.
What is calculation or statistical pattern matching?
This revolutionary model architecture, introduced in 2017, is the "T" in GPT and serves as the heart of systems like ChatGPT.
What is the Transformer?
This training method allows the model to learn from the raw data itself without needing humans to provide specific "correct" labels for every example.
What is self-supervised learning?
This is the term for the technical limit on how many words or tokens a model can "remember" and process at once during a conversation.
What is a context window?
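The clue above can be sketched in a few lines of Python. This is a toy illustration only: the window size is invented, and real models count tokens, not words.

```python
# Sketch of a context window: the model can only attend to its most
# recent N tokens, so anything older is truncated away.
CONTEXT_WINDOW = 8  # hypothetical limit, in tokens

def fit_to_window(tokens):
    """Keep only the most recent tokens that fit inside the window."""
    return tokens[-CONTEXT_WINDOW:]

conversation = "the quick brown fox jumps over the lazy sleeping dog".split()
visible = fit_to_window(conversation)
print(visible)  # the first two words have fallen out of the window
```

Once the oldest tokens fall outside the window, the model has no access to them at all, which is why long chats can "forget" their beginnings.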
This phenomenon occurs when a model at a massive scale suddenly exhibits complex behaviors that were not present in smaller versions of the same model.
What is AI emergence?
When an LLM evaluates "Once upon a...", it uses this mathematical concept to give "time" a higher likelihood of appearing than "armadillo."
What is probability?
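The "Once upon a..." clue can be made concrete with a toy probability distribution. The numbers below are invented for illustration; a real model computes a distribution over tens of thousands of tokens.

```python
# Hypothetical next-word probabilities for the prefix "Once upon a..."
toy_distribution = {
    "time": 0.92,        # overwhelmingly likely continuation
    "dream": 0.04,
    "hill": 0.03,
    "armadillo": 0.0001, # technically possible, vanishingly unlikely
}

def most_likely(dist):
    """Return the candidate word with the highest probability."""
    return max(dist, key=dist.get)

print(most_likely(toy_distribution))  # -> time
```

Sampling strategies vary (greedy, temperature, top-k), but all of them start from a probability distribution like this one.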
This key mechanism allows a transformer to weigh the importance of different words in a sentence, regardless of how far apart they are.
What is self-attention?
Known by the acronym RLHF, this process uses human evaluators to rank responses, teaching the model to be more helpful and less toxic.
What is Reinforcement Learning from Human Feedback?
LLMs struggle to count letters in words like "strawberry" because they process text in these multi-letter chunks rather than individual characters.
What are tokens?
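The "strawberry" clue can be sketched as follows. The split shown is hypothetical; real tokenizers (byte-pair encoding and similar) produce different chunks, but the point stands.

```python
# A hypothetical subword split of "strawberry" (real tokenizers differ).
tokens = ["str", "aw", "berry"]

# The model "sees" three opaque chunks, not ten letters, which is why
# character-level questions like "how many r's?" trip it up.
assert "".join(tokens) == "strawberry"
print(len(tokens))              # chunks the model sees: 3
print("strawberry".count("r"))  # actual letter count: 3
```

The letter "r" is buried inside the chunks, so counting it requires character-level access the model never gets.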
In a standard transformer, this part "squishes" input words into a set of numbers, while the other part expands them back into predicted words.
What are the encoder and decoder?
This specific type of model predicts a word, then adds that word to its own input to predict the following word in a continuous loop.
What is an auto-regressive model?
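The auto-regressive loop described above can be sketched with a toy lookup table standing in for the neural network; every transition here is invented for illustration.

```python
# Toy "model": maps a context tuple to the next word.
toy_model = {
    ("once",): "upon",
    ("once", "upon"): "a",
    ("once", "upon", "a"): "time",
}

def generate(prompt, steps):
    """Predict a word, append it to the input, repeat."""
    sequence = list(prompt)
    for _ in range(steps):
        next_word = toy_model[tuple(sequence)]  # predict from full context
        sequence.append(next_word)              # feed prediction back as input
    return sequence

print(generate(["once"], 3))  # -> ['once', 'upon', 'a', 'time']
```

Note that nothing in the loop ever revisits an earlier choice, which is exactly why errors can accumulate (see the later clue on error accumulation).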
To an LLM, words are represented as these points in a multi-dimensional space, where similar meanings are geographically close to one another.
What are vectors (or vector space/embeddings)?
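The "closeness" in that clue is usually measured with cosine similarity. The 3-dimensional vectors below are invented toy values; real embeddings have hundreds or thousands of learned dimensions.

```python
import math

# Toy "embeddings" (invented numbers, 3 dimensions for readability).
embeddings = {
    "king":  [0.90, 0.80, 0.10],
    "queen": [0.88, 0.82, 0.15],
    "taco":  [0.10, 0.05, 0.95],
}

def cosine_similarity(a, b):
    """Similar meanings -> vectors pointing in similar directions."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# "king" sits near "queen" and far from "taco" in this toy space.
print(cosine_similarity(embeddings["king"], embeddings["queen"]))
print(cosine_similarity(embeddings["king"], embeddings["taco"]))
```

The same geometry is what lets retrieval systems (like the RAG setup mentioned earlier) find semantically related documents.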
This process involves retraining a model on specific datasets to help it understand the intent behind prompts like "write an essay" rather than just completing a sentence.
What is instruction tuning?
Because transformers are "backward-looking," they lack this human ability to imagine the future outcomes of various actions to reach a goal.
What is look-ahead (or planning)?
These are the specific "resistors" and "gates" in a neural network whose values are tweaked millions of times during training.
What are parameters?
In Mark Riedl's typewriter metaphor, these represent the 50,000 potential words an LLM can choose to "strike" onto the page.
What are striker arms?
Self-attention uses these three specific components—named after hash table terms—to determine how words relate to each other.
What are queries, keys, and values?
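The query/key/value mechanism can be sketched as scaled dot-product attention on tiny 2-dimensional vectors; all numbers are invented, and real attention runs over many heads and dimensions.

```python
import math

# One word issues a query ("who is relevant to me?"); every word is
# indexed by a key and carries a value -- the hash-table analogy.
query = [1.0, 0.0]
keys   = [[1.0, 0.0], [0.0, 1.0]]  # two words that can be attended to
values = [[5.0], [9.0]]            # the information each word carries

def attend(q, keys, values):
    """Scaled dot-product attention for a single query."""
    d = len(q)
    scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in keys]
    exps = [math.exp(s) for s in scores]
    weights = [e / sum(exps) for e in exps]  # softmax over the scores
    return [sum(w * v[i] for w, v in zip(weights, values))
            for i in range(len(values[0]))]

out = attend(query, keys, values)
print(out)  # a blend of the values, weighted toward the matching key
```

Because the query aligns with the first key, the output leans toward that key's value, regardless of where the words sit in the sentence.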
A potential flaw of RLHF where the model learns to flatter or agree with the user's bias simply to receive a positive reward.
What is sycophancy?
In an auto-regressive model, these can accumulate over time because the system has no inherent way to "change its mind" or self-correct once a choice is made.
What are errors?
This "trick" is what makes a chatbot appear to have a memory; the system actually feeds the entire previous chat log back into the model with every new prompt.
What is maintaining the conversation history/log?
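The memory "trick" in the final clue can be sketched in a few lines. The `fake_llm` function below is a purely illustrative stand-in for a real model call.

```python
# The chatbot is stateless: the wrapper re-sends the entire transcript
# with every new user message, creating the illusion of memory.
def fake_llm(prompt):
    """Stand-in for a real model call (illustrative only)."""
    return "OK (saw %d chars of context)" % len(prompt)

history = []

def chat(user_message):
    history.append("User: " + user_message)
    full_prompt = "\n".join(history)  # whole log goes back in every turn
    reply = fake_llm(full_prompt)
    history.append("Bot: " + reply)
    return reply

chat("Hello!")
chat("What did I just say?")  # "remembered" only because the first
                              # turn is inside the new prompt
print(len(history))           # 4 entries: two user turns, two bot turns
```

This is also why long conversations eventually collide with the context window: the replayed log itself consumes tokens.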