

2026-04-19 · reference · source: 3Blue1Brown (YouTube) · by Grant Sanderson
3blue1brown · grant-sanderson · llm · transformer · attention · gpt · pretraining · rlhf · parameters · embeddings · emergent-behavior · mathematical-pedagogy · foundational-ai-explainer · computer-history-museum

3Blue1Brown — Large Language Models explained briefly

Why this is in the vault

8-minute primer made for the Computer History Museum exhibit on AI (Nov 2024) — the shortest, most pedagogically tight LLM explainer Grant has produced and the cleanest non-technical decomposition of what an LLM actually is. The vault keeps it for four reasons:

  1. The “sophisticated mathematical function that predicts the next word” framing is the cleanest one-line definition of an LLM available anywhere, and the rest of the video is the controlled unpacking of every word in that definition (sophisticated = transformers + attention; mathematical function = parameters and weights; predicts = probability distribution, not certainty; next word = autoregressive, sampled with temperature). RDCO needs a citable canonical definition for any client briefing or Sanity Check piece touching LLMs, and this is it.
  2. Grant lands the scale comparison that actually sticks — “if a human read GPT-3’s training data nonstop 24/7 it would take over 2,600 years; the compute to train the largest models would take a billion-additions-per-second machine over 100 million years.” This is the “hold-your-attention-on-scale” line for any Sanity Check explainer that needs to ground a non-technical reader on why these systems are different in kind, not degree (a back-of-envelope check of both numbers follows below).
  3. Grant explicitly distinguishes pre-training (next-word prediction on internet text) from RLHF (workers flagging unhelpful predictions, parameters tweaked to favor user-preferred completions) — a distinction most popular-press AI coverage collapses, and the foundation for understanding why “post-training matters more than the foundation model” is a dominant 2025–2026 lab thesis.
  4. The closing emphasis on emergent behavior (“researchers design the framework but specific behavior is an emergent phenomenon based on how hundreds of billions of parameters are tuned during training; this makes it incredibly challenging to determine why the model makes the exact predictions that it does”) is the citable Grant-quote for any Sanity Check piece on interpretability, alignment, or AI verification — exactly the load-bearing claim Kingsbury and the harness-thesis-dissent cluster are wrestling with at higher abstraction levels.

Posted Nov 2024 — pre-Sutskever-SSI fundraising, pre-Sutton-RL-dead-end interview, pre-Karpathy-ghosts-not-animals — and it has aged exceptionally cleanly because Grant deliberately stops at structural primitives that every subsequent thesis still presupposes.
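
A quick sanity check of both scale numbers in point 2. Every input below is an assumption for illustration (the video states the year figures, not the word counts, reading speed, or operation totals behind them):

```python
# Back-of-envelope check of the two scale claims, with assumed inputs.
words = 3e11        # assumed word count of GPT-3's training text (~300 billion)
wpm = 200           # assumed nonstop human reading pace, words per minute
reading_years = words / wpm / (60 * 24 * 365)
print(f"reading: ~{reading_years:,.0f} years")   # ~2,854 -> "over 2,600 years"

train_ops = 1e25    # assumed operation count to train a largest-scale model
ops_per_sec = 1e9   # the billion-additions-per-second machine
compute_years = train_ops / ops_per_sec / (3600 * 24 * 365)
print(f"compute: ~{compute_years:,.0f} years")   # ~317 million -> "over 100 million"
```

Under these assumptions, both figures land comfortably above Grant’s “over 2,600” and “over 100 million” lower bounds.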

Core argument

  1. A large language model is a sophisticated mathematical function that predicts what word comes next for any piece of text. Instead of predicting one word with certainty, it assigns a probability to all possible next words.
  2. Chatbot interaction is just iterated next-word prediction on a script. Lay out a script template (“interaction between user and AI assistant”), append the user’s input, then have the model predict the next word repeatedly. Sampling less likely words at random (temperature > 0) makes the output feel more natural, and it is why the same prompt can yield different answers across runs even though the underlying function is deterministic (see the sampling sketch after this list).
  3. Training tunes ~hundreds of billions of continuous parameters (weights). Parameters start random (gibberish output), and back-propagation iteratively tweaks them to make the true next word more likely and every other word less likely across many trillions of training examples (a one-step sketch follows this list). The “large” in LLM refers to parameter count, not architectural complexity.
  4. Pre-training compute is mind-boggling at scale. A hypothetical machine performing one billion additions and multiplications per second, running nonstop, would take >100 million years to perform the operations involved in training the largest current models. The number is the scale-anchor for “why this is different in kind from past machine learning.”
  5. Pre-training (next-word prediction) is necessary but insufficient for being a good AI assistant. The second training phase — reinforcement learning from human feedback (RLHF) — has workers flag unhelpful or problematic predictions, and these corrections further change parameters to favor user-preferred completions. The pre-training-vs-RLHF distinction is load-bearing for any subsequent argument about alignment, model behavior, or post-training value capture.
  6. GPUs enable massively parallel processing, which Transformers exploit. Pre-2017 language models processed text one word at a time; the 2017 Google Transformer paper (“Attention Is All You Need”) introduced architectures that “soak it all in at once in parallel.”
  7. Inside a Transformer: words are embedded as long vectors of numbers, then iteratively refined by attention and feedforward layers. Attention lets vectors “talk to each other” so the encoding of “bank” can shift toward “river bank” given context (a toy attention step is sketched after this list). The feedforward layer stores additional language patterns. Many iterations of these two operations enrich each vector until the final vector predicts the next word.
  8. Behavior is emergent, not designed. “Researchers design the framework for how each of these steps work, but the specific behavior is an emergent phenomenon based on how those hundreds of billions of parameters are tuned during training. This makes it incredibly challenging to determine why the model makes the exact predictions that it does.” This is the citable Grant-quote on the mechanistic-interpretability problem.
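
A minimal sketch of points 1–2, assuming a made-up stand-in for the trained function (hash-seeded fake logits over a toy vocabulary, not a real network); only the softmax-with-temperature sampling loop is meant to be faithful:

```python
import numpy as np

rng = np.random.default_rng()
VOCAB = ["the", "cat", "sat", "on", "mat", "."]

def fake_logits(context: list[str]) -> np.ndarray:
    """Hypothetical stand-in for the trained function: one unnormalized
    score per vocabulary word, deterministic in the text so far (within a run)."""
    seed = abs(hash(" ".join(context))) % (2**32)
    return np.random.default_rng(seed).normal(size=len(VOCAB))

def next_word_distribution(context: list[str], temperature: float) -> np.ndarray:
    """Turn scores into a probability over all possible next words."""
    logits = fake_logits(context) / temperature      # temperature > 0
    probs = np.exp(logits - logits.max())            # numerically stable softmax
    return probs / probs.sum()

def generate(prompt: str, n_words: int = 10, temperature: float = 0.8) -> str:
    """Iterated next-word prediction: append a sampled word, repeat."""
    context = prompt.split()
    for _ in range(n_words):
        probs = next_word_distribution(context, temperature)
        context.append(str(rng.choice(VOCAB, p=probs)))  # sample, don't argmax
    return " ".join(context)

# The function itself is deterministic; sampling at temperature > 0 is
# why the same prompt yields different completions across runs.
print(generate("the cat"))
```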
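
A minimal sketch of point 3: one training step on a deliberately tiny bigram predictor (a hypothetical toy, nowhere near a real LLM). The softmax-plus-cross-entropy gradient pushes the true next word's probability up and every other word's down, which is exactly the nudge described above:

```python
import numpy as np

VOCAB = ["the", "cat", "sat", "on", "mat", "."]
IDX = {w: i for i, w in enumerate(VOCAB)}

rng = np.random.default_rng(0)
W = rng.normal(scale=0.1, size=(len(VOCAB), len(VOCAB)))  # parameters start near-random

def train_step(prev_word: str, true_next: str, lr: float = 0.5) -> float:
    """One back-propagation step on a single (previous word, next word) example."""
    logits = W[IDX[prev_word]]
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()                       # predicted next-word distribution
    loss = -np.log(probs[IDX[true_next]])      # cross-entropy on the true next word
    grad = probs.copy()
    grad[IDX[true_next]] -= 1.0                # gradient of softmax + cross-entropy
    W[IDX[prev_word]] -= lr * grad             # true word up, all other words down
    return float(loss)

# Repeated over trillions of examples, this is structurally all pre-training is.
for _ in range(5):
    print(round(train_step("cat", "sat"), 3))  # loss falls step by step
```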
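
A minimal sketch of point 7: one scaled dot-product attention step over made-up vectors. The dimensions and weight matrices are illustrative assumptions; a real transformer learns Wq, Wk, Wv during training and stacks many such layers with feedforward blocks in between:

```python
import numpy as np

rng = np.random.default_rng(1)
n_words, d = 4, 8                        # e.g. the four words of "money in the bank"
X = rng.normal(size=(n_words, d))        # one embedding vector per word

# Assumed (randomly initialized) projection matrices; learned in a real model.
Wq, Wk, Wv = (rng.normal(scale=d ** -0.5, size=(d, d)) for _ in range(3))

Q, K, V = X @ Wq, X @ Wk, X @ Wv
scores = Q @ K.T / np.sqrt(d)            # how relevant each word is to each other word
weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)   # softmax over the context

X_refined = X + weights @ V              # each vector shifts toward a context mixture
print(X_refined.shape)                   # (4, 8): same shape, richer encodings
```

This is the operation that lets the vector for “bank” absorb information from its neighbors, so “river bank” and “money bank” end up encoded differently.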

Mapping against Ray Data Co

Open follow-ups

Sponsorship

This video was made for an exhibit at the Computer History Museum, with no third-party commercial sponsor. The video description explicitly notes: “Instead of sponsored ad reads, these lessons are funded directly by viewers” (3b1b.co/support). The Computer History Museum partnership is institutional rather than commercial — the museum commissioned the video for an exhibit, not an advertisement, and the editorial choices (next-word-prediction framing, RLHF distinction, emergent-behavior closing) are Grant’s standard pedagogical moves. Treat as author-aligned institutional commission, not paid commercial placement. The mathematics is the mathematics; no commercial incentive distorts the explanation. Sponsorship surface = clean.