3Blue1Brown (Welch Labs guest) — But how do AI images and videos actually work?
Why this is in the vault
37-minute deep dive (July 2025) by Stephen Welch / Welch Labs, hosted on the 3Blue1Brown channel as a guest video during Grant’s paternity leave. The single best lay-accessible explainer of diffusion models in the corpus: better than the original DDPM paper for non-specialists, and better than most short explainers because it grounds the algorithms in their physics-and-geometry origin. The vault keeps it for five reasons:
- Diffusion is the architectural object behind essentially every 2025-2026 production text-to-image and text-to-video system (DALL-E 2/3, the Stable Diffusion family, Sora, WAN, Kling, Runway, Veo), and RDCO needs a citable canonical primer for any client briefing or Sanity Check piece touching generative AI.
- Welch’s “diffusion model = learned time-varying vector field” framing is the conceptually right way to teach the system. It explains in one move why the naive single-step-denoising mental model fails (it under-trains the score function), why DDPM adds noise during generation (the model learns the mean of a Gaussian, and you need to sample from the distribution, not just the mean), and why DDIM works without noise (Fokker-Planck equivalence between the SDE and ODE formulations). The score-function framing is what unifies DDPM, DDIM, and flow matching as instances of the same conceptual object, and most popular-press explainers obscure this.
- The CLIP-as-shared-embedding-space section is the cleanest 5-minute introduction to contrastive learning in the corpus, including the load-bearing geometric demonstration (vector arithmetic on “me wearing a hat” minus “me not wearing a hat” recovers the word “hat” with cosine similarity 0.165). Directly relevant to any RDCO content on embeddings, vector search, or RAG.
- The classifier-free guidance section is the most intuitive explanation available of why you can dial the “prompt strength” knob and watch a tree literally grow in the generated image. The negative-prompt extension (WAN’s Chinese-language negative prompts excluding “extra fingers” and “walking backwards”) is the kind of concrete production detail that makes the abstract math feel real to a lay audience.
- The entire video is built on 2D toy datasets visualized as spirals. Welch’s pedagogical move is to take the 256-dimensional or 4096-dimensional production object, collapse it to 2D for visualization, and then explicitly note where the 2D analogy breaks down in high dimensions. This is the rare popular-AI explainer that treats high-dimensional geometry as a load-bearing constraint and does not pretend the 2D picture is the full story.
Core argument
- All modern image/video generation models work via diffusion. Pure noise → iteratively passed through a transformer that predicts a less-noisy version → repeat 50+ times → realistic image or video. The transformer here is the same architecture as ChatGPT’s, but trained to denoise rather than to predict the next token.
- CLIP gives a shared embedding space for images and text. A 2021 OpenAI paper trained two encoders (one for images, one for captions) on 400M image-caption pairs from the internet, with a contrastive objective: cosine similarity should be high for matching pairs and low for non-matching pairs. The C in CLIP = Contrastive. The learned space lets vector arithmetic operate on concepts (“hat” = wearing-hat-vector minus not-wearing-hat-vector, recoverable with cosine similarity 0.165 against the word “hat”).
- CLIP only goes one way (encode), so it’s not enough for generation. Diffusion models go the other way (decode, by inverting the noise process). The combination — CLIP-style text encoder + diffusion-model image decoder — is what gives modern systems prompt-to-image capability.
- DDPM (Berkeley, 2020) was the first paper to make diffusion image generation actually work. Two non-obvious algorithmic choices: (a) the model is trained to predict the total noise added across the entire forward process, not the noise added in one step; (b) random noise is added to the model output during generation, not just during training. Both choices are essential to image quality.
- The right mental model: diffusion models learn a time-varying vector field. For each (point, time) pair, the model returns a vector pointing back toward the original data distribution. This is also called the score function. The model’s vector field is coarse for large t (early in denoising) and fine-grained for small t (late in denoising), which is why time-conditioning is essential. Around t=0.4, Welch’s spiral example shows a phase transition where the field shifts from pointing toward the center of the spiral to pointing toward the spiral itself.
- The DDPM noise-during-generation step is a sampling step, not a denoising step. The model learns the mean of the conditional distribution; to actually sample from the distribution you need to add zero-mean Gaussian noise after each predicted denoising step. Without noise, generated points collapse to the average of the training distribution — and in image space, averages look blurry (the “tiny sad blurry tree” exemplar).
- DDIM (Stanford & Google, 2020) lets you generate without noise, deterministically, in fewer steps. Using the Fokker-Planck equation from statistical mechanics, the Google Brain team showed there’s an ordinary differential equation (no random component) with the same final-distribution properties as the DDPM stochastic differential equation. DDIM requires no retraining; it’s just a different sampling algorithm. The WAN model uses flow matching, a generalization of DDIM.
- Conditioning alone (passing the text vector as a model input) is not enough for prompt adherence. Stable Diffusion conditioned only on text returns “a shadow in a desert, but no tree.” You also need classifier-free guidance: train the model to handle both class-conditioned and unconditioned inputs (by occasionally dropping the class label during training), then at generation time take the conditioned vector minus the unconditioned vector, scale by alpha, and use the result as the new direction. This decouples “match the data distribution generally” from “match the specific class,” and lets you dial the second up.
- WAN takes guidance further with negative prompts. Instead of subtracting the unconditioned vector, WAN subtracts the vector from a “negative prompt” that explicitly enumerates unwanted features (“extra fingers,” “walking backwards”), passed in Chinese to their text encoder. The negative-prompt vector is the boundary the diffusion process is steered away from.
- The geometric and physical intuitions carry over to high-dimensional spaces to a remarkable degree. The 2D spiral toy dataset visually demonstrates phase transitions, vector-field structure, and guidance dynamics that also hold in the actual production 4096-dimensional image space. Welch is explicit about the limit of the analogy: 2D points landing on the spiral stay on the spiral; high-dimensional generated points may not quite land on the manifold of realistic images, which is why you sometimes see uncanny-valley artifacts even in high-quality models.
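The CLIP vector-arithmetic demonstration above can be sketched numerically. This is a toy with random stand-in embeddings, not a real CLIP encoder; the “hat” and “person” vectors and the noise scale are invented for illustration. It shows both effects the video relies on: subtracting composite embeddings recovers a concept direction, and unrelated high-dimensional vectors are nearly orthogonal.

```python
import numpy as np

def cosine_sim(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

rng = np.random.default_rng(0)
dim = 512

# Hypothetical stand-in embeddings (random vectors, not a trained model):
hat = rng.normal(size=dim)      # direction for the concept "hat"
person = rng.normal(size=dim)   # direction for the person

# Composite embeddings, with a little noise to mimic imperfect encoding.
with_hat = person + hat + 0.2 * rng.normal(size=dim)
without_hat = person + 0.2 * rng.normal(size=dim)

# Subtracting the composites recovers the concept direction.
recovered = with_hat - without_hat
sim_recovered = cosine_sim(recovered, hat)   # high: the "hat" direction survives
sim_unrelated = cosine_sim(person, hat)      # near 0: random high-dim vectors
                                             # are almost orthogonal
```

The near-orthogonality of the unrelated pair is the same high-dimensional fact that makes the 0.165 cosine similarity in the video a meaningful signal rather than noise.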
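The mean-collapse point (why DDPM must sample rather than output the predicted mean) can be made concrete with a one-dimensional toy where the optimal denoiser is known in closed form. This is an illustrative sketch, not the DDPM algorithm: “images” are scalars at -1 or +1, and the posterior under Gaussian noise is computed exactly rather than learned.

```python
import numpy as np

rng = np.random.default_rng(2)

# Toy data: each "image" is a scalar, either -1 or +1 (two sharp outputs,
# whose average is the blurry 0). Forward process: add Gaussian noise.
sigma = 3.0
x0 = rng.choice([-1.0, 1.0], size=5000)
x = x0 + sigma * rng.normal(size=x0.shape)

# What MSE denoising training converges to is the posterior mean; for this
# two-point prior it has the closed form E[x0 | x] = tanh(x / sigma**2).
mean_pred = np.tanh(x / sigma**2)

# Sampling from the posterior instead: P(x0 = +1 | x) is a logistic in x.
p_plus = 1.0 / (1.0 + np.exp(-2.0 * x / sigma**2))
sampled = np.where(rng.random(x.shape) < p_plus, 1.0, -1.0)

mean_abs_of_mean_pred = np.abs(mean_pred).mean()  # well below 1: outputs
                                                  # collapse toward the average
mean_abs_of_sampled = np.abs(sampled).mean()      # exactly 1: sharp samples
```

Outputting the predicted mean lands between the modes, the scalar analogue of the tiny sad blurry tree; sampling from the distribution lands on a mode every time.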
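Classifier-free guidance and the negative-prompt variant reduce to a one-line vector combination. A minimal sketch, assuming the common formulation `uncond + alpha * (cond - uncond)`; the 2-vectors here are made-up stand-ins for the model’s predicted denoising directions, not outputs of any real model.

```python
import numpy as np

def cfg_direction(cond_vec, base_vec, alpha):
    # Extrapolate past the conditional prediction along (cond - base).
    # alpha = 1 recovers plain conditioning; alpha > 1 amplifies the
    # prompt-specific component relative to the baseline.
    return base_vec + alpha * (cond_vec - base_vec)

# Hypothetical predicted denoising vectors at one step:
uncond = np.array([0.2, 0.1])   # "match the data distribution generally"
cond   = np.array([0.5, 0.4])   # "...and match the prompt"

guided = cfg_direction(cond, uncond, alpha=3.0)

# Negative-prompt variant (as in WAN): replace the unconditional baseline
# with the prediction for an explicit negative prompt, so the step is
# steered away from the unwanted features.
neg = np.array([0.6, -0.2])     # hypothetical "extra fingers" direction
guided_neg = cfg_direction(cond, neg, alpha=3.0)
```

Dialing `alpha` up is the “prompt strength” knob: the component shared with the baseline is left alone while the prompt-specific difference is amplified, which is exactly the decoupling the guidance bullet describes.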
Mapping against Ray Data Co
- This is the canonical “what is diffusion, in 37 minutes, for a technically curious audience” reference for any RDCO content surface touching generative AI. Worth filing alongside the 3B1B neural-network primer and LLM primer in ~/rdco-vault/02-strategy/distribution/canonical-references.md. When a Sanity Check piece needs to ground a reader on what a diffusion model actually is before pivoting to “and here’s why generated content is hard to detect” or “and here’s why text-to-video is going to displace stock-footage budgets,” this is the link to send.
- Completes the “3-video AI primer” sequence with the neural-network and LLM videos. Sequence: structure (~/rdco-vault/06-reference/2026-04-20-3blue1brown-but-what-is-a-neural-network.md) → architecture (~/rdco-vault/06-reference/2026-04-20-3blue1brown-large-language-models-explained-briefly.md) → generative (this video). Total runtime ~64 minutes covering the structural object, the autoregressive architecture, and the diffusion architecture — i.e., the three foundational AI archetypes a 2026 reader needs to be literate in. Worth packaging as a Sanity Check side rail or lead-magnet sequence; the package would convert against the data-engineering audience that wants ground truth before reading harness-thesis or distillation-economics content.
- Strongest single source in the vault for CA-014 (high-dimensional surface concentration as the load-bearing geometric intuition for ML). Welch’s entire 2D spiral pedagogical scaffold is built on the high-dim geometry CA-014 catalogues — the spiral lives in 2D for visualization, but the actual diffusion process operates in image-space-dimension (4096+) where surface concentration, near-orthogonality, and manifold-thinness are all load-bearing constraints. Welch is explicit at one point: “2D points landing on the spiral correspond to realistic images; in high-dim image space it appears that our image generation process doesn’t quite make it to the manifold of realistic images, resulting in a blurry non-realistic image.” This is the surface-concentration result expressed at the failure boundary of an actual production system. Worth bumping CA-014 with this as a third strong source from the same author cluster — CA-014 may now be ripe enough to draft.
- Reinforces CA-022 (binary-decision-around-continuous-probability anti-pattern) at the generative-AI layer. Welch explicitly notes that without the random noise step (which preserves the probability distribution by sampling from it rather than collapsing to the mean), all generated points end up at the average of the training distribution — i.e., the model’s output collapses to the deterministic argmax of an underlying probability gradient, and you get a blurry mess. This is the same anti-pattern as in the LLM video (logits → argmax destroys calibration) and the floodplain map (continuous flood probability → binary in/out). Three independent canonical exemplars now: floodplain map (engineering), LLM logits (autoregressive AI), diffusion mean-collapse (generative AI). CA-022 is now ripe with 3+ independent-domain canonical sources and could be drafted as a concept page.
- Reinforces CA-013 (R&D context discipline) at the generative-architecture layer. DDPM → DDIM is itself a Reduce move at the sampling-algorithm layer — DDIM removes the random-noise sampling step (a representational reduction from SDE to ODE) and consequently reduces the number of model evaluations needed per generated image (a compute reduction). Same R&D pattern as the architectural shift from sequential-language-models to Transformers, just one abstraction layer down. Worth noting in CA-013’s synthesis that Reduce moves recur at every level of the generative AI stack: noise → guided noise → guided ODE → flow matching, each step a conceptual reduction that also reduces compute.
- Strong inverse support for CA-020 (pure-agentic application). Welch’s framing — “researchers design the framework but the specific behavior is emergent based on parameter tuning during training” — mirrors Grant’s identical framing in the LLM video. The diffusion model is, in pure-agentic terms, a “trained behavior layer” with a static computational scaffold (transformer architecture, denoising loop, sampling algorithm) — exactly the same partition CA-020 identifies between SKILL.md (behavior) and OS-primitive code (scaffold). Worth a note in CA-020 that the framework-vs-emergent-behavior partition recurs at every level of AI: foundation-model architecture vs trained behavior; agent-harness scaffold vs trained instruction-following; SKILL.md scaffold vs run-time agent reasoning.
- The Fokker-Planck equation as the bridge from SDE-DDPM to ODE-DDIM is a load-bearing example of “physics intuition makes the algorithm tractable.” Same shape as the higher-dim-volume lecture’s reliance on Archimedes’ cylindrical-projection insight to derive the volume formula. Worth adding to the “Notation Is The Conceptual Move” candidate (CA-012) as a fourth or fifth source — the physics framing (Brownian motion → diffusion → score function → SDE → ODE via Fokker-Planck) is what makes the algorithm space discoverable. Without the physics vocabulary, the DDPM-to-DDIM bridge is opaque; with it, the bridge is a known mathematical equivalence. Direct demonstration of CA-012’s thesis that the right notation/vocabulary is what makes patterns thinkable.
- The negative-prompt-in-Chinese detail is a Sanity Check-quality production anecdote. WAN’s standard negative prompt — “extra fingers, walking backwards, [etc.],” passed to the text encoder in Chinese — is the kind of concrete production detail that humanizes the abstract math and makes the system feel built-by-engineers-with-tradeoffs rather than emerging-from-the-aether. Worth filing as a reusable Sanity Check anecdote for any piece touching prompt engineering or production-AI quirks.
- The “all you need is language” closing line is a strong Sanity Check hook for the distribution-economics thread. Welch’s closer: “To create incredibly lifelike and beautiful images and video, you no longer need a camera, you don’t need to know how to draw or how to paint, or how to use animation software. All you need is language.” This is the citable lay-quote for any piece arguing that the generative-AI bottleneck has shifted from skilled-craft-execution to taste-and-articulation — directly relevant to the data-engineering audience’s anxiety about being commoditized. Worth filing in ~/rdco-vault/02-strategy/distribution/canonical-quotes.md if that file exists.
- Strengthens the case for a “generative AI, but actually understood” Sanity Check sub-thread. Most popular AI coverage treats generative models as black boxes that produce surprising outputs. This video proves that the underlying mathematics is teachable to a lay audience in 37 minutes. Worth committing to a recurring Sanity Check sub-thread that takes one production AI capability per quarter and explains it from first principles — diffusion (this video as primer), retrieval (RAG mechanics), reasoning (chain-of-thought structure), agents (harness mechanics). Could become the editorial backbone of a “Sanity Check explained” lead-magnet series.
Open follow-ups
- Promote CA-022 to a concept page. With this video, CA-022 now has 3 independent-domain canonical sources (floodplain map, LLM logits, diffusion mean-collapse) — the promotion bar is met. Worth drafting as ~/rdco-vault/06-reference/concepts/binary-around-continuous-probability.md. Lead with the floodplain map for visceral grounding, pivot to LLM logits and diffusion mean-collapse for the AI domain. ~1500 words. Strong Sanity Check angle: “Whenever you collapse a probability gradient to a binary decision, you’re throwing away the calibration signal — and that signal was the most expensive thing the model produced.”
- Promote CA-014 to a concept page. With this video, CA-014 has 3 strong 3B1B-cluster sources, and Welch makes the high-dim manifold/surface-concentration link explicit at the failure boundary of a production system. Worth drafting as ~/rdco-vault/06-reference/concepts/high-dim-surface-concentration.md. Could be paired with CA-022 in a single Sanity Check piece on “Why High-Dimensional Intuition Is Wrong (And What That Costs You).”
- Build the “3-video AI primer” Sanity Check side rail / lead-magnet now. All three component videos are filed (this ingest cycle). The packaging is ready; it just needs a lead-magnet landing page with a one-paragraph framing per video and an embed for each. ~2-hour build. Would convert hard for the data-engineering audience.
- Add this video as a 4th or 5th source to CA-012 (Notation Is The Conceptual Move). The physics-vocabulary-makes-DDIM-discoverable thread is a direct demonstration of CA-012’s thesis. Worth bumping CA-012 with the Fokker-Planck-as-bridge example and the score-function-as-vector-field naming as additional in-video evidence.
- Sanity Check angle: “Why Your Diffusion Model’s Output Looks Blurry (And What the Math Says to Do About It).” Lead with the sad-blurry-tree exemplar. Pivot to mean-collapse vs distribution-sampling. Land on the practical implication for production systems (don’t remove the noise step in your sampler; if you must, switch to DDIM with proper step-size scheduling). ~1500 words. Technical-reader audience.
- Curiosity question: how does flow-matching relate to score-matching? Welch glances at flow-matching (“WAN uses a generalization of DDIM called flow matching”) without explaining the technical content. Worth a one-page deep-dive note on score-matching → DDPM → DDIM → flow-matching as a four-step lineage of equivalent-but-progressively-cleaner formulations. Useful for any future Sanity Check piece on “the architecture matters less than the formulation” thread. Low-priority research backlog item.
- Track Welch Labs as a recommended author for future ingest. Grant explicitly endorses Welch’s body of work (“If somehow you watch this channel and you’re not already familiar with Welch Labs, you should absolutely go and just watch everything that he’s made”). Welch’s imaginary-numbers series and recent ML content are likely high-quality candidates for /process-youtube backfill. Add Welch Labs to the source-discovery candidates list in ~/.claude/skills/discover-sources/ if that exists, or queue a Notion task to backfill the Welch Labs ML series.
- Skill-iteration: this video at 37 min / 6,530 words is the canonical case for “transcript over 30KB requires chunked reading.” The cycle 25 finding (>30KB requires chunking) held cleanly for this ingest — read in 120-line chunks, no parent-context overflow. Worth confirming the threshold heuristic in ~/.claude/skills/process-youtube/SKILL.md if not already documented.
Sponsorship
This is a guest video commissioned by Grant Sanderson during his paternity leave; production was funded via the 3b1b Patreon supporter program (3b1b.co/support) which Grant redirected to commission pieces from creators while he was away. The video is hosted on the 3Blue1Brown YouTube channel but is editorially Welch Labs work. Stephen Welch’s own product — the Imaginary Numbers book — is mentioned in the video description as the author’s standalone offering, but is not pitched in the video body. The mathematical content (CLIP, DDPM, DDIM, classifier-free guidance) is the standard publicly-known landscape of diffusion-model research; no commercial product placement, no sponsored model API call-out, no affiliate links. The “Special Thanks” credits include Jonathan Ho (DDPM author, also author of the classifier-free guidance paper) — author-acknowledgment, not commercial sponsorship. Treat as author-aligned, viewer-funded guest commission, not paid commercial placement. Sponsorship surface = clean.
Related
- ~/rdco-vault/06-reference/transcripts/2026-04-20-3blue1brown-but-how-do-ai-images-and-videos-actually-work-transcript.md — full transcript
- ~/rdco-vault/06-reference/2026-04-20-3blue1brown-but-what-is-a-neural-network.md — same channel, foundational primer that this video builds on (transformer denoiser inherits NN structural framing)
- ~/rdco-vault/06-reference/2026-04-20-3blue1brown-large-language-models-explained-briefly.md — same channel, autoregressive companion to this generative piece (transformers used in both, with different training objectives)
- ~/rdco-vault/06-reference/2026-04-20-3blue1brown-volume-higher-dim-spheres-most-beautiful-formula.md — high-dimensional geometry that the diffusion process operates in; surface-concentration result explains why image-space generation doesn’t always land on the manifold
- ~/rdco-vault/06-reference/2026-04-20-3blue1brown-exploration-epiphany-paul-dancstep.md — guest-video format precedent on the same channel; another Patreon-funded paternity-leave commission
- ~/rdco-vault/06-reference/concepts/CANDIDATES.md — CA-014 (high-dim surface concentration), CA-022 (binary-around-continuous-probability), CA-012 (notation-is-the-conceptual-move), CA-013 (R&D context discipline), CA-020 (pure-agentic application) all reinforced by this video