06-reference

3blue1brown but how do ai images and videos actually work

2026-04-19 · reference · source: 3Blue1Brown (YouTube, guest video) · by Stephen Welch (Welch Labs)
3blue1brown · welch-labs · stephen-welch · diffusion-models · ddpm · ddim · clip · dalle2 · stable-diffusion · classifier-free-guidance · brownian-motion · vector-fields · score-function · flow-matching · generative-ai · mathematical-pedagogy · foundational-ai-explainer · guest-video

3Blue1Brown (Welch Labs guest) — But how do AI images and videos actually work?

Why this is in the vault

37-minute deep dive (July 2025) by Stephen Welch / Welch Labs, hosted on the 3Blue1Brown channel as a guest video during Grant’s paternity leave, and the single best lay-accessible explainer of diffusion models in the corpus — better than the original DDPM paper for non-specialists, and better than most short explainers because it grounds the algorithms in their physics-and-geometry origin. The vault keeps it for five reasons:

  1. Diffusion is the architectural object behind essentially every 2025-2026 production text-to-image and text-to-video system (DALL-E 2/3, Stable Diffusion family, Sora, WAN, Kling, Runway, Veo), and RDCO needs a citable canonical primer for any client briefing or Sanity Check piece touching generative AI.
  2. Welch’s “diffusion model = learned time-varying vector field” framing is the conceptually right way to teach the system, and explains in one move why the naive single-step-denoising mental model fails (it under-trains the score function), why DDPM adds noise during generation (because the model learns the mean of a Gaussian and you need to sample from the distribution, not just the mean), and why DDIM works without noise (Fokker-Planck equivalence between SDE and ODE formulations). The score-function framing is what unifies DDPM, DDIM, and flow-matching as instances of the same conceptual object — and most popular-press explainers obscure this.
  3. The CLIP-as-shared-embedding-space section is the cleanest 5-minute introduction to contrastive learning in the corpus, including the load-bearing geometric demonstration (vector arithmetic on “me wearing a hat” minus “me not wearing a hat” recovers the word “hat” with cosine-similarity 0.165) — directly relevant to any RDCO content on embeddings, vector search, or RAG.
  4. The classifier-free guidance section is the most intuitive explanation available of why you can dial the “prompt strength” knob and watch a tree literally grow in the generated image — and the negative-prompt extension (WAN’s Chinese-language negative prompts excluding “extra fingers” and “walking backwards”) is the kind of concrete production detail that makes the abstract math feel real to a lay audience.
  5. The entire video is built on 2D toy datasets visualized as spirals. Welch’s pedagogical move is to take the 256-dimensional or 4096-dimensional production object, collapse it to 2D for visualization, and then explicitly note where the 2D analogy breaks down for high-dim cases. This is the rare popular-AI explainer that respects high-dim geometry as a load-bearing constraint and does not pretend the 2D picture is the full story.

Core argument

  1. All modern image/video generation models work via diffusion. Pure noise → iteratively passed through a transformer that predicts a less-noisy version → repeat 50+ times → realistic image or video. The transformer here is the same architecture as ChatGPT’s, but trained to denoise rather than to predict the next token.
  2. CLIP gives a shared embedding space for images and text. A 2021 OpenAI paper trained two encoders (one for images, one for captions) on 400M image-caption pairs from the internet, with a contrastive objective: cosine similarity should be high for matching pairs and low for non-matching pairs. The C in CLIP = Contrastive. The learned space lets vector arithmetic operate on concepts (“hat” = wearing-hat-vector minus not-wearing-hat-vector, recoverable with cosine similarity 0.165 against the word “hat”). A toy sketch of the contrastive similarity and the vector arithmetic follows this list.
  3. CLIP only goes one way (encode), so it’s not enough for generation. Diffusion models go the other way (decode, by inverting the noise process). The combination — CLIP-style text encoder + diffusion-model image decoder — is what gives modern systems prompt-to-image capability.
  4. DDPM (Berkeley, 2020) was the first paper to make diffusion image generation actually work. Two non-obvious algorithmic choices: (a) the model is trained to predict the total noise added across the entire forward process, not the noise added in one step; (b) random noise is added to the model output during generation, not just during training. Both choices are essential to image quality. A toy training sketch of choice (a) follows this list.
  5. The right mental model: diffusion models learn a time-varying vector field. For each (point, time) pair, the model returns a vector pointing back toward the original data distribution. This is also called the score function. The model’s vector field is coarse for large t (early in denoising) and fine-grained for small t (late in denoising), which is why time-conditioning is essential. Around t=0.4, Welch’s spiral example shows a phase transition where the field shifts from pointing toward the center of the spiral to pointing toward the spiral itself.
  6. The DDPM noise-during-generation step is a sampling step, not a denoising step. The model learns the mean of the conditional distribution; to actually sample from the distribution you need to add zero-mean Gaussian noise after each predicted denoising step. Without noise, generated points collapse to the average of the training distribution — and in image space, averages look blurry (the “tiny sad blurry tree” exemplar). A toy sampling loop showing this added-noise step follows this list.
  7. DDIM (Stanford & Google, 2020) lets you generate without noise, deterministically, in fewer steps. Using the Fokker-Planck equation from statistical mechanics, the Google Brain team showed there’s an ordinary differential equation (no random component) with the same final-distribution properties as the DDPM stochastic differential equation. DDIM requires no retraining; it’s just a different sampling algorithm. The WAN model uses flow matching, a generalization of DDIM.
  8. Conditioning alone (passing text vector as model input) is not enough for prompt adherence. Stable Diffusion conditioned only with text returns “a shadow in a desert, but no tree.” You also need classifier-free guidance: train the model to handle both class-conditioned and unconditioned inputs (by occasionally dropping the class label during training), then at generation time take the conditioned vector minus the unconditioned vector, scale by alpha, and use as the new direction. This decouples “match the data distribution generally” from “match the specific class,” and lets you dial the second up. A sketch of the guidance recombination follows this list.
  9. WAN takes guidance further with negative prompts. Instead of subtracting the unconditioned vector, WAN subtracts the vector from a “negative prompt” that explicitly enumerates unwanted features (“extra fingers,” “walking backwards”), passed in Chinese to their text encoder. The negative-prompt vector is the boundary the diffusion process is steered away from.
  10. The geometric/physical intuitions carry over to high-dim spaces remarkably well. The 2D spiral toy dataset visually demonstrates phase transitions, vector field structure, and guidance dynamics that also work in the actual production 4096-dim image space. Welch is explicit about the limit of the analogy: 2D points landing on a spiral remain on the spiral; high-dim generated points may not quite land on the manifold of realistic images, which is why you sometimes see uncanny-valley artifacts even in high-quality models.
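
Below are five toy sketches expanding the items above. First, item 2: the contrastive objective scores matching image/caption pairs by cosine similarity, and vector arithmetic on embeddings can isolate a concept. The embeddings here are random stand-ins, not CLIP’s actual encoders; only the cosine-similarity math mirrors the video’s demo (the 0.165 figure comes from Welch’s real CLIP embeddings and is not reproduced by this toy).

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """The score CLIP's contrastive objective pushes high for matching
    image/caption pairs and low for non-matching pairs."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Hypothetical stand-ins for CLIP's two encoders mapping images and text into
# one shared space; real CLIP embeddings are ~512-dim, these are 4-dim toys.
rng = np.random.default_rng(0)
emb_me_with_hat    = rng.normal(size=4)   # image: "me wearing a hat"
emb_me_without_hat = rng.normal(size=4)   # image: "me not wearing a hat"
emb_word_hat       = rng.normal(size=4)   # text: the word "hat"

# Contrastive training drives matching pairs toward high similarity:
print(cosine_similarity(emb_me_with_hat, emb_word_hat))

# The video's vector-arithmetic demo: (with hat) minus (without hat) leaves a
# vector that, in a trained CLIP space, points roughly toward the text "hat".
concept_hat = emb_me_with_hat - emb_me_without_hat
print(cosine_similarity(concept_hat, emb_word_hat))
```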
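Items 4(a) and 5 in one sketch: a minimal DDPM-style training loop on a 2D spiral, where a small time-conditioned MLP learns to predict the total noise mixed into each point (not the one-step increment). Querying the trained network at (point, time) pairs is the time-varying vector field / score function of item 5. The network size, cosine-style schedule, and dataset are illustrative assumptions, not the paper’s exact setup.

```python
import torch
import torch.nn as nn

# Toy 2D "spiral" dataset standing in for images (illustrative assumption).
def spiral(n=2048):
    t = torch.rand(n) * 3 * torch.pi
    return torch.stack([t * torch.cos(t), t * torch.sin(t)], dim=1) / 10.0

# Time-conditioned network: input is (noisy point, t), output is the predicted
# TOTAL noise mixed in at time t -- DDPM choice (a), not the one-step increment.
eps_model = nn.Sequential(nn.Linear(3, 128), nn.ReLU(),
                          nn.Linear(128, 128), nn.ReLU(),
                          nn.Linear(128, 2))

# Simple cosine-style schedule: alpha_bar(t) is how much signal survives at t.
def alpha_bar(t):                                # t in [0, 1]
    return torch.cos(t * torch.pi / 2) ** 2

opt = torch.optim.Adam(eps_model.parameters(), lr=1e-3)
for step in range(2000):
    x0 = spiral(256)
    t = torch.rand(256, 1)                       # random diffusion times
    eps = torch.randn_like(x0)                   # the total noise to predict
    ab = alpha_bar(t)
    xt = ab.sqrt() * x0 + (1 - ab).sqrt() * eps  # forward process in one jump
    pred = eps_model(torch.cat([xt, t], dim=1))
    loss = ((pred - eps) ** 2).mean()            # simple DDPM training objective
    opt.zero_grad()
    loss.backward()
    opt.step()

# The learned time-varying vector field (item 5): for any (point, time) pair the
# model implies a direction back toward the data; coarse at large t, finer at small t.
with torch.no_grad():
    grid_point = torch.tensor([[0.5, -0.5]])
    for t_val in (0.9, 0.4, 0.1):
        t = torch.full((1, 1), t_val)
        eps_hat = eps_model(torch.cat([grid_point, t], dim=1))
        direction = -eps_hat                     # minus predicted noise, proportional to the score
        print(t_val, direction.numpy())
```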
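Item 6 as a sketch: a DDPM-style ancestral-sampling loop that reuses `eps_model` and `alpha_bar` from the previous sketch. Each step moves toward the model’s predicted mean and then deliberately adds fresh zero-mean Gaussian noise; delete that final noise term and the samples collapse toward the blurry average of the data, which is the failure mode the video illustrates. The schedule arithmetic is a simplified rendering, not the paper’s exact parameterization.

```python
import torch

@torch.no_grad()
def ddpm_sample(eps_model, alpha_bar, n_steps=50, n_samples=16):
    """DDPM-style ancestral sampling: denoise a little, then add fresh noise.

    The added noise is a sampling step, not a denoising step -- the model only
    predicts the mean, so we draw from the distribution around that mean."""
    x = torch.randn(n_samples, 2)                # start from (almost) pure noise
    ts = torch.linspace(0.95, 0.0, n_steps + 1)  # avoid t=1 for numerical stability
    for i in range(n_steps):
        t, t_next = float(ts[i]), float(ts[i + 1])
        ab, ab_next = alpha_bar(torch.tensor(t)), alpha_bar(torch.tensor(t_next))
        eps_hat = eps_model(torch.cat([x, torch.full((n_samples, 1), t)], dim=1))
        x0_hat = (x - (1 - ab).sqrt() * eps_hat) / ab.sqrt().clamp(min=1e-3)
        sigma2 = (1 - ab_next) / (1 - ab) * (1 - ab / ab_next)   # posterior variance
        x = (ab_next.sqrt() * x0_hat
             + (1 - ab_next - sigma2).clamp(min=0).sqrt() * eps_hat
             + sigma2.sqrt() * torch.randn_like(x))              # the crucial extra noise
    return x
```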
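Item 7 as a sketch: the DDIM-style update is the same computation with the random term removed, run over the same trained `eps_model` with no retraining and typically far fewer steps. Again a simplified rendering of the idea, not the original paper’s notation.

```python
import torch

@torch.no_grad()
def ddim_sample(eps_model, alpha_bar, n_steps=20, n_samples=16):
    """Deterministic DDIM-style sampling: same trained model, no added noise,
    and typically far fewer steps than DDPM's stochastic sampler."""
    x = torch.randn(n_samples, 2)
    ts = torch.linspace(0.95, 0.0, n_steps + 1)
    for i in range(n_steps):
        t, t_next = float(ts[i]), float(ts[i + 1])
        ab, ab_next = alpha_bar(torch.tensor(t)), alpha_bar(torch.tensor(t_next))
        eps_hat = eps_model(torch.cat([x, torch.full((n_samples, 1), t)], dim=1))
        x0_hat = (x - (1 - ab).sqrt() * eps_hat) / ab.sqrt().clamp(min=1e-3)
        x = ab_next.sqrt() * x0_hat + (1 - ab_next).sqrt() * eps_hat  # no noise term
    return x
```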
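Items 8 and 9 as one sketch: classifier-free guidance recombines two forward passes of the same network (conditioned and unconditioned, or, WAN-style, negative-prompt-conditioned) into a single guided noise prediction. The recombination formula is the standard one; the (sample, time, text embedding) interface, the guidance scale, and the way a negative prompt substitutes for the unconditioned branch are illustrative assumptions based on the video’s description, not WAN’s actual code.

```python
import torch

def guided_eps(eps_model, x, t, cond_emb, uncond_emb, neg_emb=None, scale=7.5):
    """Classifier-free guidance on a noise prediction.

    Training occasionally drops the conditioning, so one network can run both
    branches; at generation time we extrapolate from the unconditioned (or
    negative-prompt) prediction toward the conditioned one by `scale`, the
    prompt-strength knob."""
    def predict(emb):
        # Assumed interface: model takes (noisy sample, time, text embedding).
        return eps_model(torch.cat([x, t, emb.expand(x.shape[0], -1)], dim=1))

    eps_cond = predict(cond_emb)
    # Negative prompting (WAN-style): steer away from an explicit embedding of
    # unwanted features ("extra fingers", "walking backwards") instead of the
    # plain unconditioned prediction.
    eps_base = predict(neg_emb if neg_emb is not None else uncond_emb)
    return eps_base + scale * (eps_cond - eps_base)
```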

Mapping against Ray Data Co

Open follow-ups

Sponsorship

This is a guest video commissioned by Grant Sanderson during his paternity leave; production was funded via the 3b1b Patreon supporter program (3b1b.co/support), which Grant redirected to commission pieces from creators while he was away. The video is hosted on the 3Blue1Brown YouTube channel but is editorially Welch Labs’ work. Stephen Welch’s own product — the Imaginary Numbers book — is mentioned in the video description as the author’s standalone offering, but is not pitched in the video body. The mathematical content (CLIP, DDPM, DDIM, classifier-free guidance) is the standard publicly-known landscape of diffusion-model research; no commercial product placement, no sponsored model API call-out, no affiliate links. The “Special Thanks” credits include Jonathan Ho (DDPM author, also author of the classifier-free guidance paper) — an author acknowledgment, not commercial sponsorship. Treat as an author-aligned, viewer-funded guest commission, not paid commercial placement. Sponsorship surface = clean.