3Blue1Brown — But what is a neural network? | Deep learning chapter 1
Why this is in the vault
This is the most-viewed 3Blue1Brown video (22.9M views as of April 2026) and the canonical lay-accessible explainer for “what is a neural network as a piece of math, not a buzzword.” Posted October 2017 — four months after Vaswani et al.’s Attention Is All You Need, sixteen months before GPT-2, five years before ChatGPT — and it has aged exceptionally well because it deliberately stops at the structural primitives (neurons-as-numbers, weighted sums, sigmoid squishification, weight matrices, bias vectors, layer composition) that every subsequent architecture, including transformers and diffusion U-Nets, still reduces to. Grant’s pedagogical move is the load-bearing one: an MNIST digit classifier with two hidden layers of 16 neurons, ~13,000 weights and biases, motivated from the question “why might a layered structure behave intelligently?” rather than from a definition. The video stays in the vault because (1) it is the canonical primer to point any non-technical person at when explaining what a neural net actually is — Sanity Check readers, prospective clients, podcast guests, founder family — and we should have one citable vault entry for the canonical explainer; (2) the structural framing (network-as-function with 13K parameters; learning = finding good parameter values; layered abstraction = features-of-features) is the cleanest available decomposition of the foundational object every modern AI system inherits; (3) Grant explicitly closes by noting that the “hoped-for” interpretability story (layer 2 = edges, layer 3 = loops, layer 4 = digits) does not turn out to be what the trained network actually does — a humility note that matures the whole introduction and points directly at the mechanistic-interpretability research thread that Anthropic, Sutskever, Karpathy, et al. continue to chase nearly nine years later. The closing exchange with Lisha Li from Amplify Partners (sigmoid → ReLU as the modern activation default) is a rare on-the-record reminder that even the foundational primer’s specific design choices have already been superseded — a useful inoculation against treating any AI explainer as time-invariant.
Core argument
- A neuron is a thing that holds a number between 0 and 1, called its activation. The first layer of an MNIST classifier holds 784 neurons (one per pixel of a 28×28 image), the last layer holds 10 (one per digit), with hidden layers of 16 neurons each in between. The brightest output neuron is the network’s classification.
- The hope (not necessarily the reality) is that hidden layers compose subfeatures. Layer 2 might recognize little edges, layer 3 might recognize loops and long lines, layer 4 might recognize “loop on top + line on right = nine.” Whether the trained network actually decomposes this way is an empirical question Grant flags but does not resolve in this video.
- Each connection between neurons has a weight; each neuron has a bias. A neuron’s pre-activation is the weighted sum of the previous layer’s activations plus the bias. The neuron’s activation is that pre-activation passed through a “squishification” function — historically sigmoid, in modern practice ReLU.
- The network with two 16-neuron hidden layers has ~13,000 weights and biases total. “Learning” means finding values for these 13,000 numbers that make the network solve the problem.
- Compactly: each layer transition is `sigmoid(W·a + b)` — one matrix-vector product, one vector add, one elementwise nonlinearity. This is the entire forward pass; a minimal sketch in code follows this list. Linear algebra is the load-bearing math; matrix-multiply optimizations (GPUs, BLAS) are why training is tractable.
- The whole network is just a function from R^784 → R^10. Absurdly complicated, but a function. The “is it just a function?” framing is the inoculation against magical thinking — it’s also the framing that makes the network amenable to gradient-based optimization, which is the next video.
- Pedagogical structure. Open with concrete object (the sloppy three). Pose impossibility (write a program from scratch). Introduce minimal primitive (neuron-as-number). Build up structure (layers → weighted sums → activations → matrix form). Close by acknowledging this won’t fully work and the next chapter handles training. The structural template is reusable for any technical explainer.
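A minimal sketch of the forward pass the bullets above describe, assuming NumPy and random placeholder weights; the layer sizes come from the video, but the parameter values are stand-ins for whatever training would actually find:

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    # "Squishification": squeezes any real number into (0, 1).
    return 1.0 / (1.0 + np.exp(-z))

# Layer sizes from the video: 28x28 = 784 input pixels, two hidden layers of 16, 10 digits out.
sizes = [784, 16, 16, 10]

# Random placeholder parameters; training would replace these with learned values.
weights = [rng.standard_normal((n_out, n_in)) for n_in, n_out in zip(sizes[:-1], sizes[1:])]
biases = [rng.standard_normal(n_out) for n_out in sizes[1:]]

# 784*16 + 16*16 + 16*10 weights plus 16 + 16 + 10 biases = 13,002 numbers to learn.
print(sum(W.size for W in weights) + sum(b.size for b in biases))  # 13002

def forward(pixels):
    # The whole network is one function from R^784 to R^10.
    a = pixels
    for W, b in zip(weights, biases):
        a = sigmoid(W @ a + b)  # each layer transition: matrix-vector product, vector add, squish
    return a

out = forward(rng.random(784))  # a fake flattened 28x28 image
print(out.argmax())             # the brightest output neuron is the classification
```

Swapping `sigmoid` for `lambda z: np.maximum(z, 0.0)` gives the ReLU variant flagged as the modern default in the outro.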
Mapping against Ray Data Co
- This is the canonical “explain a neural network” reference for any RDCO content surface. When Sanity Check needs to ground a piece in “what is the actual object we’re talking about,” this is the link. When a client briefing needs to point a non-technical exec at one 18-minute primer, this is the link. When a Sanity Check explainer wants to motivate a more advanced concept (transformers, diffusion, RL) by stating “you remember from the 3B1B explainer, a neural net is a function from one vector space to another,” this is the citable predecessor. Worth filing as the canonical lay-AI primer in `~/rdco-vault/02-strategy/distribution/canonical-references.md` if that file doesn’t already exist.
- Pairs directly with the LLM and diffusion explainers in this same ingest cycle. The 2024 LLM explainer (~/rdco-vault/06-reference/2026-04-20-3blue1brown-large-language-models-explained-briefly.md) and the 2025 Welch Labs diffusion guest video (~/rdco-vault/06-reference/2026-04-20-3blue1brown-but-how-do-ai-images-and-videos-actually-work.md) both assume the structural framing this video establishes. Together the three form a coherent “watch in this order” sequence for any reader who needs to go from zero to “I understand why diffusion models can generate images.” Worth packaging as a Sanity Check side rail or lead-magnet sequence — a “3-video AI primer” landing page would have unusually high pull-through.
- Maps directly into CANDIDATES.md CA-014 (high-dimensional surface concentration as the load-bearing geometric intuition for ML). Grant’s network has a 13,000-dimensional parameter space; the loss landscape over that space inherits all the high-dim geometry pathologies (curse of dimensionality, surface concentration, optimization-on-thin-shells) that the 5-dim-volume-peak lecture documents. The two videos together make the mapping explicit: structure (this video) operates over a parameter space whose geometry (the higher-dim-volume lecture) determines what training can and cannot find. Strengthens CA-014 by adding a third 3B1B source that grounds the geometric facts in the actual object that lives in the geometry.
- Reinforces CA-013 (R&D context discipline) at the algorithmic-architecture layer. Grant explicitly notes that 13,000 parameters is “a lot to think about” and immediately introduces the matrix-vector compact notation as the conceptual reduction that lets humans actually reason about the system. Same pattern as IndyDevDan’s R&D framework: when the raw representation overwhelms working memory, you need a compact notation that reduces what you have to hold in mind. The matrix-vector compact form is to neural network reasoning what SKILL.md compact-form is to multi-agent system reasoning.
- Reinforces CA-022 (binary-decision-around-continuous-probability anti-pattern) inversely. The output layer of the MNIST classifier returns 10 continuous activations between 0 and 1, then we pick the brightest. This is exactly the “preserve the probability, don’t collapse to a binary” discipline CA-022 advocates — except every downstream consumer of an MNIST classifier (autoML pipeline, OCR system, document indexer) immediately collapses the 10-vector to argmax and throws away the calibration signal (a toy sketch after this list shows the collapse). Worth flagging as the canonical demonstration that the model already exposes the probability gradient — the binary-collapse happens at the application layer, not the model layer. Strengthens CA-022 with a concrete worked example in the AI/ML context.
- The “hoped-for-vs-actual” interpretability gap is the inoculation against assuming agent-system internals match our intent. Grant flags the gap explicitly: we hope layer 2 recognizes edges, layer 3 recognizes loops, but training doesn’t necessarily produce that decomposition — the network finds some representation that works but not necessarily the human-interpretable one. Direct map to RDCO multi-agent systems: we hope `/check-board` invokes skills in a clean priority-then-channel-then-cycle order, but the trained behavior may be doing something subtly different that happens to produce passable outputs. Worth flagging as a recurring audit question for any skill: “what do we hope the skill is doing, vs what is it empirically doing on the last 10 invocations?” — a candidate `/skill-postmortem` job that samples actual invocations and reports the divergence.
- The pedagogical structure (concrete object → impossibility → primitive → composition → matrix form → acknowledgment of remaining work) is the structural template for technical Sanity Check pieces. Same shape as the higher-dim-volume lecture’s “puzzle-first, formalism-second” pattern. Worth documenting as a reusable Sanity Check editorial pattern: open with a concrete, almost-trivial object; pose the impossibility of the naive solution; introduce the smallest primitive; build up the structure; reach a compact notation; close by acknowledging what the next piece will resolve. ~1500 words per piece, structured this way, would land hard with the data-engineering audience.
- Sigmoid → ReLU is a load-bearing reminder that “best practice” in AI has a half-life measured in years. Grant ships sigmoid in 2017; Lisha Li’s outro already names ReLU as the modern default. By 2024, GELU and SwiGLU dominate transformer feedforward layers; by 2026, the activation function is rarely the most interesting design choice in a new model. Worth holding as a Sanity Check angle: “Today’s foundational explainer ships yesterday’s defaults — and that’s fine.” Inoculates the audience against treating any single primer as time-invariant truth.
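A toy illustration of the argmax collapse flagged in the CA-022 bullet above, using invented activation values rather than real model output; the point is that the classifier hands downstream code the full 10-vector, and the application layer is what discards the near-tie:

```python
# Hypothetical per-digit activations from an MNIST-style classifier (invented numbers).
scores = [0.02, 0.01, 0.04, 0.46, 0.01, 0.03, 0.02, 0.41, 0.00, 0.00]

# The usual application-layer collapse: keep only the argmax.
label = max(range(10), key=scores.__getitem__)   # -> 3

# The calibration signal that collapse throws away: how close the runner-up was.
top_two = sorted(scores, reverse=True)[:2]
margin = top_two[0] - top_two[1]                 # ~0.05, a 3-vs-7 near-tie

print(label, round(margin, 2))
```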
Open follow-ups
- Promote CA-014 (surface concentration) and consider drafting a Sanity Check piece pairing this video with the higher-dim-volume lecture. Title candidate: “Your Neural Network Lives in 13,000-Dimensional Space (and You Have No Intuition for What That Means).” Open with Grant’s MNIST classifier from this video; pivot to the surface-concentration result from the higher-dim-volume lecture; close on “and this is why optimization doesn’t behave the way you’d expect.” ~1500 words. CA-014 ripens as soon as a non-3B1B source covers the same geometric ground; this Sanity Check piece could be the trigger.
- Build the “3-video AI primer” landing page or Sanity Check side rail. Sequence: this video (structure) → 2024 LLM explainer (architecture) → 2025 Welch Labs diffusion (generative). 60 minutes total runtime. Frame as “the prerequisite for everything else we publish about AI.” Would convert as a lead magnet for the data-engineering audience that wants to ground itself before reading harness-thesis content.
- Document the pedagogical pattern as a reusable Sanity Check editorial template. Concrete object → impossibility → primitive → composition → compact notation → acknowledgment of remaining work. Pair with Grant’s “puzzle-first, formalism-second” pattern from the higher-dim-volume lecture as twin structural templates. Worth filing in `~/rdco-vault/02-strategy/sanity-check/style-guide/` if that path exists.
- Curiosity question: how much of modern transformer interpretability research is the working-out of Grant’s 2017 humility note? Anthropic’s mech-interp program, Karpathy’s “neural net microscopy” remarks, OpenAI’s superposition work — all are downstream of “the trained network does NOT decompose into the features we hoped.” Worth a vault note tracing the explicit research lineage from this video’s closing acknowledgment to current interpretability work. Low-priority research backlog item.
- Skill-iteration: the audit question “what does the skill empirically do on the last 10 invocations” is a strong candidate for a `/skill-postmortem` skill. Grant’s hoped-vs-actual gap maps cleanly onto skill behavior auditing: most skills have an aspirational README and an empirical behavior, and the gap is invisible without sampling. Worth queuing as a Notion task.
- Consider a “How RDCO’s autonomous loop is just a 13,000-knob neural net” Sanity Check piece. Stretch the analogy: every cron, every skill threshold, every Notion-board priority is a “weight”; the autonomous loop is a function from “vault state” to “actions taken”; tuning the weights is the founder-in-the-loop work that `/improve` automates. The analogy is loose but the pedagogical move (structure-as-function) translates directly. Lower-confidence content angle, but worth holding for a slow news week.
Sponsorship
This video was production-funded by Amplify Partners, a venture capital firm — the funding was disclosed at the time of publication and surfaces explicitly in the closing exchange with Lisha Li, an Amplify Partners principal who completed her PhD on the theoretical side of deep learning. The Amplify funding is structurally similar to a corporate sponsorship of a PBS documentary: it underwrites the production cost without dictating editorial content, and the closing interview is the closest thing to an “ad read.” The mathematics, the pedagogical framing, the choice of MNIST as the example, the sigmoid-vs-ReLU technical caveat — all are Grant’s standard editorial choices and the video would read identically without the Amplify branding. Treat as disclosed production-underwriting, not paid placement, and not author-aligned promotion (Amplify is third-party). The video does not pitch any Amplify portfolio company. Sponsorship surface = clean.
Related
- ~/rdco-vault/06-reference/transcripts/2026-04-20-3blue1brown-but-what-is-a-neural-network-transcript.md — full transcript
- ~/rdco-vault/06-reference/2026-04-20-3blue1brown-large-language-models-explained-briefly.md — same author, 7-year-later companion piece on LLMs that builds on this video’s primitives
- ~/rdco-vault/06-reference/2026-04-20-3blue1brown-but-how-do-ai-images-and-videos-actually-work.md — same channel, guest-video on diffusion models that assumes this video’s framing
- ~/rdco-vault/06-reference/2026-04-20-3blue1brown-volume-higher-dim-spheres-most-beautiful-formula.md — high-dimensional geometry of the parameter space this network lives in
- ~/rdco-vault/06-reference/2025-10-17-dwarkesh-karpathy-ghosts-not-animals.md — Karpathy on what training actually finds vs what we hope it finds; downstream of Grant’s interpretability humility note
- ~/rdco-vault/06-reference/2026-04-19-dwarkesh-ilya-sutskever-age-of-research.md — Sutskever on the historical arc that took the architecture from this 2017 explainer to modern frontier models
- ~/rdco-vault/06-reference/concepts/CANDIDATES.md — CA-014 (high-dim surface concentration), CA-013 (R&D context discipline), CA-022 (binary-around-continuous-probability) all draw on this video as a supporting source