06-reference

3blue1brown but what is a neural network

Sun Apr 19 2026 20:00:00 GMT-0400 (Eastern Daylight Time) · reference · source: 3Blue1Brown (YouTube) · by Grant Sanderson
3blue1brown · grant-sanderson · neural-networks · deep-learning · mnist · perceptron · sigmoid · relu · weights · biases · hidden-layers · mathematical-pedagogy · canonical-explainer · foundational-ai-explainer

3Blue1Brown — But what is a neural network? | Deep learning chapter 1

Why this is in the vault

This is the most-viewed 3Blue1Brown video (22.9M views as of April 2026) and the canonical lay-accessible explainer for “what is a neural network as a piece of math, not a buzzword.” Posted October 2017 — a few months after Vaswani et al.’s Attention Is All You Need, two years before GPT-2, five years before ChatGPT — it has aged exceptionally well because it deliberately stops at the structural primitives (neurons-as-numbers, weighted sums, sigmoid squishification, weight matrices, bias vectors, layer composition) that every subsequent architecture, including transformers and diffusion U-Nets, still reduces to. Grant’s pedagogical move is the load-bearing one: an MNIST digit classifier with two hidden layers of 16 neurons and ~13,000 weights and biases, motivated from the question “why might a layered structure behave intelligently?” rather than from a definition.

The video earns its keep because:

  1. It is the canonical primer to point any non-technical person at when explaining what a neural net actually is — Sanity Check readers, prospective clients, podcast guests, founder family — and we should have one citable vault entry for the canonical explainer.
  2. The structural framing (network-as-function with 13K parameters; learning = finding good parameter values; layered abstraction = features-of-features) is the cleanest available decomposition of the foundational object every modern AI system inherits.
  3. Grant explicitly closes by noting that the “hoped-for” interpretability story (layer 2 = edges, layer 3 = loops, layer 4 = digits) does not turn out to be what the trained network actually does — a humility note that matures the whole introduction and points directly at the mechanistic-interpretability research thread that Anthropic, Sutskever, Karpathy, et al. are still chasing nine years later.
The closing exchange with Lisha Li from Amplify Partners (sigmoid → ReLU as the modern activation default) is a rare on-the-record reminder that even the foundational primer’s specific design choices have already been superseded — a useful inoculation against treating any AI explainer as time-invariant.

Core argument

  1. A neuron is a thing that holds a number between 0 and 1, called its activation. The first layer of an MNIST classifier holds 784 neurons (one per pixel of a 28×28 image), the last layer holds 10 (one per digit), with hidden layers of 16 neurons each in between. The brightest output neuron is the network’s classification.
  2. The hope (not necessarily the reality) is that hidden layers compose subfeatures. Layer 2 might recognize little edges, layer 3 might recognize loops and long lines, layer 4 might recognize “loop on top + line on right = nine.” Whether the trained network actually decomposes this way is an empirical question Grant flags but does not resolve in this video.
  3. Each connection between neurons has a weight; each neuron has a bias. A neuron’s pre-activation is the weighted sum of the previous layer’s activations plus the bias. The post-activation passes that pre-activation through a “squishification” function — historically sigmoid, in modern practice ReLU.
  4. The network with two 16-neuron hidden layers has ~13,000 weights and biases total. “Learning” means finding values for these 13,000 numbers that make the network solve the problem.
  5. Compactly: each layer transition is sigmoid(W·a + b) — one matrix-vector product, one vector add, one elementwise nonlinearity. This is the entire forward pass. Linear algebra is the load-bearing math; matrix-multiply optimizations (GPUs, BLAS) are why training is tractable.
  6. The whole network is just a function from R^784 → R^10. Absurdly complicated, but a function. The “is it just a function?” framing is the inoculation against magical thinking — it’s also the framing that makes the network amenable to gradient-based optimization, which is the next video.
  7. Pedagogical structure. Open with concrete object (the sloppy three). Pose impossibility (write a program from scratch). Introduce minimal primitive (neuron-as-number). Build up structure (layers → weighted sums → activations → matrix form). Close by acknowledging this won’t fully work and the next chapter handles training. The structural template is reusable for any technical explainer.
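The per-neuron arithmetic in point 3 can be sketched in a few lines of Python. All concrete numbers here are hypothetical; the sigmoid is the video's choice, with ReLU noted as the modern default from the closing exchange:

```python
import math

def sigmoid(x):
    # "Squishification": maps any real pre-activation into (0, 1).
    return 1.0 / (1.0 + math.exp(-x))

def relu(x):
    # Modern default activation, per the closing sigmoid-vs-ReLU exchange.
    return max(0.0, x)

def neuron_activation(prev_activations, weights, bias, squish=sigmoid):
    # Pre-activation = weighted sum of previous layer's activations + bias;
    # post-activation = squishification of that sum.
    pre = sum(w * a for w, a in zip(weights, prev_activations)) + bias
    return squish(pre)

# Tiny illustrative neuron (hypothetical weights and inputs):
# pre-activation = 1.0*0.0 + (-2.0)*0.5 + 3.0*1.0 - 0.5 = 1.5
a = neuron_activation([0.0, 0.5, 1.0], [1.0, -2.0, 3.0], -0.5)
```

Swapping `squish=relu` changes only the nonlinearity; the weighted-sum-plus-bias skeleton is identical.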
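Points 4–6 can be checked end to end with a NumPy sketch: the video's 784 → 16 → 16 → 10 layer sizes, random (untrained) weights standing in for learned values, each layer transition as sigmoid(W·a + b), and the parameter count falling out of the shapes:

```python
import numpy as np

rng = np.random.default_rng(0)

# Layer sizes from the video: 784 input pixels -> 16 -> 16 -> 10 digits.
sizes = [784, 16, 16, 10]

# Random untrained weights and biases; "learning" = finding good values for these.
weights = [rng.standard_normal((n_out, n_in))
           for n_in, n_out in zip(sizes, sizes[1:])]
biases = [rng.standard_normal(n_out) for n_out in sizes[1:]]

def sigmoid(z):
    # Elementwise squishification into (0, 1).
    return 1.0 / (1.0 + np.exp(-z))

def forward(a):
    # Each layer transition: one matrix-vector product, one vector add,
    # one elementwise nonlinearity. The whole net is a function R^784 -> R^10.
    for W, b in zip(weights, biases):
        a = sigmoid(W @ a + b)
    return a

# 784*16 + 16 + 16*16 + 16 + 16*10 + 10 = 13,002 — the "~13,000" in the video.
n_params = sum(W.size for W in weights) + sum(b.size for b in biases)

out = forward(rng.random(784))   # a fake "image" of 784 pixel activations
digit = int(np.argmax(out))      # brightest output neuron = classification
```

With random weights the classification is meaningless, which is exactly the point: the architecture fixes the shape of the function, and everything the network "knows" lives in those 13,002 numbers.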

Mapping against Ray Data Co

Open follow-ups

Sponsorship

This video was production-funded by Amplify Partners, a venture capital firm. The funding was disclosed at the time of publication and surfaces explicitly in the closing exchange with Lisha Li, an Amplify Partners principal who completed her PhD on the theoretical side of deep learning. The Amplify funding is structurally similar to a corporate sponsorship of a PBS documentary: it underwrites the production cost without dictating editorial content, and the closing interview is the closest thing to an “ad read.” The mathematics, the pedagogical framing, the choice of MNIST as the example, the sigmoid-vs-ReLU technical caveat — all are Grant’s standard editorial choices, and the video would read identically without the Amplify branding. Treat as disclosed production underwriting, not paid placement, and not author-aligned promotion (Amplify is third-party). The video does not pitch any Amplify portfolio company. Sponsorship surface = clean.