“Some thoughts on the Sutton interview” — Dwarkesh Patel
Episode summary
A reflective post-mortem essay following Dwarkesh’s Richard Sutton interview. Dwarkesh steel-mans Sutton’s bitter-lesson critique of LLMs (compute is wasted at deployment, training data is inelastic human data, LLMs build a model of “what humans say next” rather than a true world model, no continual learning), then offers his own counter: imitation learning and RL aren’t categorically different; they’re continuous. Pre-trained LLMs are a useful prior, much as fossil fuels were a necessary intermediary. Sutton’s critique identifies real gaps, but the gaps don’t doom the LLM paradigm; they shape what comes next.
Key arguments / segments
- [00:00] Setup: Dwarkesh acknowledges he understands Sutton’s view better post-interview than during. Apologizes for any misreading. Promises a steelman.
- [00:00] Sutton steelman pt 1 (compute waste): The bitter lesson is not “throw maximum compute” — it’s “find techniques that scalably leverage compute.” Most LLM compute is spent at deployment, where no learning happens. That’s structurally inefficient.
- [00:01] Sutton steelman pt 2 (data inelasticity): Even RLVR (RL from verifiable rewards) uses human-furnished playgrounds. The agent never engages organically with the world. Human data is inelastic and hard to scale.
- [00:01] Sutton steelman pt 3 (no true world model): LLMs model “what a human would say next”; they rely on human-derived concepts. Train one only on data through 1900 and it can’t derive relativity from scratch.
- [00:02] Sutton steelman pt 4 (no continual learning): We need a new architecture so agents can learn on the fly like humans/animals — making the special training phase obsolete.
- [00:02] Dwarkesh’s pushback (preview): The dichotomies aren’t as sharp as Sutton frames them. Imitation learning is continuous with RL. Models of humans serve as a useful prior for true world models. Test-time fine-tuning could plausibly replicate continual learning.
- [00:03] Fossil-fuels analogy (Sutskever): Pre-training data is like fossil fuels — non-renewable, but absolutely crucial to get from water wheels to solar/fusion. You can’t skip the intermediary.
- [00:04] AlphaGo vs AlphaZero: AlphaZero (no human games) was better than AlphaGo (initialized on human play) — but AlphaGo was still superhuman. Human data wasn’t actively detrimental, just not helpful at scale. AlphaZero also used way more compute.
- [00:05] Cultural learning analogy: Humans accumulate knowledge across generations via something more analogous to imitation learning than RL-from-scratch. Language, legal systems, almost all phone tech — we didn’t invent any of it. Neither pure SL nor pure RL describes human learning. “What planes are to birds, supervised learning might end up being to human cultural learning.”
- [00:06] Imitation learning is short-horizon RL: a one-token episode with reward proportional to next-token prediction quality (minimal sketch of the equivalence after this list). Pre-training is a useful prior that lets RL kick in to win IMO golds and write working applications from scratch.
- [00:07] World-model semantics: Whether you call the LLM’s representation a “world model” or “model of humans” is semantic — what matters is whether it helps you bootstrap learning from ground truth. The pasteurized-milk-but-served-cold analogy.
- [00:07] LLMs do build representations: Their training process incentivizes a deep representation of the world. Defining “world model” by a presumed-necessary process rather than capabilities is begging the question.
- [00:08] Continual learning as the hobby horse: An LLM RL’d on outcome rewards learns ~1 bit per episode, where an episode is tens of thousands of tokens. Animals extract way more signal from their environment.
- [00:09] Sutton’s “transition model”: In Sutton’s OaK architecture, outer-loop RL incentivizes some other learning system to extract maximum signal from the environment. Translated to LLMs: fine-tune on observed tokens. Researcher friends report the naive version doesn’t work well.
- [00:09] Possible LLM continual-learning hack: Make supervised fine-tuning a tool call. Outer-loop RL incentivizes the model to teach itself via SL to handle problems that don’t fit in context (toy sketch after this list).
- [00:10] In-context learning analogy: ICL emerged spontaneously from the incentive to predict long sequences. If training let information flow across spans longer than the context window, models could meta-learn the same flexibility at that larger scale.
- [00:10] Concluding frame: Evolution does meta-RL to make an RL agent (animals) that selectively does imitation learning. With LLMs, we’re going the opposite direction: a base model doing pure imitation learning, hoping enough RL will produce a coherent agent with goals and self-awareness. Maybe this won’t work, but Sutton’s first-principles arguments don’t prove much.
- [00:11] Ground truth: Today’s models are actually getting a lot of RL on ground truth — Sutton’s strict critique is less applicable than it sounds.
- [00:11] Final concession: Even if Sutton’s Platonic ideal isn’t the path to first AGI, his critique identifies real gaps — abysmal sample efficiency, dependence on exhaustible human data, lack of continual learning. “If LLMs do get to AGI first, the successor systems they build will almost certainly be based on Richard’s vision.”
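A minimal sketch of the [00:06] equivalence (my illustration, not code from the essay; the fixed reward R = 1 and the teacher-forced action are assumptions of the sketch, the simplest case of “reward proportional to prediction quality”): a one-step REINFORCE episode with the action pinned to the human’s actual next token produces exactly the cross-entropy gradient of ordinary pre-training.

```python
import torch
import torch.nn.functional as F

# Toy "LM": one linear layer over a 10-token vocab. Claim to check:
# a one-step RL episode whose action is teacher-forced to the human's
# actual next token, with fixed reward R = 1, yields the same gradient
# as ordinary next-token cross-entropy.
torch.manual_seed(0)
vocab, d = 10, 8
W = torch.randn(d, vocab, requires_grad=True)
ctx = torch.randn(d)              # stand-in for a context embedding
human_token = 3                   # the token the human actually wrote

# Supervised / imitation gradient: minimize -log p(human_token | ctx).
sl_loss = F.cross_entropy((ctx @ W).unsqueeze(0),
                          torch.tensor([human_token]))
sl_grad = torch.autograd.grad(sl_loss, W)[0]

# One-step REINFORCE: loss = -R * log pi(action | ctx), with the
# action forced to human_token and R = 1.
logp = F.log_softmax(ctx @ W, dim=-1)[human_token]
rl_grad = torch.autograd.grad(-1.0 * logp, W)[0]

print(torch.allclose(sl_grad, rl_grad))  # True: the two updates coincide
```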
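And a toy shape for the [00:09] SFT-as-tool-call hack. Everything here is hypothetical: ToyModel, sft_update, grade, and the probability-nudge “RL” are stand-ins I made up to make the control flow concrete. The inner loop may call fine-tuning as a tool; the outer loop is ordinary outcome-reward RL. The naive fine-tune-on-observed-tokens baseline the researcher friends dismiss would correspond to calling sft_update unconditionally rather than letting the policy decide when and on what.

```python
import random

MAX_STEPS = 4

class Action:
    def __init__(self, kind, text):
        self.kind, self.text = kind, text

class ToyModel:
    """Stand-in agent. `memory` mimics weights changed by SFT;
    `p_distill` is the one policy knob the outer RL loop can move."""
    def __init__(self):
        self.memory = []
        self.p_distill = 0.5

    def act(self, task, transcript):
        if random.random() < self.p_distill and not transcript:
            return Action("distill", f"notes about {task}")
        return Action("answer", f"answer({task}, memory={len(self.memory)})")

def sft_update(model, text):
    # Supervised fine-tuning exposed as a tool call: the model picks
    # which observed tokens to internalize into its weights.
    model.memory.append(text)

def grade(task, answer):
    # Outcome reward only; in this toy, having distilled first helps.
    return 1.0 if "memory=1" in answer else 0.0

def rl_update(model, transcript, reward):
    # Crude stand-in for policy gradient: reinforce the distill action
    # when an episode that used it earned reward.
    if reward > 0 and any(k == "distilled" for k, _ in transcript):
        model.p_distill = min(1.0, model.p_distill + 0.05)

def run_episode(model, task):
    transcript = []
    for _ in range(MAX_STEPS):
        a = model.act(task, transcript)
        if a.kind == "distill":
            sft_update(model, a.text)
            transcript.append(("distilled", a.text))
        else:
            return a.text, transcript
    return "", transcript

random.seed(0)
model = ToyModel()
for task in ["t1", "t2", "t3", "t4"]:
    model.memory = []          # fresh per-episode weights in this toy
    answer, transcript = run_episode(model, task)
    rl_update(model, transcript, grade(task, answer))
print(model.p_distill)         # > 0.5: RL has started favoring self-teaching
```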
Notable claims
- 1 bit per episode: An LLM RL’d on outcome rewards learns ~1 bit per episode, where an episode can be tens of thousands of tokens. [00:08] Striking sample-efficiency framing; see the back-of-envelope after this list.
- Sutskever’s fossil-fuels analogy: pre-training data = fossil fuels — non-renewable but necessary intermediary. From a Sutskever talk a couple months prior. [00:03]
- AlphaGo vs AlphaZero compute: AlphaZero also used much more compute than AlphaGo — under-cited fact in the “human data isn’t needed” discourse. [00:04]
- Continual-learning hack: Make SFT a tool call; outer-loop RL incentivizes the model to teach itself via SL. [00:09]
- Meta-RL inversion: Evolution did meta-RL to make an RL agent that does imitation. LLMs invert this: build an imitation-learner first, hope RL turns it into an agent. [00:10]
- Concession to Sutton: Successor systems to LLMs (post-AGI) will likely be Sutton-style. [00:11]
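Back-of-envelope for the 1-bit claim (assumed numbers, mine not the essay’s: 30,000 tokens stands in for “tens of thousands,” and ~1 bit/token is a round figure for pre-training’s supervision density):

```python
# Assumed numbers, only to show the order of magnitude of the claim.
episode_tokens = 30_000                 # "tens of thousands of tokens"
rl_bits_per_token = 1 / episode_tokens  # 1 bit of outcome reward, spread out
print(f"{rl_bits_per_token:.1e} bits/token from outcome-reward RL")  # ~3.3e-05

pretrain_bits_per_token = 1.0           # rough: a full target every token
print(f"~{pretrain_bits_per_token / rl_bits_per_token:,.0f}x denser "
      f"supervision from next-token prediction")                     # ~30,000x
```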
Guests
Solo essay. References:
- Richard Sutton (the interview that triggered the essay; full episode is a separate vault candidate)
- Ilya Sutskever (fossil-fuels-as-pre-training-data analogy)
- Demis Hassabis / DeepMind (AlphaGo vs AlphaZero)
- Unnamed researcher friends (naive fine-tune-on-observed-tokens doesn’t work)
Mapping against Ray Data Co
Medium-strong alignment. This is more inside-baseball ML epistemology than the “What are we scaling?” essay, which maps to RDCO more directly. But the underlying frames are useful.
Specific connections:
- “Imitation learning is continuous with RL” — useful for any Sanity Check piece pushing back on overly clean ML taxonomies. Real systems aren’t pure of any one paradigm.
- Fossil-fuels-as-intermediary frame (Sutskever) — borrowable for talking about technical debt, legacy data infrastructure, or stop-gap solutions that critics dismiss as dead-ends but that are actually the necessary path.
- 1 bit per episode — striking statistic for any piece on AI sample efficiency or “why training is so expensive for so little incremental capability.”
- Continual learning hack via SFT-as-tool-call — track this as a near-term LLM architecture prediction. If realized, this is a Sanity Check news beat.
- Meta-RL inversion — beautifully clean frame for explaining the LLM-vs-animal-intelligence gap to non-technical readers. Save for a future explainer piece.
Voice fit: This is the kind of patient, multi-layered, non-tribal writing the founder admires. Dwarkesh refuses both the “Sutton is wrong, LLMs are everything” tribe and the “Sutton is right, RLVR is doomed” tribe. Use as a model for handling polarized AI debates without picking a team prematurely.
Sanity Check candidate hook: “An LLM learns one bit per episode. That’s the AI capability story everyone’s missing.”
Related
- 2025-12-23-dwarkesh-what-are-we-scaling — same author, later essay, same continual-learning thesis but applied to short-AGI-timeline criticism
- 2026-03-11-dwarkesh-most-important-question-about-ai — same author, on alignment-to-whom and AI governance
- Richard Sutton interview itself (separate Dwarkesh episode, ~Sep 2025) — the source material this essay reflects on
- Sutskever fossil-fuels talk (separate vault candidate, worth finding the original)
- Bitter Lesson essay by Sutton (canonical reference, vault stub candidate)