Richard Sutton on Dwarkesh Patel — Father of RL Thinks LLMs Are a Dead End
Why this is in the vault
Sutton (with Andrew Barto) was awarded the 2024 Turing Award, announced in 2025. He is the most credentialed RL voice alive, and on this episode he flatly states that LLMs are not on the path to AI — they are imitation systems, not learning-from-experience systems, and the field has confused mimicry of humans with intelligence. Whether or not he’s right, this is the highest-status dissent against the LLM-scaling consensus in 2025–26, and it’s a citation we need on hand. Pairs head-to-head with the Ilya Sutskever episode (same processing cycle), which preserves the LLM substrate but argues for continual learning on top — making the two episodes a perfect “what comes after scaling” diptych.
Core argument
- LLMs imitate; they do not understand. A world model lets you predict what will happen. LLMs predict what a person would say. Sutton: “they have the ability to predict what a person would say. They don’t have the ability to predict what will happen.”
- The “good prior” defense of LLMs is wrong. The standard argument — “imitation gives a good prior, then we RL on top” — assumes the prior is in the right shape. Sutton denies this: imitating people doesn’t put the model in a state where a reward signal can refine it into a learning agent.
- Real intelligence requires four pieces, and LLMs have at most one. (a) policy, (b) value function (is reward trending up or down?), (c) perception/state representation, (d) transition model of the world — what will happen if I do this. The fourth is the missing and most important piece, and even the policy that pre-training gives you is not in the right form to be refined by experience. A minimal sketch of the four pieces follows this list.
- Era of experience. Continuing the line of thought from “The Bitter Lesson,” Sutton argues the next phase of AI must learn from the actual stream of sensorimotor experience, not from a static text corpus. Reward is part of that stream but a small part — most learning is unsupervised modeling of “I did X, then Y happened.”
- Transfer is across states, not across tasks. Critique of the field: we set up benchmarks as separate tasks (chess, Go, Atari) and then complain RL doesn’t transfer. Wrong frame. A real agent has one world; “tasks” are just states in that world. The MuZero/AlphaZero family was deliberately not built for cross-task transfer; that’s an engineering limit, not a paradigmatic one.
- Continual learning is non-negotiable. The architecture must learn on the job, every step, in the world it’s deployed in. Period.
- Alignment via “high-integrity” upbringing, not value imposition. Strikingly similar to Ilya’s parenting analogy: don’t dictate ends; instill steerable values. Sutton frames it as voluntary-rather-than-imposed change at civilizational scale.
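To pin down the four-component frame referenced above, here is a minimal sketch of an agent that carries all four pieces and updates them on every step of the experience stream. Illustrative only: the tabular Q-learning/TD machinery, the toy corridor world, and every name in it are mine, not anything described in the episode.

```python
import random
from collections import defaultdict

class ExperienceAgent:
    """Toy agent carrying Sutton's four components; all names invented for this note."""

    def __init__(self, actions, alpha=0.1, gamma=0.95, epsilon=0.1):
        self.actions = actions
        self.alpha, self.gamma, self.epsilon = alpha, gamma, epsilon
        self.q = defaultdict(float)        # (a) policy: action values it acts greedily on
        self.value = defaultdict(float)    # (b) value function: is long-run reward trending up?
        self.model = {}                    # (d) transition model: last observed (s, a) -> (s', r)

    def perceive(self, observation):
        # (c) perception / state representation: trivial pass-through here;
        # in a real agent this is the hard, learned part.
        return observation

    def act(self, state):
        # epsilon-greedy over the learned action values
        if random.random() < self.epsilon:
            return random.choice(self.actions)
        return max(self.actions, key=lambda a: self.q[(state, a)])

    def learn(self, s, a, r, s_next):
        # Continual learning: every step of the stream updates every component.
        best_next = max(self.q[(s_next, a2)] for a2 in self.actions)
        self.q[(s, a)] += self.alpha * (r + self.gamma * best_next - self.q[(s, a)])
        self.value[s] += self.alpha * (r + self.gamma * self.value[s_next] - self.value[s])
        self.model[(s, a)] = (s_next, r)   # "I did X, then Y happened" -- mostly reward-free

# Toy corridor world, states 0..5, reward only for reaching the right end.
def step(state, action):
    s_next = max(0, min(5, state + (1 if action == "right" else -1)))
    return s_next, (1.0 if s_next == 5 else 0.0)

agent = ExperienceAgent(actions=["left", "right"])
obs = 0
for _ in range(2000):                      # one unbroken stream: no train/deploy split
    s = agent.perceive(obs)
    a = agent.act(s)
    obs, reward = step(obs, a)
    agent.learn(s, a, reward, agent.perceive(obs))
```

The point is the shape of the loop: acting, perceiving, and learning happen together in one unbroken stream, which is exactly what a frozen pre-trained model does not do.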
Mapping against RDCO
- This is the strongest available counter-citation when our writing leans too heavily on LLM-centrism. Editorial discipline: every Sanity Check issue that asserts “the LLM-as-substrate is given” should at least be aware of this dissent. Doesn’t have to engage every time, but cite when stakes are high.
- The “imitation vs experience” distinction is operationally useful for product writing. When evaluating an AI product claim, ask: does this thing learn from what actually happens when it acts, or does it just mimic patterns of past human action? Most “AI agent” products in 2026 fail that test. Strong material for a Sanity Check breakdown of agentic product hype.
- Sutton’s four-component frame (policy/value/perception/world-model) is a useful product audit checklist. For RDCO’s own COO-agent build, we should be able to point at where each of the four lives. If three of four are missing, we are — by Sutton’s definition — building a parrot. A minimal audit sketch follows this list.
- “Transfer is across states, not tasks” is a deeply useful reframe for thinking about how to design Ray’s own learning loop — instead of “skill modules,” think “one persistent agent state space with state-to-state generalization.” File against ~/rdco-vault/01-projects/ray-as-coo/architecture-notes.md.
- Pair with Ilya for the diptych essay. Both agree: continual learning is the missing piece, the parent-child alignment analogy is the right one, scaling-as-recipe is over. They disagree on substrate: Ilya keeps the LLM as foundation and adds continual learning; Sutton wants to throw out imitation entirely and start from RL-from-experience. That single disagreement is the next-paradigm question. Candidate Sanity Check title: “Two Turing-grade dissents from the scaling consensus — and what they agree on.”
- Caveat — Sutton has been making some version of this argument for 20+ years. The “Bitter Lesson” was 2019; this is the 2025 update. He is a true believer in RL-from-experience and discounts evidence that contradicts him. Treat the position as the strongest possible articulation of one camp, not as ground truth.
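To make the audit mechanical, a minimal sketch is below; the field names, the audit questions, and the example entry are hypothetical and say nothing about the actual COO-agent build.

```python
# Hypothetical audit sketch: field names, questions, and the example entry are
# invented for this note -- the point is the shape of the check, not the wording.
SUTTON_COMPONENTS = {
    "policy": "Does it choose actions, not just emit text?",
    "value_function": "Does it track whether outcomes are getting better or worse?",
    "perception": "Does it build its own state representation from raw input?",
    "transition_model": "Can it predict what will happen if it acts?",
}

def audit(product: dict[str, bool]) -> str:
    missing = [name for name in SUTTON_COMPONENTS if not product.get(name, False)]
    if len(missing) >= 3:
        return f"parrot, by Sutton's definition (missing: {', '.join(missing)})"
    return f"partial agent (missing: {', '.join(missing) or 'nothing'})"

# Example: a typical 2026 "AI agent" wrapper -- an imitation policy and nothing else.
print(audit({"policy": True, "value_function": False,
             "perception": False, "transition_model": False}))
```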
Open follow-ups
- Build an explicit Sutton-vs-Sutskever comparison table — substrate, learning signal, continual-learning role, alignment posture, time-to-AGI implication. This would make a clean Data Dot or sidebar.
- Sutton’s “transfer between states, not tasks” claim — does this hold against the recent Anthropic interpretability findings on cross-task feature reuse? Vault probably has the relevant paper notes; flag for cross-check.
- Sutton’s claim that “we’re not seeing transfer anywhere” is load-bearing and empirical. Either it’s true (in which case the LLM moat is shakier than priced) or it’s false (in which case Sutton is overclaiming). Curiosity-skill candidate.
- “Voluntary rather than imposed change” appears in both the Ilya and Sutton episodes in the same week. Worth tracing — is this becoming a consensus alignment principle among senior figures?
Related
- ~/rdco-vault/06-reference/2026-04-19-dwarkesh-ilya-sutskever-age-of-research.md — companion episode, complementary-but-conflicting view
- ~/rdco-vault/02-strategy/positioning/harness-thesis.md — Sutton’s four-component frame is a harness-design checklist
- ~/rdco-vault/06-reference/transcripts/2026-04-19-dwarkesh-richard-sutton-rl-llm-dead-end-transcript.md — full transcript
- Sutton, The Bitter Lesson (2019) — original statement of the position that this podcast updates