Sat Apr 18 2026 · reference · source: Dwarkesh Patel YouTube · by Dwarkesh Patel, Richard Sutton
dwarkesh · richard-sutton · reinforcement-learning · llms-are-wrong · era-of-experience · world-models · transfer · turing-award · continual-learning · agi

Richard Sutton on Dwarkesh Patel — Father of RL Thinks LLMs Are a Dead End

Why this is in the vault

Sutton won the 2024 Turing Award, announced in March 2025. He is the most credentialed RL voice alive, and on this episode he flatly states that LLMs are not on the path to AI — they are imitation systems, not learning-from-experience systems, and the field has confused mimicry of humans with intelligence. Whether or not he’s right, this is the highest-status dissent against the LLM-scaling consensus in 2025–26, and it’s a citation we need on hand. Pairs head-to-head with the Ilya Sutskever episode (same processing cycle), which preserves the LLM substrate but argues for continual learning on top — making the two episodes a perfect “what comes after scaling” diptych.

Core argument

  1. LLMs imitate; they do not understand. A world model lets you predict what will happen. LLMs predict what a person would say. Sutton: “they have the ability to predict what a person would say. They don’t have the ability to predict what will happen.”
  2. The “good prior” defense of LLMs is wrong. The standard argument — “imitation gives a good prior, then we RL on top” — assumes the prior has the right shape. Sutton denies this: imitating people does not leave the model in a state from which a reward signal can refine it into a learning agent.
  3. Real intelligence requires four pieces, and LLMs have at most one: (a) a policy, (b) a value function (tracking whether reward is going up or down), (c) perception / state representation, (d) a transition model of the world, i.e. what will happen if I do this. The fourth is the missing one and the most important one; pre-training produces none of the four in the right form (see the sketch after this list).
  4. Era of experience. Continuing the lineage of his “Bitter Lesson,” Sutton argues the next phase of AI must learn from the actual stream of sensorimotor experience, not from a static text corpus. Reward is part of that stream but a small part; most learning is unsupervised modeling of “I did X, then Y happened.”
  5. Transfer is across states, not across tasks. Critique of the field: we set up benchmarks as separate tasks (chess, Go, Atari) and then complain RL doesn’t transfer. Wrong frame. A real agent has one world; “tasks” are just states in that world. The MuZero/AlphaZero family was deliberately not built for cross-task transfer; that’s an engineering limit, not a paradigmatic one.
  6. Continual learning is non-negotiable. The architecture must learn on the job, at every step, in the world it is deployed in. Period. (A minimal loop is sketched after this list.)
  7. Alignment via “high-integrity” upbringing, not value imposition. Strikingly similar to Ilya’s parenting analogy: don’t dictate ends; instill steerable values. Sutton frames it as voluntary-rather-than-imposed change at civilizational scale.
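
To make point 3 concrete, here is a minimal sketch of the four components in a tabular setting. Nothing in it comes from the episode: the names (`Agent`, `perceive`, `act`, `learn`) are illustrative, the environment is abstract, and the decomposition is textbook RL as Sutton states it.

```python
import random
from collections import defaultdict

class Agent:
    """Sutton's four components, tabular and deliberately minimal."""

    def __init__(self, actions, alpha=0.1, gamma=0.95, epsilon=0.1):
        self.actions = list(actions)
        self.alpha, self.gamma, self.epsilon = alpha, gamma, epsilon
        self.q = defaultdict(float)  # (a) policy, derived from action values
        self.v = defaultdict(float)  # (b) value function
        self.model = {}              # (d) transition model: (s, a) -> (s', r)

    def perceive(self, observation):
        # (c) perception / state representation. Identity here; a real agent
        # would build a compressed state from the raw sensorimotor stream.
        return observation

    def act(self, state):
        # (a) epsilon-greedy policy over the learned action values.
        if random.random() < self.epsilon:
            return random.choice(self.actions)
        return max(self.actions, key=lambda a: self.q[(state, a)])

    def learn(self, state, action, reward, next_state):
        # (b) TD(0) update: is the situation getting better or worse?
        self.v[state] += self.alpha * (
            reward + self.gamma * self.v[next_state] - self.v[state])
        # Q-learning update backing the policy.
        best_next = max(self.q[(next_state, a)] for a in self.actions)
        self.q[(state, action)] += self.alpha * (
            reward + self.gamma * best_next - self.q[(state, action)])
        # (d) transition model: "I did X, then Y happened."
        self.model[(state, action)] = (next_state, reward)
```

The `model` dict is where point 1 bites: it maps (state, action) to what happens next, which is exactly the thing next-token prediction never learns.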
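
And for points 4 and 6, the deployment loop Sutton insists on: no train-then-freeze split, every step updates the agent. The `env` object with a `reset()`/`step()` interface is an assumption of the sketch, not any particular library’s API.

```python
def run(agent, env):
    # Continual learning: the agent is always deployed and always learning.
    state = agent.perceive(env.reset())
    while True:
        action = agent.act(state)
        observation, reward, done = env.step(action)  # assumed signature
        next_state = agent.perceive(observation)
        agent.learn(state, action, reward, next_state)  # update every step
        state = agent.perceive(env.reset()) if done else next_state
```

Reward enters as one scalar per step; most of what accumulates is the unsupervised “X, then Y” structure in the model, which is the weighting point 4 makes.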

Mapping against RDCO

Open follow-ups