06-reference

kingsbury future of everything is lies

2026-04-18 · reference · source: aphyr.com / Kyle Kingsbury (Jepsen) · by Kyle Kingsbury

“The Future of Everything is Lies, I Guess” — Kingsbury’s case against LLMs

Why this is in the vault

Source document for the harness-thesis dissent. Written by Kyle Kingsbury (Jepsen), one of the most rigorous systems-engineering skeptics alive — his Jepsen project methodically proved a decade of distributed-DB consistency claims false. This essay is the most thoroughly developed dissent to the harness-thesis we have. Filing as the primary citation; the prior dissent doc was abstracted from secondary sources before this came out.

Kingsbury’s core claim

LLMs are statistical text-completion machines that confabulate as a structural property, not a bug — they cannot refuse to produce output, so they always produce something, and that something is shaped by aesthetics of plausibility rather than truth. Their competence frontier is jagged (multivariable calculus yes, “which way is up in this image” no), their reasoning traces are fanfic about themselves (Anthropic’s own mech-interp work showed the chain-of-thought is “predominantly inaccurate” relative to what the model actually does), and the field has no mechanistic theory of why any of it works. Even if scaling stopped tomorrow, the rollout is already hollowing out skill formation, polluting epistemics, and displacing labor in ways the technology’s defenders systematically refuse to address. The implicit prescription is epistemic humility, skepticism toward scale-as-solution narratives, and rigorous domain-specific benchmarking before deployment in any high-stakes context.

His enumerated arguments

  1. Confabulation is structural, not a bug — Completion is mechanically mandatory; the model cannot reliably refuse, so it always produces plausible-sounding text regardless of truth-value.
  2. Reasoning traces are fanfic — Chain-of-thought is post-hoc narration generated by the same statistical process as the answer, not a window into actual computation.
  3. Jagged competence frontier — Performance is non-smooth; the model solves hard problems and fails trivial adjacent ones, defeating conventional skill-assessment.
  4. Metacognitive opacity — Asking the model to explain itself yields more trained text, not introspection. Self-reports cannot be trusted as evidence.
  5. Diminishing returns from scale — The field has no mechanistic theory of why transformers work, and bigger corpus + more parameters is showing flattening returns.
  6. Verification is broken — No persistent memory, no reliable self-checking; users must supervise extensively or watch errors propagate.
  7. Illusion of expertise — Fluent vocabulary and formatting fool non-experts; the bullshit reads as competence in domains where the user lacks evaluation ability.
  8. Aesthetics of trust / cultural priming — The “AI” framing carries sci-fi baggage; anthropomorphic UI manipulates trust through aesthetics rather than capability.
  9. Hollowing of skill formation — Engineers no longer write code, they supervise LLM output. Younger workers never build the underlying skill, so the supervisory layer eventually has nothing real to check.
  10. Even-if-frozen-today the harm is already in flight — The argument doesn’t require AGI doom; current rollout at scale is sufficient to do real damage to epistemics, labor, and culture.

Failure-case catalogue

Argument-vs-rebuttal scoring

| Kingsbury argument | Tan response | Score |
| --- | --- | --- |
| Confabulation is structural | Harness fences hallucination off via deterministic tools and resolvers; model decides WHAT, code decides HOW | 🔶 partial |
| Reasoning traces are fanfic | “CoT is the scratchpad, not the product” — the trace was never the deliverable | ✅ adequate |
| Jagged competence frontier | Concedes it’s the strongest empirical point; reframes as an argument FOR routing/resolvers | ✅ adequate |
| Metacognitive opacity | Not directly addressed; implicitly answered by “don’t ask the model to verify itself, build deterministic verification” | 🔶 partial |
| Diminishing returns from scale | Reframed as category error (aspirin/anesthesia/bicycles all worked before mechanistic theory) | 🔶 partial |
| Verification is broken | Harness-layer Jepsen invariants — testable at the system level, not the model level | 🔶 partial |
| Illusion of expertise | Not addressed | ❌ ducked |
| Aesthetics of trust / cultural priming | Not addressed | ❌ ducked |
| Hollowing of skill formation | Explicitly bracketed out as a labor question, not a technical one | ❌ ducked |
| Even-if-frozen the harm is in flight | Not addressed; Tan’s framing assumes harnessed deployment will replace naked deployment | ❌ ducked |

The strongest moves in Tan’s reply are on (2), (3), and (6). The chain-of-thought rebuttal is genuinely solid: Kingsbury is attacking a thing nobody serious actually treats as the product. The jagged-frontier reframe is also load-bearing — if competence is irregular, then routing to the right tool is exactly the right engineering response, and Tan has the better argument here.

Where it gets uncomfortable: Tan’s “harness fixes confabulation” story assumes the harness writer correctly anticipated every failure mode, which is exactly the supervisory burden Kingsbury says is being silently transferred to the user. The Jepsen-at-system-layer framing is rhetorically elegant but quietly does a lot of work — Jepsen’s actual value was a decade of adversarial test discovery, not just having a test suite, and Tan’s article gestures at invariants without naming who writes them or how they’re audited. Kingsbury’s strongest unanswered punch is the aesthetics-of-truth framing: when verification depends on a skill-file the user wrote, and the skill-file was itself probably drafted with LLM help, the verification layer is contaminated by the same statistical-plausibility dynamics Kingsbury is naming. Tan’s “verified output” answer survives only if the verification skill is itself written under non-LLM discipline, which is a much stronger claim than the article makes.
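The "model decides WHAT, code decides HOW" pattern and the harness-layer-invariants claim can be made concrete with a minimal sketch. This is an illustration, not RDCO's actual implementation; all names (`validate_proposal`, `ALLOWED_ACTIONS`, the digest invariant) are hypothetical. The point it shows is also the point of vulnerability Kingsbury names: the invariants only catch what the harness writer thought to encode.

```python
# Sketch: LLM output treated as an untrusted proposal; a deterministic
# validator gates execution against hand-written invariants.
# All action names and invariants here are illustrative, not from RDCO.

ALLOWED_ACTIONS = {"search_notes", "create_note", "send_digest"}


def validate_proposal(proposal: dict) -> dict:
    """Deterministic gate: reject anything outside the declared contract."""
    action = proposal.get("action")
    if action not in ALLOWED_ACTIONS:
        raise ValueError(f"unknown action: {action!r}")
    args = proposal.get("args", {})
    if not isinstance(args, dict):
        raise ValueError("args must be a mapping")
    # Invariant (hypothetical): a digest must cite at least one source note.
    # Note the supervisory burden: this line exists only because someone
    # anticipated the failure mode. Unanticipated modes pass straight through.
    if action == "send_digest" and not args.get("source_notes"):
        raise ValueError("digest without sources violates invariant")
    return {"action": action, "args": args}


# The model emits the WHAT; validated, deterministic code performs the HOW.
proposal = {"action": "send_digest", "args": {"source_notes": ["note-42"]}}
validated = validate_proposal(proposal)
print(validated["action"])  # send_digest
```

The sketch makes Kingsbury's unanswered point visible: the `ValueError` branches are themselves authored artifacts, subject to the same contamination if drafted with LLM help.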

What this strengthens in our existing dissent doc

Kingsbury most directly fortifies dissent #5 (complexity kills) and dissent #2 (skills are just prompts with extra steps) — his “metacognitive opacity” and “verification is broken” arguments are the deep version of those concerns. He partially fortifies #3 (premature optimization) by arguing the harness writer carries an unbounded supervisory burden. He does not engage #1 (models will absorb the harness) or #4 (data is the moat) — those are orthogonal.

He introduces what should be a 6th counter-argument the dissent doc didn’t anticipate: aesthetics-of-truth contamination of the verification layer. Recommend adding a new dissent #6 to 2026-04-12-harness-thesis-dissent.md framed as: “the verification layer is itself LLM-contaminated” — when skill-files (which constrain trajectory and define acceptance) are themselves drafted with LLM assistance, the verification layer inherits the same statistical-plausibility failure mode it’s supposed to filter. This is the strongest argument Kingsbury makes that doesn’t have a clean Tan response. Do NOT edit the dissent doc here; just recommending.

What Tan’s rebuttal LEAVES standing as a real concern for RDCO

Stochastic chaos at the margins. Skill-files constrain the trajectory but don’t eliminate the stochasticity — for any input the skill writer didn’t anticipate, the model is back to naked behavior. RDCO’s 22 skills cover the high-frequency paths well, but there’s a long tail. The PM1e confabulation pattern was exactly this failure mode: a sub-agent in a context the skill author hadn’t fully constrained.
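The long-tail shape of this failure mode can be sketched as a resolver table: explicit predicates route matched inputs to skills, and everything unmatched falls through to the unconstrained path. The skill names and predicates below are hypothetical, not RDCO's real resolver.

```python
# Sketch of a skill resolver with an explicit fallback. The fallthrough
# branch IS the long tail: any query no predicate claims reverts to
# naked model behavior. Names and predicates are hypothetical.
from typing import Callable, List, Optional, Tuple

SKILL_TABLE: List[Tuple[Callable[[str], bool], str]] = [
    (lambda q: q.startswith("summarize"), "summarize-skill"),
    (lambda q: "cite" in q, "citation-skill"),
]


def resolve(query: str) -> Optional[str]:
    """Return the matching skill name, or None for the unconstrained path."""
    for predicate, skill in SKILL_TABLE:
        if predicate(query):
            return skill
    return None  # long tail: no skill-file constrains this trajectory


print(resolve("summarize this essay"))  # summarize-skill
print(resolve("what's the weather"))    # None -> naked behavior
```

The design point: the resolver makes the coverage boundary inspectable, but it cannot make it complete — which is exactly the PM1e pattern above.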

Verification depending on the user’s own correctness. The harness only catches what the harness writer thought to catch. If the writer’s mental model is wrong, the harness amplifies the wrongness rather than catching it. Kingsbury’s “metacognitive opacity” concern applies to skill-writers too — we cannot reliably introspect on the failure modes our own skills miss. The /audit-model and /cross-check skills mitigate this but don’t solve it.

Cultural and labor concerns Tan explicitly bracketed out. The “hollowing of skill formation” argument applies to RDCO directly: Ben does less raw research, more research-supervision. If the supervisory layer drifts away from the underlying craft, the eventual output quality drifts with it. This isn’t a today-problem but it’s a 5-year-problem worth tracking.

Mapping against Ray Data Co

Most of Tan’s structural defenses do cover us. RDCO has the harness (Claude Code + sub-agent fan-out + working-context.md), the skills layer (~22 skills with explicit when-to-invoke / process / failure-modes), the deterministic tool layer (QMD, Notion, Gmail, graph-DB), and the resolver pattern (implicit in each skill’s “when to invoke” section). Where Kingsbury’s critique still bites:

Open follow-ups