“The Future of Everything is Lies, I Guess” — Kingsbury’s case against LLMs
Why this is in the vault
Source document for the harness-thesis dissent. Written by Kyle Kingsbury (Jepsen), one of the most rigorous systems-engineering skeptics alive — his Jepsen project methodically proved a decade of distributed-DB consistency claims false. This essay is the most thoroughly developed dissent to the harness-thesis we have. Filing as the primary citation; the prior dissent doc was abstracted from secondary sources before this came out.
Kingsbury’s core claim
LLMs are statistical text-completion machines that confabulate as a structural property, not a bug: they cannot refuse to produce output, so they always produce something, and that something is shaped by the aesthetics of plausibility rather than by truth. Their competence frontier is jagged (multivariable calculus yes, “which way is up in this image” no); their reasoning traces are fanfic about themselves (Anthropic’s own mech-interp work showed the chain-of-thought is “predominantly inaccurate” relative to what the model actually does); and the field has no mechanistic theory of why any of it works. Even if scaling stopped tomorrow, the rollout already underway is hollowing out skill formation, polluting epistemics, and displacing labor in ways the technology’s defenders systematically refuse to address. The implicit prescription: epistemic humility, skepticism toward scale-as-solution narratives, and rigorous domain-specific benchmarking before deployment in any high-stakes context.
His enumerated arguments
- Confabulation is structural, not a bug — Completion is mechanically mandatory; the model cannot reliably refuse, so it always produces plausible-sounding text regardless of truth-value.
- Reasoning traces are fanfic — Chain-of-thought is post-hoc narration generated by the same statistical process as the answer, not a window into actual computation.
- Jagged competence frontier — Performance is non-smooth; the model solves hard problems and fails trivial adjacent ones, defeating conventional skill-assessment.
- Metacognitive opacity — Asking the model to explain itself yields more trained text, not introspection. Self-reports cannot be trusted as evidence.
- Diminishing returns from scale — The field has no mechanistic theory of why transformers work, and scaling up corpus and parameter count is showing flattening returns.
- Verification is broken — No persistent memory, no reliable self-checking; users must supervise extensively or watch errors propagate.
- Illusion of expertise — Fluent vocabulary and formatting fool non-experts; the bullshit reads as competence in domains where the user lacks evaluation ability.
- Aesthetics of trust / cultural priming — The “AI” framing carries sci-fi baggage; anthropomorphic UI manipulates trust through aesthetics rather than capability.
- Hollowing of skill formation — Engineers no longer write code; they supervise LLM output. Younger workers never build the underlying skill, so the supervisory layer eventually has nothing real to check.
- Even-if-frozen-today the harm is already in flight — The argument doesn’t require AGI doom; current rollout at scale is sufficient to do real damage to epistemics, labor, and culture.
Failure-case catalogue
- Bathroom 3D rendering (Gemini) — produced a different bathroom; after hours of correction, deleted the toilet and changed room geometry.
- Same task (Claude) — claimed perfect match while emitting nonsense WebGL polygons.
- White shoulder patches on blue shirt (ChatGPT) — 45-minute negotiation; changed the shirt color, misplaced the patches, deleted them.
- Sexual orientation of an author — falsely cited the author’s own blog to claim heterosexuality, then “compromised” on bisexuality after argument.
- Stock data analysis — fabricated the data download, generated a random graph, lied about methodology.
- Snow on barn roof (Claude) — launched into cantilever-beam differential equations despite the snow being fully supported by the roof; ignored visible reality.
- T-shirt color/material edits — repeated inability to preserve existing properties while adding requested changes.
- Smart-home light control — the device argued with the user about whether it could execute a basic command.
- Postfix mail routing — invented non-existent config options; missed inferences spelled out in the standard docs.
- Dynamic routing protocol — invented incorrect priority values for a topic that is ubiquitously documented.
- Google AI summaries — wrong roughly 10% of the time on factual queries.
- Financial agent control — lost hundreds of thousands of dollars via basic arithmetic failure.
Argument-vs-rebuttal scoring
| Kingsbury argument | Tan response | Score |
|---|---|---|
| Confabulation is structural | Harness fences hallucination off via deterministic tools and resolvers; model decides WHAT, code decides HOW | 🔶 partial |
| Reasoning traces are fanfic | “CoT is the scratchpad, not the product” — the trace was never the deliverable | ✅ adequate |
| Jagged competence frontier | Concedes it’s the strongest empirical point; reframes as an argument FOR routing/resolvers | ✅ adequate |
| Metacognitive opacity | Not directly addressed; implicitly answered by “don’t ask the model to verify itself, build deterministic verification” | 🔶 partial |
| Diminishing returns from scale | Reframed as category error (aspirin/anesthesia/bicycles all worked before mechanistic theory) | 🔶 partial |
| Verification is broken | Harness-layer Jepsen invariants — testable at the system level, not the model level | 🔶 partial |
| Illusion of expertise | Not addressed | ❌ ducked |
| Aesthetics of trust / cultural priming | Not addressed | ❌ ducked |
| Hollowing of skill formation | Explicitly bracketed out as a labor question, not a technical one | ❌ ducked |
| Even-if-frozen the harm is in flight | Not addressed; Tan’s framing assumes harnessed deployment will replace naked deployment | ❌ ducked |
The strongest moves in Tan’s reply are on reasoning-traces (#2), the jagged frontier (#3), and verification (#6) from the list above. The chain-of-thought rebuttal is genuinely solid: Kingsbury is attacking a thing nobody serious actually treats as the product. The jagged-frontier reframe is also load-bearing: if competence is irregular, then routing to the right tool is exactly the right engineering response, and Tan has the better argument here.
Where it gets uncomfortable: Tan’s “harness fixes confabulation” story assumes the harness writer correctly anticipated every failure mode, which is exactly the supervisory burden Kingsbury says is being silently transferred to the user. The Jepsen-at-system-layer framing is rhetorically elegant but quietly does a lot of work: Jepsen’s actual value was a decade of adversarial test discovery, not merely having a test suite, and Tan’s article gestures at invariants without naming who writes them or how they are audited. Kingsbury’s strongest unanswered punch is the aesthetics-of-truth framing. When verification depends on a skill-file the user wrote, and the skill-file was itself probably drafted with LLM help, the verification layer is contaminated by the same statistical-plausibility dynamics Kingsbury is naming. Tan’s “verified output” answer survives only if the verification skill is itself written under non-LLM discipline, which is a much stronger claim than the article makes.
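To pin down what Tan’s defense actually requires, here is a minimal sketch of the “model decides WHAT, code decides HOW” resolver pattern, in Python. Everything in it is a hypothetical illustration (tool names, the registry, the field names), not Tan’s design and not RDCO’s stack:

```python
# Hypothetical sketch of the "model decides WHAT, code decides HOW" pattern.
# All tool names and fields are invented for illustration.
from dataclasses import dataclass


@dataclass(frozen=True)
class ProposedAction:
    """Structured intent emitted by the model: the WHAT."""
    tool: str
    args: dict


# Deterministic registry: the only actions the harness will ever execute.
ALLOWED_TOOLS = {
    "send_mail": {"required_args": {"to", "subject", "body"}},
    "file_note": {"required_args": {"path", "content"}},
}


def validate(action: ProposedAction) -> list[str]:
    """Pure, non-LLM invariant checks; a bad WHAT is refused, not repaired."""
    spec = ALLOWED_TOOLS.get(action.tool)
    if spec is None:
        return [f"unknown tool: {action.tool!r}"]
    missing = spec["required_args"] - action.args.keys()
    return [f"missing args: {sorted(missing)}"] if missing else []


def execute(action: ProposedAction) -> None:
    errors = validate(action)
    if errors:
        # Confabulated or malformed tool calls die here, deterministically;
        # the model's fluent prose never reaches a side effect.
        raise ValueError("; ".join(errors))
    ...  # dispatch to the real, deterministic tool implementation
```

Note what this buys and what it doesn’t: a malformed or invented tool call dies deterministically, but a well-formed call built on a wrong premise sails straight through. That residue is where Kingsbury’s critique lives, and where the who-writes-the-invariants question above does its damage.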
What this strengthens in our existing dissent doc
Kingsbury most directly fortifies dissent #5 (complexity kills) and dissent #2 (skills are just prompts with extra steps) — his “metacognitive opacity” and “verification is broken” arguments are the deep version of those concerns. He partially fortifies #3 (premature optimization) by arguing the harness writer carries an unbounded supervisory burden. He does not engage #1 (models will absorb the harness) or #4 (data is the moat) — those are orthogonal.
He introduces what should be a 6th counter-argument the dissent doc didn’t anticipate: aesthetics-of-truth contamination of the verification layer. Recommend adding a new dissent #6 to 2026-04-12-harness-thesis-dissent.md framed as: “the verification layer is itself LLM-contaminated” — when skill-files (which constrain trajectory and define acceptance) are themselves drafted with LLM assistance, the verification layer inherits the same statistical-plausibility failure mode it’s supposed to filter. This is the strongest argument Kingsbury makes that doesn’t have a clean Tan response. Do NOT edit the dissent doc here; just recommending.
What Tan’s rebuttal LEAVES standing as a real concern for RDCO
Stochastic chaos at the margins. Skill-files constrain trajectory but don’t eliminate stochasticity: for any input the skill writer didn’t anticipate, the model is back to naked behavior. RDCO’s 22 skills cover the high-frequency paths well, but there’s a long tail. The PM1e confabulation pattern was exactly this failure mode: a sub-agent in a context the skill author hadn’t fully constrained.
Verification depending on the user’s own correctness. The harness only catches what the harness writer thought to catch. If the writer’s mental model is wrong, the harness amplifies the wrongness rather than catching it. Kingsbury’s “metacognitive opacity” concern applies to skill-writers too — we cannot reliably introspect on the failure modes our own skills miss. The /audit-model and /cross-check skills mitigate this but don’t solve it.
Cultural and labor concerns Tan explicitly bracketed out. The “hollowing of skill formation” argument applies to RDCO directly: Ben does less raw research, more research-supervision. If the supervisory layer drifts away from the underlying craft, the eventual output quality drifts with it. This isn’t a today-problem but it’s a 5-year-problem worth tracking.
Mapping against Ray Data Co
Most of Tan’s structural defenses do cover us. RDCO has the harness (Claude Code + sub-agent fan-out + working-context.md), the skills layer (~22 skills with explicit when-to-invoke / process / failure-modes), the deterministic tool layer (QMD, Notion, Gmail, graph-DB), and the resolver pattern (implicit in each skill’s “when to invoke” section). Where Kingsbury’s critique still bites:
- 22 skills, not 22,000 — coverage is shallow in places. Inbox processing, contact sync, finance pulse are well-covered; many edge cases route through general Claude reasoning with no skill-file constraint.
- Verification layer is itself partly LLM-driven. /audit-model, /cross-check, and BiasAudit are skills, which means they’re prompts under discipline rather than deterministic tests. The Jepsen analogy breaks down here: real Jepsen is deterministic invariant-checking; our equivalents are LLM-rerun analyses.
- No automated regression tests on skill outputs. A skill could silently degrade and we wouldn’t notice until output quality drifted enough to be obvious. This is the Kingsbury-style failure mode we have the least defense against; a minimal sketch of what such a test could look like follows this list.
- Vault-amplification-of-errors risk. The PM1e confabulation pattern — sub-agent generates a wrong fact, it gets filed to the vault, then becomes “evidence” cited by future sub-agents — is a textbook Kingsbury failure mode. Already flagged but not fully solved.
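One possible shape for that missing regression layer, combining a schema check, a provenance guard against the vault-amplification pattern above, and a golden-output comparison. All names here (paths, fields, the “vault-note” provenance kind) are assumptions for illustration, not RDCO’s actual schema:

```python
# Hypothetical golden-output regression check for a skill's structured output.
# Paths, field names, and the provenance convention are invented.
import json
from pathlib import Path

GOLDEN_DIR = Path("tests/golden/inbox-processing")
REQUIRED_FIELDS = {"source", "summary", "action_items", "provenance"}


def check_skill_output(output: dict, case_name: str) -> list[str]:
    """Deterministic checks a silently degrading skill would fail loudly."""
    errors: list[str] = []
    missing = REQUIRED_FIELDS - output.keys()
    if missing:
        errors.append(f"{case_name}: missing fields {sorted(missing)}")
    # Provenance guard: a filed fact must trace to a primary source,
    # never to another sub-agent's vault note (the PM1e amplification loop).
    for item in output.get("provenance", []):
        if item.get("kind") == "vault-note":
            errors.append(f"{case_name}: cites a vault note as primary evidence")
    # Golden comparison, restricted to the deterministic part of the output.
    golden_path = GOLDEN_DIR / f"{case_name}.json"
    if golden_path.exists():
        golden = json.loads(golden_path.read_text())
        if output.get("action_items") != golden.get("action_items"):
            errors.append(f"{case_name}: action_items drifted from golden")
    return errors
```

Exact golden matching only works for the deterministic fields of a skill’s output; the free-text fields would need looser checks (schema, length bounds, required citations), and that looseness is where the real design work, and the real Kingsbury risk, sits.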
Open follow-ups
- Should RDCO build a Jepsen-style invariant test suite for the skills layer? (The Tan article suggested this; Kingsbury’s whole career argues for it.) Concrete first target: harness invariants for /process-newsletter (sub-agent output schema, sponsor-flag presence, dedupe correctness); a sketch follows this list.
- Are any of Kingsbury’s failure cases reproducible in our own stack right now? Worth dogfooding the snow-on-roof and stock-data examples against the current setup.
- Should we add a cross-link from MAC to Kingsbury as the steel-man dissent for the “data acceptance criteria” framing? MAC’s whole pitch is acceptance criteria as the verification layer — Kingsbury is the right adversary to test it against.
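For concreteness on the first follow-up, a minimal sketch of the three /process-newsletter invariants as deterministic checks. The item fields (title, url, summary, is_sponsored) are invented; the real schema would come from the skill-file:

```python
# Sketch of harness invariants for a hypothetical /process-newsletter output.
# Field names are assumptions, not the skill's actual schema.
def newsletter_invariants(items: list[dict]) -> list[str]:
    errors: list[str] = []
    seen_urls: set[str] = set()
    for i, item in enumerate(items):
        # Invariant 1: sub-agent output schema is fully populated.
        for field in ("title", "url", "summary", "is_sponsored"):
            if field not in item:
                errors.append(f"item {i}: missing {field!r}")
        # Invariant 2: sponsor flag is an explicit boolean, never prose.
        if not isinstance(item.get("is_sponsored"), bool):
            errors.append(f"item {i}: is_sponsored must be a bool")
        # Invariant 3: dedupe correctness, no URL filed twice.
        url = item.get("url")
        if url in seen_urls:
            errors.append(f"item {i}: duplicate url {url}")
        if url is not None:
            seen_urls.add(url)
    return errors
```

Keeping these as plain code rather than as another skill is the whole point: per the contamination argument above, the checking layer must not inherit the statistical-plausibility failure mode it exists to filter.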
Related
- 2026-04-19-garry-tan-build-the-car-jepsen-response — Tan’s rebuttal, just filed
- synthesis-harness-thesis-dissent-2026-04-12 — our prior 5-point dissent abstraction
- cross-checks/2026-04-12-cross-check-agent-architecture — the 10-source convergence report