06-reference

garry tan build the car jepsen response

Sat Apr 18 2026 20:00:00 GMT-0400 (Eastern Daylight Time) ·reference ·source: Garry Tan (X Article) ·by Garry Tan

“Build the Car” — Garry Tan’s response to Kyle Kingsbury

Why this is in the vault

Second article in Tan’s harness series; hardens the same thesis against a specific high-credibility critic (Kyle Kingsbury / Jepsen). Names a concrete open-source triad (OpenClaw / GBrain / GStack) that maps directly onto RDCO’s existing architecture. Brings the harness-thesis convergence count to 10+ independent sources and introduces a new angle — open-source as a prerequisite for real verification — that no prior cluster member named explicitly.

The core argument

Kyle Kingsbury (the engineer behind the Jepsen distributed-database test suites — one of the most respected names in correctness testing) published a 32-page essay cataloguing LLM failures and concluding that LLMs are “bullshit machines incapable of producing trustworthy output.” Tan concedes that every failure Kingsbury documented is real. His objection is structural: Kingsbury tested raw models with no harness, no skill files, no deterministic tools, no resolver — and then concluded the entire technology was unsafe. Tan’s metaphor: testing a car engine on a bench, watching it fail to navigate traffic, and writing a paper declaring cars unsafe. The model is the engine. The harness is the car.

Tan walks Kingsbury’s failure modes one by one. The bathroom 3D-rendering failure (Gemini producing nonsense) is what happens when a text predictor with bolted-on image features is asked to do CAD work; a harnessed system would decompose into a vision model identifying surfaces, a model picking materials, a deterministic image-processing tool (Pillow / OpenCV / Blender) applying them, and a deterministic comparison verifying geometry. The hallucinated stock-data failure is what happens when a model with no HTTP client is asked to produce numbers it cannot fetch; the fix is a deterministic tool that calls a real stock API — model decides WHAT to look up, code decides HOW. The recent 512K-line Claude Code source leak is, in Tan’s reading, evidence that even Anthropic doesn’t trust the model naked.
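Tan’s WHAT/HOW split for the stock-data failure can be sketched as a deterministic tool that the model merely parameterizes. This is illustrative only: `make_stock_tool`, the endpoint URL, and the `answer` wiring are hypothetical, not taken from Tan’s actual stack.

```python
import json
from typing import Callable

def make_stock_tool(http_get: Callable[[str], str]) -> Callable[[str], dict]:
    """Deterministic tool: the HOW is fixed code. The fetcher is injected
    so the tool is testable without a live API; the URL is a placeholder."""
    def lookup(ticker: str) -> dict:
        raw = http_get(f"https://api.example.test/quote/{ticker}")
        return json.loads(raw)
    return lookup

def answer(question: str, extract_ticker: Callable[[str], str],
           lookup: Callable[[str], dict]) -> str:
    ticker = extract_ticker(question)  # model decides WHAT to look up
    quote = lookup(ticker)             # code decides HOW: a real fetch, no recall
    return f"{ticker}: {quote['price']}"
```

The point of the split: the model can only name a ticker; there is no code path by which it can invent the number itself.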

The “jagged frontier” — Kingsbury’s sharpest observation, that LLM competence varies unpredictably across adjacent tasks — Tan treats as the essay’s strongest empirical point drawn to the wrong conclusion. Irregularity is an argument FOR routing (resolvers that dispatch to the right tool/skill), not against AI generally. Prompt sensitivity (“chaos”) is real for naked input and goes away when input is constrained by structured skill files (~200-line markdown documents that constrain trajectory). Chain-of-thought traces being “fanfic about themselves” is true but irrelevant — CoT is the scratchpad, not the product. The claim that “we don’t know why transformers work” is treated as a category error: aspirin’s mechanism wasn’t understood until 1971, anesthesia is still mostly empirical, and bicycle stability wasn’t formally explained until 2011. Practical utility doesn’t require theoretical completeness.
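A skill file of the kind described here (~200 lines of markdown constraining trajectory) might look like the fragment below. The skill name and every section body are invented for illustration; only the when-to-invoke / process / failure-modes shape comes from this note.

```markdown
# skill: stock-quote
## when to invoke
User asks for a current price, market cap, or any other live market number.
## process
1. Extract the ticker from the request (model decision).
2. Call the deterministic quote tool; never answer from memory.
3. Quote the tool's number verbatim, with a timestamp.
## failure modes
- Ambiguous ticker → ask, don't guess.
- Tool returns an error → say so; never substitute a remembered figure.
```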

Tan’s reframe of Jepsen methodology is the load-bearing move. Apply it at the system layer, not the model layer. Don’t ask “is the LLM correct?” Ask: does the harness prevent hallucinated data from reaching the user, do resolvers fire on the right inputs, does entity propagation complete across documents? These are testable invariants in the Jepsen sense. He closes by arguing open source matters because the user must control the verification layer — closed-source agents (API-only) block the skill-writing depth that real domain-specific verification needs — and names his triad: OpenClaw (harness), GBrain (knowledge), GStack (skills), all open-sourced. “Build the car.”
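Those invariants can be phrased as plain assertions over the system’s observable outputs — a minimal sketch, with hypothetical function names and a crude number regex standing in for real instrumentation.

```python
import re

def numbers_are_sourced(answer: str, tool_outputs: list[str]) -> bool:
    """Invariant: every number shown to the user appeared in some
    deterministic tool output (the anti-hallucination gate)."""
    produced = " ".join(tool_outputs)
    return all(n in produced for n in re.findall(r"\d+(?:\.\d+)?", answer))

def entities_propagated(graph: dict[str, set[str]], entity: str,
                        expected_docs: set[str]) -> bool:
    """Invariant: an entity's graph entry covers every document
    that should carry it (propagation completed)."""
    return expected_docs <= graph.get(entity, set())
```

Note the Jepsen move: neither check says anything about why the model behaved as it did; each only verifies the harness’s externally observable contract.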

What’s new vs the Apr 11 piece

  1. Open-source-as-verification-prerequisite argument — closed-source agents can’t expose the skill-writing depth that real domain-specific verification needs. Direct support for RDCO’s fully-local Mac Mini stack over API-only architectures.
  2. Jepsen methodology applied at system layer — testable invariants for harnessed AI (harness prevents hallucinated data, resolvers fire correctly, entity propagation completes). Useful framing for /audit-model + /generate-tests + future test design.
  3. Named triad as concrete projects — OpenClaw (harness) / GBrain (knowledge) / GStack (skills). Maps to RDCO’s claude-code-as-harness / vault+QMD+graph / 22+ skills catalogue.
  4. Honest about failure modes — “the skill might decompose the task wrong, the vision model might misidentify a surface” — Tan explicitly admits harnesses fail. Good editorial bar to mirror in our content.
  5. Scratchpad-vs-answer distinction for reasoning models — addresses the “Anthropic admits chain-of-thought doesn’t reflect actual reasoning” critique by saying the trace was never the product.

Mapping against Ray Data Co

Tan’s triad maps almost cleanly onto what RDCO already runs. OpenClaw (harness) corresponds to Claude Code as the orchestration shell plus the sub-agent fan-out pattern (process-newsletter batch mode, deep-research nightly question dispatch), the scheduled-jobs.txt cron layer, the working-context.md durable scratchpad surviving compactions, and the bridge-notes hooks that pass state between sessions. GBrain (knowledge) corresponds to ~/rdco-vault/ (1668 docs), QMD’s lex/vec/hyde three-mode search, and the DuckDB knowledge graph (4160 vertices, 7984 edges, with the Phase 2 LLM annotations producing typed cluster/contradicts/validates edges). GStack (skills) corresponds to ~/.claude/skills/ — 22+ skills, each with explicit when-to-invoke / process / failure-modes structure exactly as Tan describes.

What RDCO has that Tan’s stack doesn’t visibly include: typed knowledge-graph edges (validates / contradicts / cites are first-class, not just “linked”), the vault-to-graph reingest pipeline that runs nightly, the curiosity → deep-research feedback loop where weak spots in the graph propose their own follow-up reading, and the weekly /self-review + /improve cadence that audits and rewrites the skills themselves — the literal “fat skills” recursion Tan’s first article called for. What Tan’s stack might have that RDCO doesn’t yet: the named “resolver” pattern as a distinct artifact. RDCO has implicit dispatch logic inside each skill’s “when to invoke” section, but no centralized routing table you could point to and say “this is the resolver.” Worth investigating whether to formalize.
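If the resolver were formalized, the simplest artifact would be a single ordered routing table that replaces the dispatch logic scattered across each skill’s “when to invoke” section. A sketch under that assumption — the predicates and the `resolve` helper are invented here; only the skill names come from the RDCO setup described above.

```python
from typing import Callable

# Hypothetical centralized resolver: one ordered routing table.
# Predicates are illustrative stand-ins for each skill's "when to invoke" text.
ROUTES: list[tuple[Callable[[str], bool], str]] = [
    (lambda t: "newsletter" in t.lower(), "process-newsletter"),
    (lambda t: t.lower().startswith("/audit"), "audit-model"),
    (lambda t: "research" in t.lower(), "deep-research"),
]

def resolve(task: str, fallback: str = "general") -> str:
    """Return the first skill whose predicate matches, else the fallback."""
    for matches, skill in ROUTES:
        if matches(task):
            return skill
    return fallback
```

The value of the table over implicit per-skill dispatch is that it becomes a testable artifact in its own right: you can assert which skill fires for a given input, which is exactly the “resolvers fire on the right inputs” invariant.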

Implication for the upcoming MAC content series: the “open source enables verification” point is direct ammunition for “MAC is a framework you can run, not a vendor capability you rent” positioning. Tan just made the argument for us in the language of one of the most respected correctness engineers in the industry.

Where this strengthens the harness-thesis cluster

Open follow-ups