06-reference

addy osmani agent harness engineering

Sat May 09 2026, 20:00 ET · reference · source: X long-form article by @addyosmani · by Addy Osmani (Director, Google Cloud AI)
agent-harness-engineering · harness-as-discipline · ratchet-pattern · context-rot · hooks-enforcement · claude-md · mcp-tool-design · vocabulary-upgrade · rdco-validation

“Agent Harness Engineering” — @addyosmani (Addy Osmani)

Why this is in the vault

Founder shared 2026-05-09 ~23:07 ET (or 2026-05-10 early AM) without comment. This piece puts a name on what RDCO has been doing, unlabeled, for nine months. Osmani is consolidating a discipline coined this week by @Vtrivedy10 (“harness engineering”), with @dexhorthy / HumanLayer, Anthropic, and Birgitta Böckeler all converging on the same idea. The bookmark-to-like ratio of 2.25:1 (3,497 bookmarks / 1,556 likes on 229k impressions) is the strongest practitioner-save signal we’ve seen on an agent-architecture piece this month: engineers are filing this for reference, not just liking it.

This is the third same-week piece converging on integration-as-moat, each at a different layer (alongside Tobi’s Shopify River memo and Avedissian’s loop-is-moat piece; see the convergence section below).

Same insight, different altitude. Harness engineering is the most rigorous treatment because it’s the most operationally specific.

The core argument

Agent = Model + Harness. A raw model is not an agent. It only becomes one when a harness provides state, tool execution, feedback loops, and enforceable constraints. If you’re not the model, you’re the harness.

The harness is everything that isn’t the model: system prompts, CLAUDE.md / AGENTS.md / skill files, subagent instructions, tools, MCP servers, sandboxes, headless browsers, orchestration logic, hooks, middleware, observability. Massive surface area — but it’s your surface area, not the model provider’s.

A decent model with a great harness consistently beats a great model with a bad harness. The gap between what today’s models can theoretically do and what you actually see them doing is largely a harness gap.
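To make the division of labor concrete, here is a minimal, self-contained sketch of that split. Every name in it is invented for illustration, and the ToyModel stands in for any completion API: the model only proposes the next action, while the harness owns the state, the loop, tool execution, and the step budget.

```python
from dataclasses import dataclass, field

@dataclass
class Action:
    tool: str = ""                      # empty tool name means "finish"
    args: dict = field(default_factory=dict)
    result: str = ""

class ToyModel:
    """Stand-in for any completion API; a real harness would call an LLM."""
    def propose(self, transcript):
        # Pretend reasoning: read a file once, then finish.
        if any(m["role"] == "tool" for m in transcript):
            return Action(result="summary written")
        return Action(tool="read_file", args={"path": "README.md"})

def run_agent(model, tools, task, max_steps=10):
    transcript = [{"role": "user", "content": task}]        # harness-owned state
    for _ in range(max_steps):                              # harness-owned loop
        action = model.propose(transcript)                  # the only model call
        if not action.tool:
            return action.result
        obs = tools[action.tool](**action.args)             # tool execution
        transcript.append({"role": "tool", "content": obs})  # feedback loop
    raise RuntimeError("step budget exhausted")             # enforceable constraint

print(run_agent(ToyModel(),
                {"read_file": lambda path: f"<contents of {path}>"},
                "summarize the repo"))
```

Everything in that function body except the `model.propose` call is harness: yours to own, instrument, and ratchet.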

The Ratchet — every mistake becomes a rule

The most vital habit: treat agent mistakes as permanent signals, not one-off flukes. If the agent ships a PR with a commented-out test, the next AGENTS.md must say “Never comment out tests; delete or fix them.” The pre-commit hook flags .skip( in the diff. The reviewer subagent updates to block it.
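As a concrete instance of the ratchet, here is a sketch of that pre-commit hook. The .skip( pattern and the rule text come from the example above; the script itself is an assumed implementation, not taken from the article.

```python
#!/usr/bin/env python3
"""Ratchet example: a pre-commit hook earned by one observed failure."""
import subprocess
import sys

# Staged changes only; --unified=0 keeps unchanged context lines out.
diff = subprocess.run(
    ["git", "diff", "--cached", "--unified=0"],
    capture_output=True, text=True, check=True,
).stdout

# Only inspect added lines, so pre-existing skips elsewhere don't block.
added = [l for l in diff.splitlines()
         if l.startswith("+") and not l.startswith("+++")]
offenders = [l for l in added if ".skip(" in l]

if offenders:
    print("Blocked by ratchet rule: 'Never comment out tests; delete or fix them.'")
    for line in offenders:
        print("  ", line)
    sys.exit(1)  # non-zero exit aborts the commit; the error re-enters the loop
```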

Constraints should only be added when you observe a real failure, and removed only when a capable model renders them redundant. Every line in a good system prompt should trace back to a specific historical failure.

This makes harness engineering a discipline, not a one-size-fits-all framework. The right harness for a specific codebase is entirely shaped by its unique failure history.

Working backwards from behavior

Behavior we want → Harness design to achieve it. Every component must have a distinct job. If you cannot name the specific behavior a component exists to deliver, remove it.

Component map (the operational pillars)

| Component | Job |
| --- | --- |
| Filesystem + Git | Durable state. Workspace to read data, offload intermediate work, multi-agent coordination surface. Git = free versioning. |
| Bash + code execution | General-purpose tooling via the ReAct loop (reason → act → observe → repeat). The agent builds tools on the fly instead of pre-building for every action. |
| Sandboxes + default tooling | Isolated environment to run code safely. Pre-installed runtimes, test CLIs, and headless browsers close the self-verification loop. |
| Memory + search | Bridges the training-cutoff gap. AGENTS.md / CLAUDE.md inject knowledge per session; web search + MCP cover real-time needs. |
| Context-rot mitigation | Compaction (summarize older context), tool-call offloading (massive outputs go to the filesystem, only headers stay in context), progressive disclosure (load tools only when needed). |
| Long-horizon execution | Loops (intercept exit, force continuation), planning (decompose into a step-by-step file with self-verification), splits (separate generation from evaluation to avoid positive bias). |
| Hooks | The enforcement layer. Run at lifecycle points (before tool call, after edit, before commit) to block destructive commands, auto-format, and run tests. Success is silent, failures are verbose: a typecheck pass says nothing; a typecheck failure is injected back into the loop for self-correction (sketched below). |
| Tool design | Ten focused tools beat fifty overlapping ones. Tool descriptions populate the prompt, so bad MCP servers inject bad prompts before the agent even starts. |
| CLAUDE.md / AGENTS.md | The highest-leverage configuration point in a repo. Treat it like a pilot’s checklist, NOT a style guide: short, with every rule earned through a past failure. |
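The hooks row’s “success silent, failures verbose” contract is worth a sketch. This is generic harness code, not any specific SDK’s hook API, and npx tsc --noEmit is an assumed typecheck command:

```python
import subprocess

def after_edit_hook(transcript):
    """Runs after every file edit. Silent on success; on failure the
    error text is appended to the transcript so the model sees it on
    the next turn and can self-correct."""
    check = subprocess.run(["npx", "tsc", "--noEmit"],
                           capture_output=True, text=True)
    if check.returncode == 0:
        return  # success is silent: no tokens spent confirming the obvious
    transcript.append({
        "role": "system",
        "content": "Typecheck failed after your edit:\n"
                   + check.stdout + check.stderr,
    })
```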

Harnesses don’t shrink, they move

When models improve, scaffolding doesn’t disappear; it shifts. Better models killed the old “context-anxiety” mitigations, but they also unlocked new tasks that bring new failure modes. Every component encodes an assumption about what the model can’t do alone. As the floor rises, so does the ceiling. Outdated scaffolding gets removed; new scaffolding gets built for the next horizon.

HaaS — Harness-as-a-Service

The industry is shifting from building on LLM APIs (completions) to building on harness APIs (runtimes). SDKs ship the loop, tools, context management, hooks, sandboxes out of the box. The modern default: pick a harness framework, configure its core pillars, focus purely on domain-specific prompt + tool design. (Cites @FredKSchott’s Flue.)
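For shape only, here is what that default could look like as code. Every name below is invented for illustration and no real harness SDK is implied: the vendor ships the loop and enforcement plumbing, and you configure the domain-specific prompt and tool surface.

```python
# Hypothetical harness-as-a-service surface; all names are invented.
from dataclasses import dataclass, field
from typing import Callable

@dataclass
class Harness:
    system_prompt: str
    tools: dict[str, Callable] = field(default_factory=dict)
    hooks: dict[str, Callable] = field(default_factory=dict)

    def run(self, task: str) -> str:
        # The loop, compaction, sandboxing, and hook dispatch would live
        # here, maintained by the harness vendor rather than by you.
        return f"[would run {task!r} with {len(self.tools)} tools]"

harness = Harness(
    system_prompt="Never comment out tests; delete or fix them.",  # earned rule
    tools={"read_file": lambda path: open(path).read()},           # focused tool
)
print(harness.run("triage flaky tests"))
```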

Mapping against Ray Data Co — STRONG

This is a direct vocabulary upgrade for what we’ve been doing without naming it. The framework matches our operating model 1:1:

What we already do that maps directly to Osmani’s framework

Where Osmani’s framework names a gap

Convergence with Tobi (Shopify River) and Avedissian (loop-is-moat)

All three pieces dropped within 36h of each other, landing the same insight at three altitudes: loop (Avedissian), org (Tobi), harness (Osmani).

This is not a coincidence. The vibe shift is happening: when models converge, the harder-to-copy substrate at each layer (loop, org, or harness) is where defensibility lives. Worth tracking as a coherent thesis cluster, not three separate pieces.

Sanity Check candidate (high quality)

Working title: “The harness is the moat (and your CLAUDE.md is your most expensive line of code).”

Original re-frame: every team is racing to upgrade models. The teams getting outsized returns are tightening the harness — the prompts, hooks, subagents, and tool surfaces around the model. RDCO’s nine-month operating record is the proof case: same Claude model the rest of the industry uses, but a harness ratcheted on every failure produces a COO-tier execution surface. The Sanity Check piece would walk through the ratchet pattern with three or four worked examples from RDCO’s /improve history, show how each rule got earned, and argue that most teams are leaving 80% of the model on the table because they have no harness discipline.

Voice match: empirical observation + named discipline + contrarian takeaway (the model isn’t the problem, you are). Plays directly to the founder’s voice strengths.

Tier: high-priority research-brief candidate. If founder green-lights, dispatch /research-brief addy-osmani-harness-engineering.

Notable quotes (≤15 words each, in quotation marks)

Open follow-ups

Source caveat

Article body retrieved via xmcp getPostsById with tweet.fields: ["article", ...] + expansions: ["article.cover_media", "article.media_entities", "author_id"]. The plain text came back as the full ~2,100-word body, cleanly. Four embedded media (cover image + 3 diagrams) were returned as media_keys but not pulled; diagrams referenced in the body include Fareed Khan’s Claude Code architecture diagram and a component-map visual. If the Sanity Check piece moves forward, fetch those frames.
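For reproducibility, a rough HTTP equivalent of that call, assuming xmcp proxies X API v2. The field and expansion names are copied from the note above; the endpoint, auth shape, and media.fields parameter are assumptions based on standard X API v2 usage.

```python
# Assumed raw-HTTP shape of the xmcp getPostsById retrieval.
import os
import requests

resp = requests.get(
    "https://api.x.com/2/tweets",
    headers={"Authorization": f"Bearer {os.environ['X_BEARER_TOKEN']}"},
    params={
        "ids": "<post_id>",  # elided in the note; left as a placeholder
        "tweet.fields": "article",
        "expansions": "article.cover_media,article.media_entities,author_id",
        "media.fields": "url,type",  # needed to actually pull the 4 media items
    },
    timeout=30,
)
resp.raise_for_status()
article = resp.json()["data"][0]["article"]  # full long-form body
```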