06-reference

addy osmani agent harness engineering

Sat May 09 2026, 20:00 ET · reference · source: X long-form article by @addyosmani · by Addy Osmani (Director, Google Cloud AI)
agent-harness-engineering · harness-as-discipline · ratchet-pattern · context-rot · hooks-enforcement · claude-md · mcp-tool-design · vocabulary-upgrade · rdco-validation

“Agent Harness Engineering” — @addyosmani (Addy Osmani)

Why this is in the vault

Founder shared 2026-05-09 ~23:07 ET (or 2026-05-10 early AM) without comment. This piece puts a name on what RDCO has been doing, unlabeled, for nine months. Osmani is consolidating a discipline coined this week by @Vtrivedy10 (“harness engineering”), with @dexhorthy / HumanLayer, Anthropic, and Birgitta Böckeler all converging on the same idea. The bookmark-to-like ratio of 2.25:1 (3,497 bookmarks / 1,556 likes on 229k impressions) is the strongest practitioner-save signal we’ve seen on an agent-architecture piece this month: engineers are filing this for reference, not just liking it.

This is the third same-week piece converging on integration-as-moat, each at a different layer (alongside Tobi’s Shopify River memo and Avedissian’s loop-is-moat piece; see the convergence section below).

Same insight, different altitude. Harness engineering is the most rigorous treatment because it’s the most operationally specific.

The core argument

Agent = Model + Harness. A raw model is not an agent. It only becomes one when a harness provides state, tool execution, feedback loops, and enforceable constraints. If you’re not the model, you’re the harness.

The harness is everything that isn’t the model: system prompts, CLAUDE.md / AGENTS.md / skill files, subagent instructions, tools, MCP servers, sandboxes, headless browsers, orchestration logic, hooks, middleware, observability. Massive surface area — but it’s your surface area, not the model provider’s.

A decent model with a great harness consistently beats a great model with a bad harness. The gap between what today’s models can theoretically do and what you actually see them doing is largely a harness gap.
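To make the division of labor concrete, here is a minimal, self-contained sketch of that split. Every name in it is invented for illustration, and the ToyModel stands in for any completion API: the model only proposes the next action, while the harness owns the state, the loop, tool execution, and the step budget.

```python
from dataclasses import dataclass, field

@dataclass
class Action:
    tool: str = ""                      # empty tool name means "finish"
    args: dict = field(default_factory=dict)
    result: str = ""

class ToyModel:
    """Stand-in for any completion API; a real harness would call an LLM."""
    def propose(self, transcript):
        # Pretend reasoning: read a file once, then finish.
        if any(m["role"] == "tool" for m in transcript):
            return Action(result="summary written")
        return Action(tool="read_file", args={"path": "README.md"})

def run_agent(model, tools, task, max_steps=10):
    transcript = [{"role": "user", "content": task}]        # harness-owned state
    for _ in range(max_steps):                              # harness-owned loop
        action = model.propose(transcript)                  # the only model call
        if not action.tool:
            return action.result
        obs = tools[action.tool](**action.args)             # tool execution
        transcript.append({"role": "tool", "content": obs})  # feedback loop
    raise RuntimeError("step budget exhausted")             # enforceable constraint

print(run_agent(ToyModel(),
                {"read_file": lambda path: f"<contents of {path}>"},
                "summarize the repo"))
```

Everything in that function body except the `model.propose` call is harness: yours to own, instrument, and ratchet.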

The Ratchet — every mistake becomes a rule

The most vital habit: treat agent mistakes as permanent signals, not one-off flukes. If the agent ships a PR with a commented-out test, the next AGENTS.md must say “Never comment out tests; delete or fix them.” The pre-commit hook flags .skip( in the diff. The reviewer subagent updates to block it.
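As a concrete instance of the ratchet, here is a sketch of that pre-commit hook. The .skip( pattern and the rule text come from the example above; the script itself is an assumed implementation, not taken from the article.

```python
#!/usr/bin/env python3
"""Ratchet example: a pre-commit hook earned by one observed failure."""
import subprocess
import sys

# Staged changes only; --unified=0 keeps unchanged context lines out.
diff = subprocess.run(
    ["git", "diff", "--cached", "--unified=0"],
    capture_output=True, text=True, check=True,
).stdout

# Only inspect added lines, so pre-existing skips elsewhere don't block.
added = [l for l in diff.splitlines()
         if l.startswith("+") and not l.startswith("+++")]
offenders = [l for l in added if ".skip(" in l]

if offenders:
    print("Blocked by ratchet rule: 'Never comment out tests; delete or fix them.'")
    for line in offenders:
        print("  ", line)
    sys.exit(1)  # non-zero exit aborts the commit; the error re-enters the loop
```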

Constraints should only be added when you observe a real failure, and removed only when a capable model renders them redundant. Every line in a good system prompt should trace back to a specific historical failure.

This makes harness engineering a discipline, not a one-size-fits-all framework. The right harness for a specific codebase is entirely shaped by its unique failure history.

Working backwards from behavior

Behavior we want → Harness design to achieve it. Every component must have a distinct job. If you cannot name the specific behavior a component exists to deliver, remove it.

Component map (the operational pillars)

| Component | Job |
| --- | --- |
| Filesystem + Git | Durable state. Workspace to read data, offload intermediate work, multi-agent coordination surface. Git = free versioning. |
| Bash + code execution | General-purpose tooling via the ReAct loop (reason → act → observe → repeat). The agent builds tools on the fly instead of pre-building for every action. |
| Sandboxes + default tooling | Isolated environment to run code safely. Pre-installed runtimes, test CLIs, and headless browsers close the self-verification loop. |
| Memory + search | Bridges the training-cutoff gap. AGENTS.md / CLAUDE.md inject knowledge per session; web search + MCP cover real-time needs. |
| Context-rot mitigation | Compaction (summarize older context), tool-call offloading (massive outputs go to the filesystem, only headers stay in context), progressive disclosure (load tools only when needed). |
| Long-horizon execution | Loops (intercept exit, force continuation), planning (decompose into a step-by-step file with self-verification), splits (separate generation from evaluation to avoid positive bias). |
| Hooks | The enforcement layer. Run at lifecycle points (before tool call, after edit, before commit) to block destructive commands, auto-format, and run tests. Success is silent, failures are verbose: a typecheck pass says nothing; a typecheck failure is injected back into the loop for self-correction (sketched below). |
| Tool design | Ten focused tools beat fifty overlapping ones. Tool descriptions populate the prompt, so bad MCP servers inject bad prompts before the agent even starts. |
| CLAUDE.md / AGENTS.md | The highest-leverage configuration point in a repo. Treat it like a pilot’s checklist, NOT a style guide: short, with every rule earned through a past failure. |
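The hooks row’s “success silent, failures verbose” contract is worth a sketch. This is generic harness code, not any specific SDK’s hook API, and npx tsc --noEmit is an assumed typecheck command:

```python
import subprocess

def after_edit_hook(transcript):
    """Runs after every file edit. Silent on success; on failure the
    error text is appended to the transcript so the model sees it on
    the next turn and can self-correct."""
    check = subprocess.run(["npx", "tsc", "--noEmit"],
                           capture_output=True, text=True)
    if check.returncode == 0:
        return  # success is silent: no tokens spent confirming the obvious
    transcript.append({
        "role": "system",
        "content": "Typecheck failed after your edit:\n"
                   + check.stdout + check.stderr,
    })
```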

Harnesses don’t shrink, they move

When models improve, scaffolding doesn’t disappear; it shifts. Better models killed the old “context-anxiety” mitigations, but they also unlocked new tasks that bring new failure modes. Every component encodes an assumption about what the model can’t do alone. As the floor rises, so does the ceiling. Outdated scaffolding gets removed; new scaffolding gets built for the next horizon.

HaaS — Harness-as-a-Service

The industry is shifting from building on LLM APIs (completions) to building on harness APIs (runtimes). SDKs ship the loop, tools, context management, hooks, sandboxes out of the box. The modern default: pick a harness framework, configure its core pillars, focus purely on domain-specific prompt + tool design. (Cites @FredKSchott’s Flue.)
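For shape only, here is what that default could look like as code. Every name below is invented for illustration and no real harness SDK is implied: the vendor ships the loop and enforcement plumbing, and you configure the domain-specific prompt and tool surface.

```python
# Hypothetical harness-as-a-service surface; all names are invented.
from dataclasses import dataclass, field
from typing import Callable

@dataclass
class Harness:
    system_prompt: str
    tools: dict[str, Callable] = field(default_factory=dict)
    hooks: dict[str, Callable] = field(default_factory=dict)

    def run(self, task: str) -> str:
        # The loop, compaction, sandboxing, and hook dispatch would live
        # here, maintained by the harness vendor rather than by you.
        return f"[would run {task!r} with {len(self.tools)} tools]"

harness = Harness(
    system_prompt="Never comment out tests; delete or fix them.",  # earned rule
    tools={"read_file": lambda path: open(path).read()},           # focused tool
)
print(harness.run("triage flaky tests"))
```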

Mapping against Ray Data Co — STRONG

This is a direct vocabulary upgrade for what we’ve been doing without naming it. The framework matches our operating model 1:1:

What we already do that maps directly to Osmani’s framework

Where Osmani’s framework names a gap

Convergence with Tobi (Shopify River) and Avedissian (loop-is-moat)

All three pieces dropped within 36h of each other, landing the same insight at three altitudes: loop (Avedissian), org (Tobi), harness (Osmani).

This is not a coincidence. The vibe shift is happening: when models converge, the harder-to-copy substrate at each layer (loop, org, or harness) is where defensibility lives. Worth tracking as a coherent thesis cluster, not three separate pieces.

Sanity Check candidate (high quality)

Working title: “The harness is the moat (and your CLAUDE.md is your most expensive line of code).”

Original re-frame: every team is racing to upgrade models. The teams getting outsized returns are tightening the harness — the prompts, hooks, subagents, and tool surfaces around the model. RDCO’s nine-month operating record is the proof case: same Claude model the rest of the industry uses, but a harness ratcheted on every failure produces a COO-tier execution surface. The Sanity Check piece would walk through the ratchet pattern with three or four worked examples from RDCO’s /improve history, show how each rule got earned, and argue that most teams are leaving 80% of the model on the table because they have no harness discipline.

Voice match: empirical observation + named discipline + contrarian takeaway (the model isn’t the problem, you are). Plays directly to the founder’s voice strengths.

Tier: high-priority research-brief candidate. If founder green-lights, dispatch /research-brief addy-osmani-harness-engineering.

Notable quotes (≤15 words each, in quotation marks)

Open follow-ups

Source caveat

Article body retrieved via xmcp getPostsById with tweet.fields: ["article", ...] + expansions: ["article.cover_media", "article.media_entities", "author_id"]. The plain text came back as the full ~2,100-word body, cleanly. Four embedded media (cover image + 3 diagrams) were returned as media_keys but not pulled; diagrams referenced in the body include Fareed Khan’s Claude Code architecture diagram and a component-map visual. If the Sanity Check piece moves forward, fetch those frames.
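For reproducibility, a rough HTTP equivalent of that call, assuming xmcp proxies X API v2. The field and expansion names are copied from the note above; the endpoint, auth shape, and media.fields parameter are assumptions based on standard X API v2 usage.

```python
# Assumed raw-HTTP shape of the xmcp getPostsById retrieval.
import os
import requests

resp = requests.get(
    "https://api.x.com/2/tweets",
    headers={"Authorization": f"Bearer {os.environ['X_BEARER_TOKEN']}"},
    params={
        "ids": "<post_id>",  # elided in the note; left as a placeholder
        "tweet.fields": "article",
        "expansions": "article.cover_media,article.media_entities,author_id",
        "media.fields": "url,type",  # needed to actually pull the 4 media items
    },
    timeout=30,
)
resp.raise_for_status()
article = resp.json()["data"][0]["article"]  # full long-form body
```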