
agent harness landscape

Sat May 09 2026 20:00:00 GMT-0400 (Eastern Daylight Time) · research-brief · source: deep-research
harness-engineering · competitive-landscape · ray-as-a-service · bootstrap · productization

Agent Harness Landscape - May 2026

The question

Verbatim from founder, 2026-05-10 11:34 ET, after reading Addy Osmani’s harness-engineering piece:

“Shopify rolled their own [River]. I guess that’s what I’m trying to find. Is the customization/personalization so important that everyone will need to roll their own or how far can you bootstrap the setup with a productive solution. Ray is really just a thin wrapper around Claude Code.”

Direct strategic input to the Ray-as-a-Service / Ray-Starter-Kit bet decision.

What we already know (from the vault)

The market today

The harness market in May 2026 has a clear shape: a small set of “general-purpose” coding/work harnesses (Claude Code, Cursor, OpenAI Codex CLI, Cline, Aider, Continue, OpenHands), a young “personal AI” harness category (Hermes Agent from Nous Research), an academic reference (SWE-agent), and a growing pattern of teams building thin orchestration ON TOP of one of these (Shopify’s Roast wraps Claude Code, not from scratch). All of them ship the same universal-layer kit: a loop, tools (file / bash / search), some sandbox boundary, MCP for external integrations, hooks or events, and a markdown-rules file for personalization. The differentiation is not “what’s in the box” - it’s “what’s the bootstrap floor” and “what does the personal-fit layer look like.”
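Every "universal layer" in that list reduces to the same skeleton: a loop, a tool registry behind a sandbox boundary, and a stop condition. A minimal sketch of that shared shape - names and the toy JSON tool-call convention are illustrative assumptions, not any specific harness's API:

```python
import json


def parse_tool_call(reply: str):
    """Treat a reply that is a JSON object with a 'tool' key as a tool call."""
    try:
        obj = json.loads(reply)
        return obj if isinstance(obj, dict) and "tool" in obj else None
    except json.JSONDecodeError:
        return None


def run_agent(model, tools, task, max_turns=20):
    """The universal loop: model step -> maybe tool step -> repeat until the
    model answers in plain text instead of calling a tool."""
    history = [{"role": "user", "content": task}]
    for _ in range(max_turns):
        reply = model(history)
        history.append({"role": "assistant", "content": reply})
        call = parse_tool_call(reply)
        if call is None:
            return reply  # no tool call = the loop is done
        # the sandbox/permission boundary lives at this dispatch point
        result = tools[call["tool"]](**call.get("args", {}))
        history.append({"role": "tool", "content": str(result)})
    return "max turns reached"
```

Everything in the table below - hooks, skills, MCP, compaction - is elaboration around this loop; the differentiation is in the elaboration, not the skeleton.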

| Harness | Universal layer (shipped) | Personal-fit layer (operator) | Bootstrap floor | Notable rolled-their-own escapes |
| --- | --- | --- | --- | --- |
| Claude Code | Loop, 45+ tools, sandbox/permissions, MCP, hooks, skills, subagents, 5-strategy compaction, plugins | CLAUDE.md hierarchy (enterprise/project/user/local), ~/.claude/skills/, .claude/settings.json hooks | ~/.claude/CLAUDE.md + one skill = productive | Shopify Roast (workflow shell), RDCO/Ray (vault + 60+ skills), affaan-m’s “everything-claude-code” perf system |
| Cursor | IDE-integrated agent, semantic+grep search, browser, terminal, .cursor/hooks.json, MCP, worktree sandbox, Cursor SDK (2026, breaks agents out of the editor) | .cursor/rules/*.mdc (project), .cursor/commands/ (skills, nightly) | One .cursor/rules/ markdown file with build/test commands | Cursor SDK lets teams put Cursor’s agent in CI/runtime, replacing the IDE shell |
| OpenAI Codex CLI | Rust loop, tools, ~/.codex/config.toml, MCP via STDIO/HTTP, agent skills, subagents (explicit-only), Agents SDK escape | ~/.codex/AGENTS.md global + project AGENTS.md, [agents] config, MCP servers | AGENTS.md + codex install = productive | Codex-as-MCP-server pattern: orchestrate Codex from Agents SDK for deterministic pipelines |
| Aider | Repo map (PageRank over symbol graph), git auto-commit per change, edit-format coders (EditBlock, UnifiedDiff, Architect, etc.), LiteLLM 100+ provider routing | CONVENTIONS.md (read into prompt), YAML config, model selection | pip install aider-chat + aider in repo = productive | Aider is itself often the rolled-their-own atop OpenAI/Anthropic SDKs - very thin foundation |
| Cline | VS Code extension, ReAct loop, plan/act mode toggle, browser tool, tool-creation-on-the-fly, MCP first-class | .clinerules/*.md (workspace) + global rules, Memory Bank pattern, conditional rules with YAML frontmatter glob | One .clinerules/ file = productive | Memory Bank: operators wire their own persistent memory layer because Cline’s session memory is shallow |
| Continue | Agent / Chat / Plan modes, MCP, custom slash commands, async “Continuous AI” pivot | config.yaml rules (text or markdown), baseAgentSystemMessage, MCP additions, Mission Control central rule registry | Install + config.yaml + one rule = productive | Pivoted hard to async/CI agents in 2026, ceding interactive IDE ground to Cursor |
| OpenHands | V1 immutable event-log architecture, Docker sandbox, CodeAct agent, Pydantic-typed tools, MCP-aligned, GitHub integration, micro-agents | Custom agents, micro-agents (small task-specific), tool definitions | Docker + one config = productive; cloud version one-click | All-Hands-AI run their own platform on top; SWE-Bench 72% baseline is the rolled-their-own benchmark target |
| SWE-agent (Princeton) | The “ACI” academic reference: linter-gated edits, custom file viewer, history processors, tools for repo-scale navigation | Configurable via YAML, prompt templates | pip install + LM key + GitHub issue URL | Pure research harness; the reference everyone else implicitly compares to |
| Hermes Agent (Nous Research) | “First personal AI agent that ships with the harness already built in” - automated 5-layer harness (loop/tools/memory/skills/sandbox), self-improving skill writer (auto-generates ~/.hermes/skills/* after notable runs), multi-platform gateway (Telegram/Discord/Slack/WhatsApp/Signal/Email/CLI) | Skills accumulate automatically rather than being authored, persistent cross-session memory by design | One install command + auth = productive; harness ratchets itself | The first harness explicitly automating the ratchet. 27k+ GitHub stars by Apr 2026. |
| Shopify Roast | Ruby DSL workflow orchestrator on top of Claude Code. Convention over configuration (Rails philosophy). CodingAgent invokes Claude Code as a tool inside structured workflows | Workflow definitions in Ruby, prompt files, step composition | gem install roast + workflow.rb = productive | This IS the “rolled their own” - but it’s a thin shell on top of Claude Code, not a from-scratch harness. River the agent is built on top of Roast + Claude Code + Shopify’s MCP servers + LLM proxy. |

Per-harness deep-dive

Claude Code (Anthropic)

Universal layer is the most architecturally complete in market: async-generator loop, 45+ tools classified by concurrency (read = parallel, write = serial), 7-stage permission pipeline, 5-strategy compaction cascade (microcompact / snip / auto-compact / context-collapse), four-tier instruction hierarchy (enterprise / project / user / local), four extension mechanisms (skills / hooks / MCP / plugins), subagent task isolation with disk-backed coordination. Personal-fit lives in CLAUDE.md files at four tiers + ~/.claude/skills/ markdown + .claude/settings.json hooks + per-project .mcp.json. Bootstrap floor is genuinely tiny: install Claude Code, add a CLAUDE.md, you are productive in 10 minutes. RDCO/Ray is the proof: 60+ skills, 1490 vault docs, multi-MCP, deterministic audit hooks - all built ON TOP of Claude Code without forking anything. Sources: Anthropic docs, Rohit teardown, Alex Op full-stack writeup, vault: Claude Code architecture teardown.
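The compaction cascade can be modeled as strategies tried in escalation order until the context fits the budget. A simplified sketch - the strategy names above come from the teardown, but this logic and the two toy strategies here are assumptions, not Anthropic's implementation:

```python
def total_tokens(history):
    # crude stand-in for a tokenizer: whitespace-split word count
    return sum(len(msg.split()) for msg in history)


def drop_oldest(history):
    """Cheapest strategy: evict the oldest message."""
    return history[1:]


def truncate_each(history):
    """More aggressive: keep only the first 5 words of every message."""
    return [" ".join(m.split()[:5]) for m in history]


def compact(history, budget, strategies):
    """Apply strategies in escalation order (cheap first), each repeatedly,
    until the history fits the token budget. Simplified model of a cascade."""
    for strategy in strategies:
        while total_tokens(history) > budget:
            smaller = strategy(history)
            if total_tokens(smaller) >= total_tokens(history):
                break  # this strategy can't shrink further; escalate
            history = smaller
    return history
```

The design point the teardown highlights is the ordering: cheap, low-information-loss strategies run first, and destructive ones only fire when the budget still isn't met.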

Cursor

Universal layer is IDE-first: agent runs inside the editor with semantic+grep codebase search, browser tool, terminal, .cursor/hooks.json for pre/post-action scripts, MCP for external services, git-worktree sandboxes for parallel agents. The 2026 surprise was the Cursor SDK which breaks the agent out of the IDE - operators can run Cursor agents in CI / runtime / arbitrary contexts. Personal-fit migrated from .cursorrules (legacy single file) to .cursor/rules/*.mdc (per-glob, version-controlled, scoped). Bootstrap is one rules file with build/test commands. Cursor’s published guidance treats rules as “the single biggest lever to make Cursor stop hallucinating.” Sources: Cursor agent best practices, Cursor docs rules, vault: AlphaSignal Cursor SDK followup.
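Rule scoping by glob is the mechanic underneath .cursor/rules/*.mdc: only the rules whose pattern matches the file being edited enter the prompt. A hedged sketch of that idea using Python's fnmatch - illustrative only, not Cursor's actual loader (which reads globs from .mdc frontmatter):

```python
from fnmatch import fnmatch


def rules_for(changed_path, rules):
    """Return the rule bodies whose glob matches the file being edited.
    `rules` maps glob -> rule text, standing in for per-glob .mdc files."""
    return [text for glob, text in rules.items() if fnmatch(changed_path, glob)]
```

The point of the per-glob design over the legacy single .cursorrules file: irrelevant rules stay out of context, which is the "stop hallucinating" lever the published guidance refers to.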

OpenAI Codex CLI

Universal layer is a Rust-built loop with native tool execution, MCP via STDIO/HTTP servers in ~/.codex/config.toml, agent skills invoked via $skill-name, opt-in subagents configured in [agents] block, and an Agents SDK escape hatch that exposes the entire CLI as an MCP server (so larger orchestrators can invoke it deterministically). Personal-fit is AGENTS.md (the same standard Cline / others read), with AGENTS.override.md for per-machine overrides. Bootstrap is npm install -g @openai/codex (or brew) + an AGENTS.md file. Codex is the most “Anthropic-lookalike” of the competing harnesses - Anthropic’s Claude Code shipped first, OpenAI followed with very similar shape. Sources: Codex CLI docs, AGENTS.md guide.

Aider

Universal layer is the smallest of the major harnesses but disproportionately load-bearing for one capability: the PageRank-based repo map that builds a directed graph of symbol definitions+references across the entire codebase, then ranks files by relevance and renders the top-ranked definitions as elided code views inside the token budget. This is the part Hermes-agent and others now publicly cite as the gold standard for repo-scale context selection. Plus: every agent change is an atomic git commit, multiple coder variants for different edit formats (EditBlockCoder, WholeFileCoder, UnifiedDiffCoder, ArchitectCoder), LiteLLM-routed model agnosticism (100+ providers). Personal-fit is CONVENTIONS.md read straight into the prompt + YAML config + model choice. Bootstrap floor: pip install aider-chat && aider inside a git repo. Aider is the most purist “fat skills, thin harness” implementation in market. Sources: Aider repo map docs, Simran Chawla’s architectural analysis.
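The repo-map idea can be sketched end to end: build edges from referencing file to defining file, run PageRank over them, rank files. A toy version under stated assumptions (synthetic defs/refs maps, hand-rolled PageRank, no elision step) - not Aider's actual code:

```python
def rank_files(defs, refs, damping=0.85, iters=50):
    """Rank files so that ones defining widely-referenced symbols come first.
    `defs` maps symbol -> defining file; `refs` maps file -> symbols it uses."""
    files = set(defs.values()) | set(refs)
    # one edge per symbol reference: referencing file -> defining file
    edges = {f: [defs[s] for s in syms if s in defs and defs[s] != f]
             for f, syms in refs.items()}
    rank = {f: 1.0 / len(files) for f in files}
    for _ in range(iters):
        new = {f: (1 - damping) / len(files) for f in files}
        for src, dsts in edges.items():
            if dsts:
                share = damping * rank[src] / len(dsts)
                for d in dsts:
                    new[d] += share
        # redistribute mass from files with no outgoing references
        dangling = sum(rank[f] for f in files if not edges.get(f))
        for f in files:
            new[f] += damping * dangling / len(files)
        rank = new
    return sorted(files, key=rank.get, reverse=True)
```

Aider's real version then renders elided definition snippets from the top-ranked files into the token budget; the ranking step above is the part other harnesses cite.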

Cline

Universal layer: VS Code extension running a ReAct (Reason-Act-Observe) loop with plan/act mode toggle (operator can force planning before action), browser tool, MCP first-class with the ability to create new MCP servers from inside Cline (self-extending toolkit), per-tool human approval gates. Personal-fit: .clinerules/*.md workspace rules + global rules (~/Documents/Cline/Rules), conditional activation via YAML frontmatter glob patterns, plus the Memory Bank pattern (operator-authored markdown structure that Cline reads at session start to recover state across forgettable sessions). Cline reads .clinerules/, .cursorrules, .windsurfrules, AND AGENTS.md - explicitly cross-tool compatible. Bootstrap: install extension + one rules file. Sources: Cline rules docs, Memory Bank pattern.

Continue

Universal layer: VS Code / JetBrains extension with Agent / Chat / Plan modes, MCP support, custom slash commands. Distinguishing 2026 move: pivoted to “Continuous AI” - async background agents that enforce standards in CI, conceding interactive IDE ground to Cursor. Personal-fit: config.yaml (or .md) rules, baseAgentSystemMessage model-level overrides, Mission Control central rule registry. Bootstrap is install + config.yaml + one rule. Continue is the harness whose moat moved fastest: started as Cursor competitor, became async-first-team-process tool. Sources: Continue docs rules, Continue.dev pivot review.

OpenHands (formerly OpenDevin)

Universal layer: V1 architecture with immutable event log (every action and observation is an event, enabling deterministic replay and pause/resume - a feature most other harnesses lack), Docker sandbox, CodeAct agent (the SWE-Bench 72% baseline against Claude Sonnet 4.5), Pydantic-typed tools, MCP-aligned sandboxing, GitHub integration, “micro-agents” for small task-specific work, cloud sandboxes for parallel agent execution. Personal-fit: micro-agent definitions, custom agent classes. Bootstrap: Docker + config OR one-click cloud. The most “operator owns the agent’s full execution history” of any open-source option. Sources: OpenHands.dev, OpenHands V1 architecture, arxiv 2407.16741.
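The event-log architecture is why replay and pause/resume come for free: state is never mutated directly, only derived by folding over an append-only log. A minimal model of that idea - names are illustrative, not OpenHands' API:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    kind: str      # e.g. "action" or "observation"
    payload: str


class EventLog:
    """Append-only log. Any view of agent state is a fold over the events,
    so deterministic replay is just re-running the fold."""
    def __init__(self):
        self._events = []

    def append(self, event: Event):
        self._events.append(event)

    def replay(self, reducer, initial):
        state = initial
        for e in self._events:
            state = reducer(state, e)
        return state
```

Pause/resume falls out the same way: persist the log, reload it later, and replay to recover exactly the state the agent stopped in.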

SWE-agent (Princeton)

Pure academic reference. The “Agent-Computer Interface” (ACI) thesis: how the agent talks to the computer matters more than which model it is. Innovations later adopted everywhere: linter runs on every edit and BLOCKS syntactically-broken code from being committed, custom file viewer instead of cat, history processors that compress context. Bootstrap: pip install + LM key + GitHub issue. Not a productized harness for daily operator use - it’s the citation other harnesses use to justify their tooling decisions. Sources: arxiv 2405.15793, SWE-agent docs.
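The linter gate is simple to state precisely: syntax-check the candidate edit before it is allowed to land. A sketch for Python files, with the built-in compile() standing in for SWE-agent's actual linter:

```python
def lint_gate(path, new_source):
    """Return (ok, message) for a candidate edit. The caller only writes the
    file when ok is True - broken code never lands. compile() is a stand-in
    for the real linter and only covers Python syntax."""
    try:
        compile(new_source, path, "exec")
        return True, "ok"
    except SyntaxError as err:
        return False, f"line {err.lineno}: {err.msg}"
```

The ACI insight is that the rejection message goes straight back to the agent as an observation, so it self-corrects instead of committing a broken state.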

Hermes Agent (Nous Research)

The newcomer that matters most for RDCO. Universal layer: the harness already built in - all 5 layers automated (loop / tools / memory / skills / sandbox). Self-improving learning loop: after each task, Hermes evaluates whether to write a skill (triggers: tool called >5 times, mistake-then-fix, user correction, unobvious-but-effective path). Auto-writes to ~/.hermes/skills/* without operator authoring. Multi-platform gateway: native to Telegram, Discord, Slack, WhatsApp, Signal, Email, CLI - the “channels” architecture that Ray independently arrived at, but shipped as default. Persistent cross-session memory and “deepening model of who you are” by design. Personal-fit accumulates automatically rather than being hand-authored - this is a structural bet that the personal-fit layer can be auto-ratcheted, not just hand-curated. Bootstrap: install + auth = productive. 27k+ GitHub stars by Apr 2026. Sources: hermes-agent.nousresearch.com, DataCamp tutorial, DEV writeup. Caveat: I have not run Hermes; the “5 layers automated” claim is from their docs and a third-party review, not first-hand verified.
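The skill-write triggers listed in their docs amount to a small decision function run after each task. A sketch with an assumed session-summary shape and assumed threshold values - not Hermes's code:

```python
def should_write_skill(session):
    """Decide whether a finished session earns a new skill file, using the
    four triggers from the Hermes docs. `session` is an assumed dict
    summarizing the run; the >5 threshold is theirs, the shape is mine."""
    if max(session.get("tool_counts", {}).values(), default=0) > 5:
        return True   # same tool hammered repeatedly: encode the recipe
    if session.get("mistake_then_fix"):
        return True   # a recovered failure is exactly what a rule captures
    if session.get("user_correction"):
        return True   # the human overrode the agent: persist the correction
    if session.get("novel_effective_path"):
        return True   # unobvious-but-effective approach worth keeping
    return False
```

This is the auto-ratchet in miniature: the same discipline Ray applies by hand ("every failure becomes a rule"), run as a post-session check.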

Shopify Roast (the “rolled their own” example)

This is the critical clarification for the founder’s question. Shopify did NOT build River from scratch as a competing harness to Claude Code. They built Roast - a Ruby workflow orchestration framework that follows Rails’ “convention over configuration.” Roast wraps Claude Code as a tool: CodingAgent is the integration point that invokes Claude Code from inside structured workflows. Workflows can interleave agentic Claude Code steps with deterministic non-AI Ruby code. Roast 1.0 (Apr 2026) replaced YAML configs with a pure Ruby DSL. River - the Slack-native agent that opens 1,870+ PRs/week and crossed 50% of Shopify’s code being AI-generated - sits on top of: (a) Shopify’s internal LLM proxy, (b) “MCP everything” internal MCP servers (GSuite, Slack, Salesforce, internal data warehouses), (c) Roast for workflow shape, (d) Claude Code for the agentic execution underneath. Sources: Shopify Engineering: Introducing Roast, Shopify/roast GitHub, ZenML LLMOps DB writeup, vault: Tobi River public-channel agent, Bessemer Atlas: Shopify AI playbook, First Round AI feature.
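Roast's shape translates to a few lines: a workflow is an ordered list of steps, deterministic steps are plain functions, and agent steps hand a prompt to the wrapped harness. A Python sketch of the pattern - Roast itself is a Ruby DSL, and all names here are illustrative:

```python
def run_workflow(steps, agent, state=None):
    """Interleave deterministic code with agentic steps. A step is either a
    plain callable (deterministic) or an ("agent", prompt_template) pair that
    delegates to the wrapped harness, Roast/CodingAgent-style."""
    state = state or {}
    for step in steps:
        if callable(step):
            state = step(state)  # deterministic: plain code, no model call
        else:
            kind, template = step
            # agentic: fill the prompt from workflow state, hand off to agent
            state["agent_output"] = agent(template.format(**state))
    return state
```

The value proposition is exactly the founder's question in reverse: the harness stays Claude Code; only the repeatable workflow shape is "rolled their own."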

The “rolled their own” pattern

The founder’s framing was that “Shopify rolled their own [River].” Sharp correction: Shopify rolled their own ORCHESTRATION SHELL (Roast in Ruby), but the agent execution underneath is Claude Code. This is the dominant pattern in 2026 - not “build a competing harness from scratch,” but “build a thin domain-shaped shell that invokes one of the universal harnesses for actual agent work.” The escape valves teams build are:

  1. Workflow orchestrator on top (Shopify Roast around Claude Code) - when the universal harness’s loop is too unstructured for a specific repeatable process. Adds determinism in Ruby/Python code, calls the agent for the latent steps.
  2. Public-channel deployment shell (River around Roast around Claude Code) - when the org needs apprenticeship-by-osmosis: the operator builds a Slack-native surface and the agent refuses DMs so its work stays publicly visible.
  3. Domain-specific MCP server fleet (Shopify’s internal MCPs for GSuite/Slack/Salesforce/data warehouse) - the universal harness ships zero domain knowledge, so operators add MCP servers that expose their internal systems with the right authn and shape.
  4. CI / async deployment (Continue’s pivot, Cursor SDK, OpenHands cloud) - when the operator wants the agent off the dev’s machine and into background runtime.
  5. Auto-ratcheting skill accumulation (Hermes) - when the operator wants to automate the “every failure becomes a rule” discipline so the personal-fit layer self-builds.

What teams almost NEVER do in 2026: write a competing model-loop-tools-context layer from scratch. Even Shopify - the most-cited “rolled their own” example - chose to wrap Claude Code rather than rebuild it. The economic gravity is decisive: the universal layer is too good and too cheap to rebuild. The interesting innovation moved up to orchestration / deployment-shape / personal-fit accumulation.

This matches Moura’s dissent (06-reference/2026-04-13-moura-entangled-software-agent-harnesses-dead): harnesses commoditize fast. The durable shape is “entangled software” - data + workflow + agent in one product surface. Roast is exactly that for Shopify; River is the deployment-shape on top. RDCO is the same shape for solo-founder COO work; Ray is the deployment-shape on top.

How far can bootstrap go?

The founder’s read is dead-on: Ray is a thin wrapper around Claude Code - Channels MCP turned on, knowledge base provisioned, a few starter files configured (SOUL.md, CLAUDE.md), then everything else built up through dialogue. Mapping that against the harness-moat-two-layers framework, three categories of operator emerge:

  1. Bootstrap-and-stay (Ray, most current Claude Code users): Run the universal harness as-shipped, build personal-fit on top through markdown + skills + MCP picks. No fork, no shell, no orchestration layer. The “thin wrapper” pattern Ben describes. Productive in days, mature in months. The vast majority of operators can stay here forever and never hit a real wall.

  2. Workflow-shell (Shopify Roast, operators with repeatable structured processes): When you have a process that repeats hundreds of times per week with the same shape (Shopify code review at 1,870 PRs/week), the unstructured agent loop costs more than building a Ruby/Python orchestrator that calls the agent for latent steps and uses deterministic code for the rest. This is the threshold where teams “roll their own” - but they’re rolling their own SHELL, not their own harness.

  3. Full-stack shell (River, very large orgs): Apprenticeship-shape requirements (osmosis learning, public-channel constraints, multi-team skill sharing) force a deployment shell on top of the workflow shell. Only worth building when the org is large enough that the visibility flywheel actually compounds. Tobi’s dataset: 5938 employees, 4450 channels.

The bootstrap floor is genuinely productive, and the “must roll your own” threshold for solo-operator and small-team work is further out than founder intuition suggests. It kicks in only when you have a high-frequency, repeatable, structured process where unstructured agent work is too noisy (the Roast threshold), or when org shape demands osmosis learning across many humans (the River threshold). Neither applies to RDCO’s solo-founder COO surface today.

Synthesis for RDCO

For the Ray-as-a-Service / Ray-Starter-Kit decision, the research supports a productizable kit at the universal-harness + scaffolding-skills layer, not a from-scratch harness. The market reality is: every major harness ships a similar universal layer, every operator’s personal-fit layer is unique and earned, and the interesting product opportunity sits between them - the discipline + scaffolding + first-batch skills that compress the personal-fit accumulation period from months to weeks.

The right shape for the kit: package the universal-discipline layer (the ratchet pattern, hooks-as-enforcement, subagent routing for context rot, splits-for-evaluation, skill format, generative-UI return channel, vault-as-nervous-system, todo+loop vs Notion-queue distinction, memory file format) plus a Layer-1.5 swap kit (MCP server picks, deployment-target swaps, bets.json template, skill ON/OFF menu) plus a starter Layer-2 (10-20 baseline rules earned across multiple founders that are likely to apply broadly). What stays bespoke: each operator’s CLAUDE.md hard rules, voice, and accumulated memory files. The pitch is “we sold you the harness discipline and starter rules; your job is to operate it long enough to earn your own personal-fit layer.”

The escape valves to design INTO the kit (so operators can extend without forking): (1) a Roast-style optional workflow orchestrator slot for when a structured process emerges; (2) a public-channel surface scaffold (HQ + decisions click-back rail + iMessage return channel - already built for RDCO, generalizable) for when collaborators arrive; (3) an MCP “hot swap” registry so operators can add their domain stack without touching skill code; (4) an auto-ratchet hook that emulates Hermes’s pattern - flag candidate skills/rules from session events for human approval, accelerating the Layer 2 fill-in. The most defensible RDCO product is NOT a competing harness; it’s the operator’s playbook + scaffolding + ratchet automation that turns Claude Code (or whichever universal harness) from “powerful tool” into “Ray-class operator” in 6 weeks instead of 6 months.

One sharp risk to flag: Hermes’s “harness already built in, self-improving from day one” is direct competitive overlap with the RDCO Starter Kit thesis. If Hermes’s auto-ratchet + multi-platform gateway works as advertised, it eats the bottom of the market RDCO would otherwise serve. The differentiator must be: RDCO sells the operating discipline + earned-rule starter pack + the specific skill set for COO-class founder work (newsletter, deep research, vault hygiene, finance pulse, content production), not just “an agent that learns from you.” Worth a focused Hermes evaluation - install it, run it for two weeks alongside Ray, measure where the auto-ratchet succeeds and where it produces noise.

Open follow-ups

Sources

Vault

Web