The Anatomy of an Agent Harness — @akshay_pachaar
Why this is in the vault
Founder flagged this alongside the Ramp Latent Briefing paper: “good content for framing the agentic enablement and custom harness development.” This article is the framing piece — it gives us the conceptual map for what a production agent system actually looks like. The Ramp paper is about optimizing one piece of that map (cross-agent memory); this article is the map itself.
Directly relevant to our automated investing 5-agent vision and to the broader Ray Data Co thesis of building infrastructure for agents.
The core insight
The agent is the emergent behavior. The harness is the machinery producing it. When someone says “I built an agent,” they mean they built a harness and pointed it at a model.
The term was formalized in early 2026 but the concept existed before. Anthropic’s Claude Code docs say explicitly that “the SDK is the agent harness that powers Claude Code.” OpenAI’s Codex team uses the same framing. LangChain’s Vivek Trivedy has the best one-liner: “If you’re not the model, you’re the harness.”
Two products using identical models can have wildly different performance based solely on harness design. LangChain demonstrated this on TerminalBench 2.0 — they changed only the infrastructure wrapping their LLM (same model, same weights) and jumped from outside the top 30 to rank 5. A separate research project hit a 76.4% pass rate by having an LLM optimize the infrastructure itself, surpassing hand-designed systems.
The harness is not a solved problem or a commodity layer. It’s where the hard engineering lives.
The Von Neumann analogy
Beren Millidge (2023): a raw LLM is a CPU with no RAM, no disk, no I/O. The context window is RAM (fast but limited). External databases are disk storage (large but slow). Tool integrations are device drivers. The harness is the operating system.
“We have reinvented the Von Neumann architecture.”
Three levels of engineering
Three concentric layers surround the model:
- Prompt engineering — crafts the instructions the model receives
- Context engineering — manages what the model sees and when
- Harness engineering — encompasses both, plus tool orchestration, state persistence, error recovery, verification loops, safety enforcement, and lifecycle management
The harness is not a wrapper around a prompt. It is the complete system that makes autonomous agent behavior possible.
The 12 components of a production harness
Synthesized across Anthropic, OpenAI, LangChain, and the practitioner community. I’m paraphrasing the key points for each in my own words rather than reproducing the article verbatim.
1. The orchestration loop
The heartbeat: Thought → Action → Observation (TAO), aka the ReAct loop. Assemble prompt, call LLM, parse output, execute tool calls, feed results back, repeat until no more tool calls. Mechanically it’s a while loop; the complexity lives in what the loop manages, not the loop itself. Anthropic frames their runtime as a “dumb loop” where all intelligence lives in the model.
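Mechanically, the loop really is this small. A rough Python sketch of the shape (call_llm and execute_tool are my own placeholders, not any SDK's API):

```python
from typing import Callable

def run_agent(call_llm: Callable, execute_tool: Callable,
              system_prompt: str, user_message: str, max_turns: int = 20) -> str:
    """Thought -> Action -> Observation until the model stops calling tools."""
    messages = [{"role": "system", "content": system_prompt},
                {"role": "user", "content": user_message}]
    for _ in range(max_turns):
        reply = call_llm(messages)                      # inference on the assembled context
        messages.append(reply)
        tool_calls = reply.get("tool_calls") or []
        if not tool_calls:                              # no tool calls -> final answer
            return reply["content"]
        for call in tool_calls:                         # execute each requested action
            observation = execute_tool(call)            # sandboxed execution lives elsewhere
            messages.append({"role": "tool",
                             "tool_call_id": call["id"],
                             "content": observation})   # feed the result back as an observation
    raise RuntimeError("hit max-turn limit")            # one of the layered termination conditions
```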
2. Tools
The agent’s hands. Schemas (name, description, parameter types) injected into the LLM’s context so it knows what’s available. The tool layer handles registration, validation, argument extraction, sandboxed execution, result formatting. Claude Code exposes tools in six categories: file ops, search, execution, web, code intelligence, subagent spawning.
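What the tool layer looks like in miniature (the registry, decorator, and schema shape below are my assumptions, not Claude Code's actual implementation):

```python
import json

TOOLS: dict[str, dict] = {}

def register_tool(name: str, description: str, parameters: dict):
    """Attach a JSON-schema-style spec to a plain function and record it in the registry."""
    def decorator(fn):
        TOOLS[name] = {"fn": fn,
                       "schema": {"name": name,
                                  "description": description,
                                  "parameters": parameters}}
        return fn
    return decorator

@register_tool("read_file", "Read a UTF-8 text file from the workspace.",
               {"type": "object",
                "properties": {"path": {"type": "string"}},
                "required": ["path"]})
def read_file(path: str) -> str:
    with open(path, encoding="utf-8") as f:
        return f.read()

def tool_schemas() -> str:
    """This is what actually gets injected into the model's context."""
    return json.dumps([t["schema"] for t in TOOLS.values()], indent=2)
```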
3. Memory
Multi-timescale. Short-term = conversation history within a session. Long-term = persists across sessions. Claude Code’s three-tier hierarchy: lightweight index (~150 chars/entry, always loaded), detailed topic files pulled on demand, raw transcripts accessed via search only. Critical design principle: the agent treats its own memory as a “hint” and verifies against actual state before acting. This matches what I do with the working-context bridge notes in our own setup.
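A sketch of the tiered loading plus the hint-verification habit (file names and layout are mine; the article only describes the shape):

```python
from pathlib import Path

def load_memory(memory_dir: Path, topic: str | None = None) -> str:
    """Tier 1: compact index, always loaded. Tier 2: topic file, pulled on demand."""
    context = (memory_dir / "index.md").read_text(encoding="utf-8")
    topic_file = memory_dir / f"{topic}.md" if topic else None
    if topic_file and topic_file.exists():
        context += "\n\n" + topic_file.read_text(encoding="utf-8")
    return context

def verify_hint(hint_path: str, workspace: Path) -> bool:
    """Memory is a hint: confirm the remembered file still exists before acting on it."""
    return (workspace / hint_path).exists()
```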
4. Context management
Where many agents fail silently. The core problem is context rot — model performance degrades 30%+ when key content falls in mid-window positions (Chroma research, corroborated by Stanford’s “Lost in the Middle”). Even million-token windows degrade on instruction-following as context grows.
Production strategies:
- Compaction — summarize conversation history approaching limits. Claude Code preserves architectural decisions and unresolved bugs, discards redundant tool outputs.
- Observation masking — JetBrains Junie hides old tool outputs while keeping tool calls visible
- Just-in-time retrieval — lightweight identifiers, load data dynamically. Claude Code uses grep/glob/head/tail rather than loading full files.
- Sub-agent delegation — each subagent explores extensively but returns only 1,000-2,000 token condensed summaries
Anthropic’s stated goal: find the smallest possible set of high-signal tokens that maximize likelihood of the desired outcome.
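Observation masking is the easiest of these strategies to show concretely. A sketch under an assumed message format (role-tagged dicts):

```python
def mask_old_observations(messages: list[dict], keep_last: int = 5) -> list[dict]:
    """Replace all but the most recent tool outputs with a stub; the tool calls
    themselves stay visible so the model keeps the narrative of what it tried."""
    tool_indices = [i for i, m in enumerate(messages) if m.get("role") == "tool"]
    recent = set(tool_indices[-keep_last:])
    return [{**m, "content": "[output elided to save context]"}
            if m.get("role") == "tool" and i not in recent else m
            for i, m in enumerate(messages)]
```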
5. Prompt construction
Hierarchical: system prompt, tool definitions, memory files, conversation history, current user message. OpenAI’s Codex uses a strict priority stack: server-controlled system (highest), tool definitions, developer instructions, user instructions (cascading AGENTS.md files with 32 KiB limit), then conversation history.
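A hedged sketch of the layered assembly (the PromptLayers shape is mine, not Codex's):

```python
from dataclasses import dataclass, field

@dataclass
class PromptLayers:
    system: str                                   # highest priority, harness-controlled
    tool_definitions: str                         # schemas for every exposed tool
    memory: str                                   # long-term notes loaded for this session
    history: list = field(default_factory=list)   # prior turns, already role-tagged dicts
    user_message: str = ""                        # the current request

def assemble_prompt(p: PromptLayers) -> list[dict]:
    """Higher-priority layers go first, the fresh user message goes last,
    keeping the important material at the edges (Lost in the Middle)."""
    preamble = "\n\n".join([p.system, p.tool_definitions, p.memory])
    return ([{"role": "system", "content": preamble}]
            + p.history
            + [{"role": "user", "content": p.user_message}])
```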
6. Output parsing
Modern harnesses use native tool calling (structured tool_calls objects), not free-text parsing. The harness checks: tool calls present? Execute and loop. No tool calls? Final answer.
7. State management
LangGraph: typed dicts flowing through graph nodes with reducers, checkpointing at super-step boundaries, resume after interruptions, time-travel debugging. OpenAI: four mutually-exclusive strategies — application memory, SDK sessions, server-side Conversations API, or lightweight previous_response_id chaining. Claude Code’s approach: git commits as checkpoints and progress files as structured scratchpads. We do exactly this.
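The LangGraph flavor, roughly (a sketch from memory; exact imports and signatures may differ across langgraph versions):

```python
from typing import Annotated, TypedDict
from operator import add
from langgraph.graph import StateGraph, END
from langgraph.checkpoint.memory import MemorySaver

class AgentState(TypedDict):
    messages: Annotated[list, add]   # reducer: node outputs are appended, not overwritten
    done: bool

def llm_call(state: AgentState) -> dict:
    # placeholder node: a real node would call the model and parse tool calls here
    return {"messages": ["(model turn)"], "done": True}

def should_continue(state: AgentState) -> str:
    return END if state["done"] else "llm_call"

builder = StateGraph(AgentState)
builder.add_node("llm_call", llm_call)
builder.set_entry_point("llm_call")
builder.add_conditional_edges("llm_call", should_continue)
graph = builder.compile(checkpointer=MemorySaver())   # checkpoints at super-step boundaries
```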
8. Error handling
A 10-step process with 99% per-step success still has only ~90.4% end-to-end success. Errors compound fast. LangGraph distinguishes four error types: transient (retry with backoff), LLM-recoverable (return as ToolMessage, let the model adjust), user-fixable (interrupt for human input), unexpected (bubble up for debugging). Stripe’s production harness caps retries at two.
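A sketch of the taxonomy plus a capped retry, in the spirit of the above (the exception classes and where the cap sits are my assumptions):

```python
import random
import time

class TransientError(Exception): pass        # retry with backoff
class LLMRecoverableError(Exception): pass   # hand back to the model as a tool result
class UserFixableError(Exception): pass      # interrupt and ask the human
# anything else is unexpected and should bubble up for debugging

def execute_with_retries(tool_call, execute, max_retries: int = 2):
    """Cap retries (Stripe-style) so compounding failures surface quickly."""
    for attempt in range(max_retries + 1):
        try:
            return execute(tool_call)
        except TransientError:
            if attempt == max_retries:
                raise
            time.sleep(2 ** attempt + random.random())   # exponential backoff with jitter
        except LLMRecoverableError as exc:
            return {"role": "tool", "is_error": True, "content": str(exc)}  # let the model adjust
        except UserFixableError:
            raise   # surfaces to the harness's interrupt / ask-the-user path
```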
9. Guardrails and safety
OpenAI SDK: input guardrails (first agent), output guardrails (final output), tool guardrails (every tool invocation). Tripwire mechanism halts the agent immediately.
Anthropic architecturally separates permission enforcement from model reasoning. The model decides what to attempt; the tool system decides what’s allowed. Claude Code gates ~40 discrete tool capabilities independently across three stages: trust at project load, permission check before each tool call, explicit user confirmation for high-risk operations.
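A toy tripwire-style tool guardrail to make the shape concrete (deny list, names, and checks are illustrative, not anyone's production rules):

```python
DENY_SUBSTRINGS = ("rm -rf", "DROP TABLE", "curl | sh")

class TripwireTriggered(Exception):
    """Raised to halt the agent immediately."""

def tool_guardrail(tool_name: str, arguments: dict) -> None:
    """Runs before every tool invocation, independent of the model's reasoning."""
    blob = f"{tool_name} {arguments}"
    if any(bad in blob for bad in DENY_SUBSTRINGS):
        raise TripwireTriggered(f"blocked {tool_name}: matched deny list")
    if tool_name == "shell" and not arguments.get("user_confirmed", False):
        raise TripwireTriggered("shell execution requires explicit user confirmation")
```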
10. Verification loops
This is what separates toy demos from production agents. Three approaches:
- Rules-based feedback (tests, linters, type checkers)
- Visual feedback (screenshots via Playwright for UI tasks)
- LLM-as-judge (separate subagent evaluates output)
Boris Cherny (creator of Claude Code): giving the model a way to verify its work improves quality by 2 to 3×.
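The rules-based flavor is cheap to add. A minimal sketch (assumes pytest; the return shape is mine):

```python
import subprocess

def verify_with_tests(workdir: str) -> dict:
    """Deterministic 'sensor': the agent sees pass/fail plus the failing output."""
    proc = subprocess.run(["pytest", "-q"], cwd=workdir,
                          capture_output=True, text=True, timeout=600)
    return {
        "passed": proc.returncode == 0,
        "feedback": (proc.stdout + proc.stderr)[-2000:],   # trimmed to respect the context budget
    }
```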
11. Subagent orchestration
Claude Code has three execution models: Fork (byte-identical copy of parent context), Teammate (separate terminal pane with file-based mailbox), Worktree (own git worktree, isolated branch). OpenAI SDK: agents-as-tools + handoffs. LangGraph: subagents as nested state graphs.
12. (implicit) — the twelfth component isn’t in my extracted notes, but the article says “twelve distinct components” up front; I’m treating the count as framing for the whole system rather than a discrete slot
How the loop works end-to-end
Article walks through a 7-step cycle:
- Prompt assembly — system + tools + memory + history + current message. Important context goes at the beginning and end (Lost in the Middle finding).
- LLM inference — assembled prompt to API, model generates text + tool call requests
- Output classification — no tool calls → end. Tool calls → execute. Handoff → switch agent.
- Tool execution — validate args, check permissions, sandboxed execution, capture results. Read-only tools run concurrently; mutating tools serially.
- Result packaging — format as LLM-readable. Errors caught and returned as error results so the model can self-correct.
- Context update — append to history. If near window limit, trigger compaction.
- Loop — return to step 1 until termination.
Termination conditions are layered: text response with no tool calls, max turn limit, token budget exhausted, guardrail tripwire, user interrupt, or safety refusal.
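The concurrency rule in step 4 (read-only tools concurrent, mutating tools serial) is easy to get wrong. A sketch with asyncio, where execute_async and is_read_only are assumed callables:

```python
import asyncio

async def execute_tool_calls(calls: list[dict], execute_async, is_read_only) -> list:
    """Fan out safe reads concurrently, then apply side-effecting calls one at a time, in order."""
    results: list = [None] * len(calls)
    reads = [(i, c) for i, c in enumerate(calls) if is_read_only(c)]
    writes = [(i, c) for i, c in enumerate(calls) if not is_read_only(c)]
    read_results = await asyncio.gather(*(execute_async(c) for _, c in reads))
    for (i, _), result in zip(reads, read_results):
        results[i] = result
    for i, c in writes:
        results[i] = await execute_async(c)
    return results
```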
Ralph Loop pattern (Anthropic, for long-running tasks spanning multiple context windows): an Initializer Agent sets up environment (init script, progress file, feature list, initial git commit), then a Coding Agent in every subsequent session reads git logs and progress files to orient itself, picks highest-priority incomplete feature, works on it, commits, writes summaries. The filesystem provides continuity across context windows.
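The orientation step is the interesting bit. A hedged sketch (the PROGRESS.md / FEATURES.md names and the checkbox convention are my assumptions about the shape, not the article's spec):

```python
import subprocess
from pathlib import Path

def orient(repo: Path) -> dict:
    """Rebuild working context from the filesystem rather than from a prior context window."""
    git_log = subprocess.run(["git", "log", "--oneline", "-20"],
                             cwd=repo, capture_output=True, text=True).stdout
    progress = (repo / "PROGRESS.md").read_text(encoding="utf-8")
    features = (repo / "FEATURES.md").read_text(encoding="utf-8")
    # pick the first unchecked feature as the highest-priority incomplete item
    next_feature = next((line for line in features.splitlines()
                         if line.strip().startswith("- [ ]")), None)
    return {"git_log": git_log, "progress": progress, "next_feature": next_feature}
```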
How the major frameworks implement it
- Anthropic Claude Agent SDK: query() function, async iterator streaming messages, “dumb loop” where all intelligence is in the model. Claude Code uses a Gather-Act-Verify cycle.
- OpenAI Agents SDK: Runner class with async / sync / streamed modes. Code-first: workflow in native Python, not graph DSLs. Codex has a three-layer architecture (Codex Core, App Server, client surfaces).
- LangGraph: explicit state graph. Two nodes (llm_call and tool_node) with a conditional edge. Evolved from LangChain’s deprecated AgentExecutor.
- CrewAI: role-based multi-agent (Agent / Task / Crew) with a Flows layer for “deterministic backbone with intelligence where it matters.”
- AutoGen / Microsoft Agent Framework: conversation-driven orchestration, 5 patterns (sequential, concurrent, group chat, handoff, magentic).
The scaffolding metaphor (and the co-evolution principle)
Scaffolding is precise, not decorative. Construction scaffolding is temporary infrastructure workers use to reach floors they otherwise couldn’t — it doesn’t do the construction, but without it nothing gets built.
Key insight: scaffolding is removed when the building is complete. As models improve, harness complexity should decrease. Manus was rebuilt five times in six months, each rewrite removing complexity. Complex tool definitions became general shell execution. “Management agents” became simple structured handoffs.
Co-evolution principle: models are now post-trained with specific harnesses in the loop. Claude Code’s model learned to use the specific harness it was trained with. Changing tool implementations can degrade performance because of this tight coupling.
The “future-proofing test” for harness design: if performance scales up with more powerful models without adding harness complexity, the design is sound.
The seven architectural decisions
Every harness architect chooses:
- Single-agent vs multi-agent. Both Anthropic and OpenAI say: maximize a single agent first. Split only when tool overload exceeds ~10 overlapping tools or clearly separate task domains exist. This directly validates the founder’s “single-threaded staged approach” guidance for automated investing.
- ReAct vs plan-and-execute. ReAct interleaves reasoning and action at every step (flexible, higher per-step cost). Plan-and-execute separates them. LLMCompiler reports 3.6× speedup over sequential ReAct.
- Context window management. Five production approaches: time-based clearing, summarization, observation masking, structured note-taking, sub-agent delegation. ACON research: 26-54% token reduction while preserving 95%+ accuracy by prioritizing reasoning traces over raw tool outputs.
- Verification loop design. Computational (tests, linters — deterministic) vs inferential (LLM-as-judge — catches semantic issues, adds latency). Martin Fowler frames this as guides (feedforward, steer before action) vs sensors (feedback, observe after action).
- Permission architecture. Permissive (fast, risky) vs restrictive (safe, slow). Context-dependent.
- Tool scoping. More tools often means worse performance. Vercel removed 80% of tools from v0 and got better results. Claude Code achieves 95% context reduction via lazy loading. Principle: expose the minimum tool set needed for the current step (rough sketch after this list).
- Harness thickness. How much logic lives in the harness vs the model. Anthropic bets on thin harnesses; graph-based frameworks bet on explicit control. Anthropic regularly deletes planning steps from Claude Code’s harness as new model versions internalize that capability.
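The tool-scoping sketch referenced above, assuming a registry keyed by tool name (the step-to-tools mapping is mine):

```python
STEP_TOOLS = {
    "explore": ["grep", "glob", "read_file"],
    "edit":    ["read_file", "write_file"],
    "verify":  ["run_tests"],
}

def tools_for_step(step: str, registry: dict) -> list[dict]:
    """Only the schemas needed for the current step enter the context window."""
    return [registry[name]["schema"]
            for name in STEP_TOOLS.get(step, [])
            if name in registry]
```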
What this means for Ray Data Co
Direct validation of our current approach:
- Our 5-agent staged migration matches the “maximize a single agent first” rule exactly
- Our autoinv package pattern is the “thin harness” bet — reusable Python functions, fewer classes, strategy logic in ~100-line scripts
- Our bridge-notes pattern (../.claude-code/state/working-context.md via PreCompact/SessionStart hooks) is the text-level analog of Ramp’s KV-cache approach
- Our TimeSeriesSplit-only validation is the Halls-Moore Ch 3 “guards” pattern formalized
Gaps this article surfaces that we should close:
- We don’t have a verification loop yet. Strategies are evaluated post-hoc via metrics, not during execution via guards/sensors. Adding a “PreSignal” hook that checks the rolling Brier score against its expected value before a trade fires would be a real guard (sketch after this list).
- We don’t have a formal error taxonomy (transient vs LLM-recoverable vs user-fixable vs unexpected). Worth adding when we build the Stage 2 monitor/risk agent.
- We don’t have a tool scoping discipline. autoinv exposes all functions at import time. As the package grows, a “load tools on demand” pattern might matter.
- The “harness is the product” framing applies to our data-product thesis: if we sell the xmcp-powered sentiment pipeline as a data product for agents, the harness around it IS the product.
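The PreSignal guard, sketched out. Purely hypothetical: the threshold, window, and signal shape are all made up, and nothing in autoinv looks like this yet.

```python
def pre_signal_guard(recent_probs: list[float], recent_outcomes: list[int],
                     expected_brier: float, tolerance: float = 0.05) -> bool:
    """Return True if the trade signal may fire; False trips the guard."""
    if len(recent_probs) < 30:        # not enough history to judge calibration
        return True
    brier = sum((p - o) ** 2
                for p, o in zip(recent_probs, recent_outcomes)) / len(recent_probs)
    return brier <= expected_brier + tolerance
```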
The one-liner to internalize:
The next time your agent fails, don’t blame the model. Look at the harness.
Related
- 2026-04-10-ramp-labs-latent-briefing — companion article, optimizes one piece of the harness (cross-agent memory)
- ../01-projects/automated-investing/architecture-vision — our target shape, validated by this article’s seven decisions
- ../01-projects/automated-investing/experiments/consolidation-pass — current thin-harness state
- 2026-04-10-halls-moore-algo-trading — our existing “guards vs sensors” pattern for backtesting biases
- Beren Millidge (2023) “Scaffolded LLMs as Natural Language Computers” — the Von Neumann analogy source
Tracked author
../03-contacts — consider adding Akshay Pachaar (@akshay_pachaar) to the CRM when we open task #4. Co-founder of dailydoseofds, publishes substantive framing work on AI systems.