Cross-Check: Agent Architecture Cluster

2026-04-11 · cross-check

Sources checked

Primary (assigned)

  1. 2026-04-11-garry-tan-thin-harness-fat-skills — Garry Tan’s 5-definition framework (thin harness, fat skills, resolvers, latent vs deterministic, diarization)
  2. 2026-04-10-akshay-pachaar-agent-harness-anatomy — 12 components, 7 decisions, the “harness is the product” thesis
  3. 2026-04-10-paddy-srinivasan-agentic-cloud — DigitalOcean CEO, multi-model routing, the Advisor pattern
  4. 2026-04-10-jaya-gupta-anthropic-moat — trust/permission as the scarce asset, capability-governance loop
  5. 2026-03-25-seattle-data-guy-know-nothing-and-be-happy — the comprehension-atrophy failure mode
  6. 2026-04-10-ramp-labs-latent-briefing — KV cache compaction for multi-agent token efficiency

Discovered via QMD (additional relevant sources)

  1. 2026-04-07-claude-code-architecture-teardown — Rohit’s 10-layer reverse engineering of Claude Code’s actual architecture (four-layer framework, async generator loop, 45+ tools, compaction cascade)
  2. 2026-02-27-trq212-seeing-like-an-agent — Thariq (Anthropic insider): tool design shaped to model abilities, progressive disclosure, the TodoWrite-to-Tasks evolution
  3. 2026-04-09-every-four-ai-agents — Every’s 4-agent production setup: shared-database-as-coordination, “describe outcomes not steps”
  4. 2026-04-08-better-harness-evals-hill-climbing — Viv (LangChain): evals as training data for harness engineering, the better-harness loop

Contradictions

C1. “The model doesn’t matter” vs. “multi-model routing is the future” — HIGH

Tan says the 2x and 100x engineers use the same models; the difference is architecture (fat skills vs thin skills). The model is commoditized; the harness is the product.

Paddy says one model won’t do everything; the winning systems will route work across models based on cost, latency, and quality. The Advisor pattern is the future.

Can they both be right? Partially. Tan is arguing that within a single model tier, harness quality explains performance variance. Paddy is arguing that across model tiers, routing is an optimization problem. These are different claims at different abstraction levels. But they carry different strategic implications: Tan says invest in skills, Paddy says invest in routing infrastructure. For RDCO at current scale, Tan’s framing is more actionable — we don’t have enough task volume to justify a routing layer. But Paddy’s thesis becomes relevant the moment we split into multi-agent (the 5-agent autoinv vision), where the Research agent might justify Opus while Paper Testing runs fine on Sonnet.

Resolution: Not a true contradiction. They’re arguing about different surfaces. But our vault treats them as part of the same thesis (“the harness matters more than the model”), and that’s imprecise. The harness matters more than which model you pick, but how many models you route across is a separate architectural question that Tan’s framework doesn’t address.
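
For concreteness, here is a minimal sketch of the cost/quality routing Paddy describes, framed as the decision we would face if the 5-agent autoinv split happens. The tier names, per-token costs, and quality ranks are illustrative assumptions, not numbers from any source.

```python
from dataclasses import dataclass

@dataclass
class ModelTier:
    name: str
    cost_per_1k_tokens: float  # USD per 1k tokens, an illustrative blended rate (assumed)
    quality: int               # relative quality rank, higher is better (assumed)

# Placeholder tiers standing in for "small/fast", "mid", and "frontier" models.
TIERS = [
    ModelTier("small-fast", 0.001, 1),
    ModelTier("mid", 0.010, 2),
    ModelTier("frontier", 0.050, 3),
]

def route(task_complexity: int, budget_per_1k: float) -> ModelTier:
    """Pick the cheapest tier whose quality meets the task's complexity,
    subject to a per-1k-token budget: a toy version of cost/quality routing."""
    candidates = [
        t for t in TIERS
        if t.quality >= task_complexity and t.cost_per_1k_tokens <= budget_per_1k
    ]
    if candidates:
        return min(candidates, key=lambda t: t.cost_per_1k_tokens)
    # Budget too tight for the required quality: take the best tier the budget allows.
    affordable = [t for t in TIERS if t.cost_per_1k_tokens <= budget_per_1k] or TIERS[:1]
    return max(affordable, key=lambda t: t.quality)

print(route(task_complexity=2, budget_per_1k=0.02).name)  # -> mid
```

The point of the sketch is the shape of the decision, not the numbers: a routing layer only earns its complexity once there are enough distinct task profiles (Research on a frontier tier, Paper Testing on a cheaper one) to make the quality/cost table non-trivial.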

C2. “Thin harness” vs. “12 components of a production harness” — MEDIUM

Tan says the harness should be ~200 lines, thin, dumb. Push intelligence UP into skills, push execution DOWN into deterministic tools.

Pachaar identifies 12 components (orchestration loop, tools, memory, context management, prompt construction, output parsing, state management, error handling, guardrails, verification loops, subagent orchestration, plus the implicit twelfth). Each has meaningful complexity. The error recovery system alone is 823 lines in Claude Code’s implementation (per Rohit’s teardown).

Thariq confirms from inside Anthropic that tool design is an art requiring constant iteration, and that Claude Code has ~20 tools carefully maintained — not a trivial layer.

Can they both be right? Yes, but they’re using “harness” to mean different things. Tan means the user-built harness — the orchestration script a builder writes to connect their skills to a model. Pachaar means the platform-level harness — the full Claude Code runtime including all internal machinery. The “thin harness” advice applies to what we build on top of Claude Code. The “12 components” describe what Claude Code itself handles underneath.

Resolution: This is a scoping ambiguity, not a real contradiction. But it’s dangerous because it lets builders think “thin harness” means “don’t think about error handling, context management, or verification” — when actually it means “let the platform handle those and focus your custom layer on domain skills.” Our vault should be more explicit about which layer we’re discussing when we say “harness.”
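
To make the scoping concrete, here is a minimal sketch of the user-built layer under Tan's framing, assuming skills live as markdown files on disk and the Anthropic Python SDK is the entry point; the model id, file paths, and example task are placeholders, not our actual configuration.

```python
from pathlib import Path

import anthropic  # assumes the Anthropic Python SDK is installed and ANTHROPIC_API_KEY is set

MODEL = "claude-sonnet-4-5"  # placeholder model id; substitute whichever tier is actually in use

def load_skill(name: str) -> str:
    """Skills are plain markdown files; the harness just reads them from disk."""
    return (Path("skills") / f"{name}.md").read_text()

def run(skill_name: str, task: str) -> str:
    """The whole user-built layer: skill plus task in, model output back."""
    client = anthropic.Anthropic()
    response = client.messages.create(
        model=MODEL,
        max_tokens=4096,
        system=load_skill(skill_name),  # the fat skill supplies judgment and procedure
        messages=[{"role": "user", "content": task}],  # the invocation supplies parameters
    )
    return response.content[0].text

if __name__ == "__main__":
    print(run("process-newsletter", "Draft this week's Sanity Check section on agent harnesses."))
```

Everything in Pachaar's 12 components (error handling, compaction, permissions, tool execution) sits below a script like this in the platform layer; "thin harness" is advice about keeping our custom layer roughly this small, not about skipping those components.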

C3. “Describe outcomes, not steps” vs. “Fat skills as procedures” — LOW

Every (lesson 1): instructions that specify the desired end state outperform step-by-step procedures. Outcomes are stable; procedures become brittle.

Tan: skills are reusable markdown documents that teach the model a process. The skill describes judgment and procedure; the invocation supplies parameters.

Thariq: as models improve, scaffolding that once helped starts constraining. The TodoWrite tool became a cage for Opus 4.5.

Tension: If models get better at reasoning from outcome descriptions, detailed procedural skills will eventually over-constrain them. The Every approach (“describe the outcome, let the model figure out how”) is a bet on model capability growth. The Tan approach (“encode the procedure in a skill”) is a bet on consistency and transferability.

Resolution: Both are correct at different capability thresholds. The right test is Pachaar’s “future-proofing test”: if a skill’s performance scales up with more powerful models without modification, the design is sound. If you have to keep removing constraints as models improve, the skill was too procedural. RDCO should track which skills break this test.


“Says X, Actually Y” Gaps

G1. “The vault is the comprehension layer” — we say this, but is it actually working?

Our claim (SDG mapping): The vault inverts SDG’s nightmare scenario because every experiment writes a markdown note, every decision writes a project doc. Future sessions can pick up the full causal chain without re-deriving.

The evidence that contradicts this: The PM1e working-context confabulation — numbers that didn’t exist in the CSVs became “fact” in a summary and were nearly sent to the founder. This is documented in the SDG article itself. The vault almost became the vector for exactly the failure mode it’s supposed to prevent: plausible lies written in confident markdown.

Gap: We say the vault is the defense against SDG’s critique, but the vault is only as good as the verification discipline behind it. The vault doesn’t self-verify. If I write a confabulated summary and it gets filed, the vault amplifies the error rather than catching it. The defense isn’t the vault — it’s the BiasAudit pattern + founder reading + re-derivation from source data. The vault is the storage layer; verification is the actual defense. Our framing overstates the vault and understates the process.
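
A minimal sketch of what mechanical re-derivation could look like for the PM1e class of failure: flag any number asserted in a summary that appears in no source CSV. The file paths are hypothetical, and the check is deliberately coarse (legitimately derived aggregates will also be flagged), so its output is a review queue for the founder, not a verdict.

```python
import csv
import re
from pathlib import Path

NUMBER = re.compile(r"\d[\d,]*\.?\d*")

def numbers_in_text(text: str) -> set[str]:
    """Numeric tokens asserted in a summary (commas stripped, trailing dots dropped)."""
    return {m.replace(",", "").rstrip(".") for m in NUMBER.findall(text)}

def numbers_in_csvs(csv_dir: str) -> set[str]:
    """Every numeric cell value across the source CSVs."""
    values: set[str] = set()
    for path in Path(csv_dir).glob("*.csv"):
        with open(path, newline="") as f:
            for row in csv.reader(f):
                for cell in row:
                    cell = cell.strip().replace(",", "")
                    if re.fullmatch(r"\d+(\.\d+)?", cell):
                        values.add(cell)
    return values

def unverified_numbers(summary_path: str, csv_dir: str) -> set[str]:
    """Numbers the summary asserts that appear in no source CSV: confabulation candidates."""
    return numbers_in_text(Path(summary_path).read_text()) - numbers_in_csvs(csv_dir)

# Hypothetical paths; point these at the actual working-context note and source data.
# print(unverified_numbers("working-context.md", "data/pm1e/"))
```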

G2. Ramp’s Latent Briefing is positioned as relevant to our architecture, but it isn’t usable by us

Our claim (Ramp mapping): Latent Briefing is “the kind of primitive that would let us scale multi-agent workflows without the cost explosion.”

Actually: The technique requires KV cache access at the model level — it only works with open models we run ourselves (their example: Qwen3-14B), not with hosted APIs like Claude’s endpoint. We run Claude Code against Anthropic’s API. We don’t run open models. We don’t have inference GPUs. The technique is architecturally inaccessible with our current setup, and likely will remain so for the foreseeable future.

Gap: We filed it as strategically relevant but it’s tactically irrelevant. It belongs in the “interesting research” bucket, not in the “informs our architecture” bucket. Our actual cross-session memory primitive is the text-level bridge notes (working-context.md), which the Ramp note itself acknowledges as the realistic alternative. The vault note should be more honest about this distance.


Stale Assumptions

S1. “CLAUDE.md is our resolver layer” — Tan’s lesson says it’s getting too long

When we said it: The Tan article mapping (April 11) explicitly maps our CLAUDE.md + SOUL.md to Tan’s “resolvers.” But the same note says Tan’s own CLAUDE.md was 20,000 lines before he cut it to 200 lines of pointers.

What’s changed: Our CLAUDE.md is growing with every session. It accumulates memory entries, instructions, project references. If Tan’s lesson is correct — that a 20,000-line instruction file degrades performance — we need to audit ours now, not after it becomes a problem. The note flags “worth auditing” but no action was taken.

Update needed: Schedule a resolver audit. Measure current CLAUDE.md + SOUL.md token count. Determine whether we’re loading context that should be demand-loaded via skills or QMD queries instead.
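
A minimal sketch of the measurement step, assuming the files sit at the repo root and using a rough 4-characters-per-token heuristic rather than Anthropic's actual tokenizer; it is only meant to tell hundreds of tokens from tens of thousands.

```python
from pathlib import Path

CHARS_PER_TOKEN = 4  # rough heuristic, not a real tokenizer; fine for an order-of-magnitude read

for name in ("CLAUDE.md", "SOUL.md"):  # assumed to sit at the repo root
    path = Path(name)
    if not path.exists():
        print(f"{name}: not found")
        continue
    text = path.read_text()
    print(f"{name}: {len(text.splitlines())} lines, ~{len(text) // CHARS_PER_TOKEN:,} tokens")
```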

S2. “Single-agent first, multi-agent later” may be past its window

When we said it: The Pachaar article (April 10) maps our single-threaded autoinv approach to Anthropic + OpenAI guidance: “maximize a single agent first, split only when tool overload exceeds ~10 overlapping tools.”

What’s changed: The vault now documents 20+ skills, multiple MCP servers (QMD, xmcp, Discord, iMessage, Notion, Gmail, Calendar, Playwright, Firebase), and a growing tool surface. Thariq says Claude Code itself has ~20 tools and “constantly asks whether all of them are necessary.” The Pachaar threshold of ~10 overlapping tools may already be exceeded in our daily operation. The “single agent first” assumption may be stale — not because we should rush to multi-agent, but because we should assess whether tool overload is already degrading performance.

Update needed: Inventory the current tool surface loaded in a typical session. If it exceeds the threshold, evaluate whether skill-based lazy loading (Pachaar’s decision 6: “expose minimum tool set needed for the current step”) is already handling it, or whether we need explicit scoping.
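
A minimal sketch of the inventory step, assuming MCP servers are declared in a project-level .mcp.json with an mcpServers map and skills live as markdown files under skills/; adjust both paths to wherever the configuration actually lives. Each MCP server can expose several tools, so the server count understates the real tool surface.

```python
import json
from pathlib import Path

# Assumed locations; adjust to the actual MCP config and skills directory.
MCP_CONFIG = Path(".mcp.json")
SKILLS_DIR = Path("skills")

servers = []
if MCP_CONFIG.exists():
    servers = sorted(json.loads(MCP_CONFIG.read_text()).get("mcpServers", {}))

skills = sorted(p.stem for p in SKILLS_DIR.glob("*.md")) if SKILLS_DIR.exists() else []

print(f"MCP servers configured: {len(servers)} {servers}")
print(f"Skills on disk: {len(skills)}")
if len(servers) >= 10:
    # Pachaar's ~10 overlapping tools threshold, applied loosely at the server level;
    # since servers expose multiple tools each, crossing this line is a strong signal.
    print("At or above the ~10 threshold; check whether lazy loading is actually scoping tools per step.")
```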


Missing Voices

M1. No one in this cluster argues AGAINST the harness thesis

Every source in this cluster — Tan, Pachaar, Paddy, Gupta, SDG, Ramp, Rohit, Thariq, Every, Viv — agrees that the harness/architecture/infrastructure around the model is where value lives. The model is commoditized; the surrounding system is the differentiator.

Who disagrees? We have no filed counter-argument, and no candidates have been identified yet.

Action: Find and file at least one serious counter-argument to the harness-over-model thesis. Without it, the vault has a confirmation bias on its most load-bearing architectural belief.

M2. No infrastructure cost analysis across sources

Multiple sources discuss cost (Ramp’s token explosion, Paddy’s cost-aware routing, Pachaar’s “error handling compounds”) but no one in the cluster provides a concrete cost model. What does it actually cost to run a production agent system per month? How does that scale with usage? What’s the break-even point where routing or compaction pays for itself?

Missing: A source that provides real production cost numbers for agent systems. The Ramp Labs paper has benchmark metrics but no dollar figures. Every doesn’t mention costs. Tan doesn’t mention costs. This is a gap in the cluster’s practical value.

M3. No end-user/customer perspective

All sources are from builders, investors, or vendors. No source represents the end user of an agent system — the person whose code is being written, whose meetings are being processed, whose data is being queried. SDG comes closest (the nightmare engineer who can’t debug), but even he is writing from a builder’s perspective about builder failure modes.

Missing: A source that evaluates agent architecture from the perspective of someone affected by agents rather than someone building them. Trust, quality, reliability, and explainability look different from the receiving end.


Convergences Worth Naming

V1. “The harness is the product” — 6+ independent sources

The strongest convergence in the vault. Sources arriving at this conclusion independently:

| Source | Their framing |
| --- | --- |
| Tan | “The 2x and 100x people use the same models. The difference is architecture.” |
| Pachaar | “Two products using identical models can have wildly different performance based solely on harness design.” LangChain jumped from outside top 30 to rank 5 by changing only infrastructure. |
| Paddy | “The winning systems will not be the most powerful. They will be the most efficiently orchestrated.” |
| Rohit | Layer 4 (infrastructure) is “where products die.” |
| Viv | “If you’re not running evals, you’re not doing harness engineering. You’re doing harness guessing.” |
| Thariq | Tool design shaped to model abilities, not imagined capabilities, is what makes Claude Code work. |

Empirical anchor: McKinsey’s 2025 Global Survey on State of AI (N=1,993 execs) gives the harness gap an empirical number: 62% of orgs are experimenting with agents, only 23% are scaling them — a 39-point “trying vs. shipping” gap that maps directly to the fat-skills / thin-harness framing. The orgs that crossed from experiment to scale are the ones who built (or bought) a real harness; the 39-point gap is the market for harness-as-product.

Recommendation: This convergence is strong enough to name as a concept article: “The Harness Is the Product” — a synthesis article that captures the multi-source agreement and its RDCO implications. It would also make a strong Sanity Check newsletter topic.

V2. “Scaffolding is temporary” — the co-evolution principle

Three sources independently describe the same dynamic: as models improve, harness complexity should decrease.

| Source | Their framing |
| --- | --- |
| Pachaar | Manus rebuilt five times in six months, each rewrite removing complexity. “If performance scales up with more powerful models without adding harness complexity, the design is sound.” |
| Thariq | TodoWrite became a cage for Opus 4.5. They replaced it with a simpler Task tool. “Anthropic regularly deletes planning steps from Claude Code’s harness as new model versions internalize that capability.” |
| Every | “Describe outcomes, not steps” — betting on model capability growth rather than encoding procedure. |

Recommendation: This is the most actionable convergence for RDCO. It suggests we should periodically audit our skills for over-specification — steps that made sense for Opus 4 may over-constrain Opus 5. A quarterly “scaffolding audit” where we test whether removing skill steps improves or degrades output quality.
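
A minimal sketch of what the quarterly audit could look like: run the same eval tasks against the full skill and a trimmed variant and compare mean scores. The run and scoring callables are placeholders for whatever harness and grader we actually use.

```python
from statistics import mean
from typing import Callable

def scaffolding_audit(
    run_skill: Callable[[str, str], str],  # (skill_text, task) -> output; wraps the actual harness
    score: Callable[[str, str], float],    # (task, output) -> 0..1; the eval grader
    full_skill: str,
    trimmed_skill: str,                    # the same skill with procedural steps removed
    tasks: list[str],
) -> dict[str, float]:
    """Run the same eval tasks against the full skill and a trimmed variant.
    If the trimmed variant matches or beats the full one, the removed steps were
    scaffolding the current model no longer needs."""
    return {
        label: mean(score(task, run_skill(skill, task)) for task in tasks)
        for label, skill in (("full", full_skill), ("trimmed", trimmed_skill))
    }

# scores = scaffolding_audit(run, grade, full_md, trimmed_md, eval_tasks)  # all placeholders
# scores["trimmed"] >= scores["full"] is the signal to delete those steps permanently.
```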

V3. “Trust is the moat, not intelligence” — the permission convergence

| Source | Their framing |
| --- | --- |
| Gupta | “The scarce asset in enterprise AI may be shifting from intelligence to permission.” |
| SDG | The failure mode is trusting AI output without comprehension — trust without earned warrant. |
| Pachaar | Permission architecture is one of the 7 core decisions: permissive (fast, risky) vs restrictive (safe, slow). |
| Rohit | Claude Code’s permission system is a 7-stage pipeline; this is not an afterthought. |

Recommendation: Already partially captured in the Gupta note. Could be elevated to a concept article: “Permission as Architecture” — trust isn’t a business concern layered on top; it’s a design constraint that shapes every technical decision.

V4. “The database is the intelligence” — structured context compounds

| Source | Their framing |
| --- | --- |
| Every | “Your database is the agent’s intelligence.” Anton isn’t smart because the prompt is clever — it’s smart because the data is well-structured. |
| Tan | Diarization: the model reads everything about a subject and writes a structured profile. The vault IS the product. |
| Pachaar | Memory as multi-timescale: short-term, long-term, the agent treats its own memory as a “hint” and verifies against actual state. |
| Ramp | Cross-agent context sharing is the bottleneck; the quality of what’s passed between agents determines system quality. |

Recommendation: This validates our vault architecture more than any single source does. The concept article candidate: “Structured Context Compounds” — the quality of what the agent knows is a function of how the knowledge is organized, not how much there is.


Recommended Actions

Immediate (this week)

  1. Resolver audit. Measure CLAUDE.md + SOUL.md total token count. If >5,000 tokens, identify what can move to on-demand skill loading or QMD queries. Tan’s 200-line target is the benchmark. (Source: S1)

  2. Tool surface inventory. List all tools loaded in a typical session. Count against the ~10-tool threshold from Pachaar. Determine whether lazy loading is already handling this or whether explicit scoping is needed. (Source: S2)

  3. Reclassify Ramp Latent Briefing. Add a note to the Ramp article making explicit that the technique is architecturally inaccessible to RDCO with its current setup. Move it from “informs our architecture” to “interesting research, revisit if we run open models.” (Source: G2)

Short-term (this month)

  1. File a counter-argument. Find and vault at least one serious source arguing against the harness thesis or against agent architectures generally. The vault’s biggest epistemic risk right now is confirmation bias on its core thesis. (Source: M1)

  2. Write concept article: “The Harness Is the Product.” Synthesize V1 across all six sources. This is the vault’s strongest cross-source convergence and deserves a named concept. (Source: V1)

  3. Clarify “harness” scoping in vault vocabulary. When we say “harness,” are we talking about the user-built layer (Tan’s ~200 lines) or the platform layer (Pachaar’s 12 components)? Add a disambiguation note or vault glossary entry. (Source: C2)

Medium-term (next quarter)

  1. Build the scaffolding audit habit. When a new model version drops, test whether removing steps from our most-used skills (check-board, process-newsletter, compile-vault) improves or degrades output. This is the co-evolution principle applied. (Source: V2)

  2. Build an /improve skill. Tan’s framework includes a self-improvement loop that reads feedback and rewrites skills. We do this manually. Automating it — even partially — is the next capability tier. The Viv evals-as-training-data methodology provides the how. (Sources: Tan + Viv)

  3. Add verification loops. Pachaar says this is “what separates toy demos from production agents.” We don’t have formal verification in our skill workflows. Adding a post-execution check (even a simple “does the output match the expected format”) would be a meaningful quality improvement. (Source: Pachaar component 10)
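
A minimal sketch of the kind of post-execution check the item above describes; the three expected formats are assumptions, and real skills would declare their own.

```python
import json
import re

def check_output_format(output: str, expected: str) -> tuple[bool, str]:
    """Cheap post-execution verification: does the output match the expected shape?"""
    if expected == "json":
        try:
            json.loads(output)
            return True, "valid JSON"
        except json.JSONDecodeError as exc:
            return False, f"invalid JSON: {exc}"
    if expected == "markdown-with-headings":
        has_heading = bool(re.search(r"^#{1,6} ", output, re.MULTILINE))
        return has_heading, "found a markdown heading" if has_heading else "no markdown headings found"
    if expected == "non-empty":
        ok = bool(output.strip())
        return ok, "non-empty" if ok else "empty output"
    return False, f"unknown expected format: {expected}"

# Wire this in after a skill runs; on failure, retry or flag for review instead of filing the output.
ok, reason = check_output_format('{"status": "done"}', "json")
```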


Key Question Answers

Do Tan’s “fat skills” and Pachaar’s “12 components” describe the same thing? No. They describe different layers of the same system. Tan’s fat skills are the user-built domain intelligence layer. Pachaar’s 12 components are the platform infrastructure underneath. Tan’s “thin harness” is thin because Pachaar’s 12 components exist at the platform level. They’re complementary, not competing.

Does Paddy’s “multi-model routing” contradict Tan’s “the model doesn’t matter”? Not directly. Tan says that within a single tier, model choice explains less variance than harness quality. Paddy says that across tiers, routing is an optimization problem. Both can be true. But Tan’s framing is incomplete — it doesn’t address the cost/quality tradeoff that routing solves. For RDCO, this becomes real when we go multi-agent.

Does SDG’s critique identify a failure mode Tan doesn’t address? Yes. Tan’s framework assumes the operator maintains comprehension of what the skills do and what the agent produces. SDG’s critique is that this comprehension atrophies precisely because the system works well — the better the agent, the less the human reads. Tan’s self-improvement loop (skills rewrite themselves) actually accelerates this: if skills are rewriting themselves, who verifies the rewrite? Our vault’s own PM1e confabulation incident is the proof case. The defense is verification discipline (BiasAudit, founder reading, re-derivation from source), not architecture.

Is anyone arguing AGAINST the harness thesis? Not in our vault. This is the biggest gap. See M1.

What RDCO assumptions should we update? The two flagged as stale above: that CLAUDE.md can keep serving as the resolver layer while it grows unaudited (S1), and that we are still safely below the tool-overload threshold that justifies staying single-agent (S2). Both have concrete actions in the immediate list.