Better Harness: A Recipe for Harness Hill-Climbing with Evals

From @Vtrivedy10 (Viv, LangChain agents & evals team). One of the clearest distillations of systematic agent improvement methodology published to date — the kind of thing that will look obvious in hindsight, once everyone is doing it.

Core Thesis

Evals are training data for harness engineering. That’s the whole argument. In ML, you have a model + training data + gradient descent and you get a better model. In agent development, the analogous loop is: harness + evals + deliberate harness engineering = better agent.

The implication is uncomfortable for most teams: if you’re not running evals, you’re not doing harness engineering. You’re doing harness guessing. The harness (system prompt, tool descriptions, tool composition, skill instructions) is the only variable you directly control. Evals are the signal that tells you whether a change was an improvement or an accident.

This is not primarily an argument about eval frameworks or testing infrastructure. It’s an argument about epistemology: how do you know if your agent got better?

Three Sources of Evals

Viv identifies three where evals come from:

1. Hand-curated — You write them. You already know the failure modes, so you write test cases that exercise them. Fast to produce, high quality, but limited by your own imagination of what can go wrong.

2. Production traces — Captures from real agent runs, tagged by outcome. This is where leverage lives. Production reveals failure modes you never anticipated, at a volume you couldn’t manufacture manually, from real users with real prompts. The failure modes you didn’t anticipate are exactly the ones that bite you.

3. External datasets — Public benchmarks and community-contributed eval sets. Useful for baseline comparison and for evaluating generalization on tasks outside your specific domain.

Production traces are the highest-leverage source because they compound: as your agent handles more real work, your eval set grows automatically. The gap between “I think this handles all the edge cases” and “here’s what production actually threw at it” is usually embarrassing.

The Better-Harness Loop

The methodology Viv proposes is:

Source evals — draw from all three sources above; prioritize production traces once you have them
Tag evals — label by failure type, difficulty, domain, whatever dimensions help you diagnose
Split optimization/holdout — hold back a portion before you start optimizing; this is the generalization gate
Baseline — run your current harness against all optimization evals; record the score
Optimize — make one change at a time (prompt wording, tool description, composition guidance, etc.)
Validate — score on optimization set again; confirm improvement
Human review — check holdout set manually; verify generalization, catch reward hacking
Repeat — loop back to step 5

The critical constraint: one change at a time. Multiple changes simultaneously creates attribution confusion — you won’t know what caused the improvement or the regression. This is gradient descent applied to harness engineering. You need a clean signal.

Generalization vs. Overfitting

This is where the analogy to ML gets serious. Agents are, in Viv’s phrase, “famous cheaters” at reward hacking. Give an agent a fixed eval set and it will — through prompt sensitivity alone, without any explicit optimization — start to over-index on the specific patterns in that set. The harness improves on the optimization set while quietly degrading on inputs it hasn’t seen.

Two defenses:

Holdout sets — Never touch these during optimization. They exist only as a final generalization check. If your agent improved on optimization evals but holds or degrades on holdout, you’ve overfit. Start over or revert.

Human review as second signal — Metrics lie in specific ways that humans catch. When a change looks good on the numbers, read the outputs. Are they actually better, or are they gaming the scoring rubric? Human review is expensive, which is why you don’t do it every iteration — but it’s the only check on metric gaming that can’t itself be gamed.

Types of Harness Changes

What you’re actually changing during the optimize step:

Prompt and instruction updates — Rewording system prompt sections, adding or removing examples, adjusting tone, clarifying scope
Tool descriptions — How you describe what a tool does shapes when and how the agent uses it. Description engineering is underrated. The skill-creator plugin runs 20 trigger queries (10 should-trigger, 10 should-not) with a 60/40 train/test split just for optimizing skill descriptions — that’s the level of rigor this deserves.
Tool composition guidance — When to use tools in sequence, when to use them in parallel, what to do when a tool returns nothing

Tool description quality is a high-leverage surface. Small rewording changes in how a tool is described often produce disproportionate behavior changes.

Results

Viv tested the methodology with Claude Sonnet 4.6 and GLM-5. Both showed strong generalization from optimization evals to holdout sets — the changes that looked like improvements on the training-side of the split were real improvements, not overfit. This matters because it suggests the loop produces durable signal, not just local optima.

Future Directions Viv Flags

Automated error detection from traces — Using a classifier to identify failure traces without human labeling; the diff between expected and actual output becomes a training signal automatically
Eval generation from production — Turning failed production traces directly into eval cases, closing the loop between real usage and test coverage
Trace-based version comparison — A/B testing harness versions against real traffic, using trace quality as the outcome metric

All three of these push toward a system where the harness improves continuously from production, with humans in the loop for edge cases and direction-setting, not routine scoring.

RDCO Application

Formalize Skill Evals

We already have an eval structure. The skill-creator plugin gave us the template: test prompts, baseline comparison, grader subagent, HTML eval viewer. The research-brief skill has a live evals/evals.json with three hand-curated cases covering DWH caching, context engineering, and the narrowing funnel angle.

What we lack is the holdout discipline. All three research-brief evals are in the optimization pool — there’s no held-out set we can’t look at. Before the next iteration of any skill, split: put at least one eval in holdout and commit to not using it for optimization.

Next step: when building or significantly revising a skill, enforce the split before touching the harness. Write the holdout cases first, lock them, then iterate.

Mine the Vault Changelog for Failure Patterns

SOUL.md mandates that significant vault changes go in log.md and major decisions in decisions.md. That log is a production trace. The entries where we document a skill that misfired, a prompt that needed revision, or an edge case we hadn’t anticipated are exactly the raw material for hand-curated evals.

Right now those failure patterns live in prose in the log. The next maturity step is tagging them by skill and failure type so they can feed an eval set directly. A lightweight convention — a #fail::skill-name tag in log entries — would make this mechanical.

Holdout Testing for Skills Before Shipping

The current skill development process (informed by skill-creator) generates test prompts during creation and runs them before shipping. But there’s no holdout discipline, and there’s no systematic tracking of whether a change improved or degraded performance on unseen inputs.

Proposal: for any skill with >= 3 evals, require a train/holdout split before any harness change. The holdout score gates the merge. One change at a time during optimization. Human review of holdout outputs before calling a version “shipped.”

This is Level 4 discipline (Four Levels of AI Use) — we’re not just building custom tools, we’re building the infrastructure to improve them systematically. That’s the compounding moat.

Consulting Application (phData)

The Better-Harness loop is a deliverable, not just a methodology. Clients building internal agents don’t know what they don’t know about failure modes — they ship a harness, watch it fail intermittently, and patch reactively. The eval framework gives a consulting engagement a systematic improvement process that clients can own after handoff.

A phData agent engagement structured around this methodology would look like:

Discovery — identify the three eval sources available to the client (hand-curated from SMEs, production traces from existing agent usage or similar workflows, external benchmarks if applicable)
Baseline — instrument the current harness and score it across the optimization eval set
Sprint cycles — harness engineering sprints with one-change-at-a-time discipline, holdout gate at the end of each sprint
Handoff — the client receives the eval infrastructure, not just the harness; they can continue the loop independently

The differentiator over “we’ll build you an agent” is “we’ll build you an agent and the process for improving it.” That’s a retainer conversation, not just a project. See 01-projects/phdata/index for the broader consulting positioning.

This pairs naturally with the Compound Engineering framework (Compound Engineering): each sprint’s compound step is updating the eval set with failure patterns from that sprint, not just updating the harness. The eval set compounds alongside the harness.

Connection to cc-wrapped (Mammoth Growth Prior Work)

Mammoth Growth is where the instinct for agentic tooling was built — the 1099 Technical Architect exposure to diverse client stacks created the pattern recognition for what “custom tools only you would ever build” actually looks like in practice.

The Better-Harness loop is the systematic version of what was being done intuitively when building agentic workflows at the client level: try something, observe output, adjust the prompt or tool description, try again. What Viv’s methodology adds is the holdout discipline and the one-change-at-a-time constraint that turns intuition into a reproducible process. The artifact we didn’t formalize then — the eval set and the trace log — is exactly what we should formalize now.

Vault Connections

06-reference/2026-04-08-four-levels-of-ai-use — Level 4 is building custom tools; this article is how to improve them systematically. The eval loop is what separates Level 4 from guessing.
06-reference/2026-04-04-compound-engineering — The Compound Engineering loop (Plan > Work > Review > Compound) maps directly. Evals are the compound artifact for harness work — they make every future iteration faster and more reliable.
06-reference/2026-04-07-claude-code-architecture-teardown — The error recovery state machine (microcompact → snip → auto-compact → context collapse) is itself a kind of holdout test: the system was built knowing the harness would fail under certain conditions, and those failure modes drove the architecture. Same epistemology.
06-reference/2026-04-04-superpowers-plugin-analysis — The skill-creator plugin implements a version of this loop. Grader-critiques-evals pattern; 60/40 train/test split for description optimization; baseline comparison baked in. The article gives the theory; the plugin gives a working implementation.
SOUL.md — The RDCO operating model already has the compound engineering loop in its DNA: skills, loops, dedicated instances, agent teams. The eval discipline is what closes the loop from “built a skill” to “demonstrably improved a skill.”
06-reference/2026-04-04-recursive-self-improvement-marketing — The recursive self-improvement loop (generate → evaluate → diagnose → rewrite → repeat) is the same hill-climbing logic applied to marketing output. Evals in that context are the scoring criteria against which each revision is measured.