“How We Build Evals for Deep Agents” — Mason Daugherty
Why this is in the vault
LangChain’s eval methodology for Deep Agents, their open-source agent harness. This matters for RDCO because we’re building an /improve skill that does qualitative diagnosis and want to integrate quantitative evals. The article lays out a framework for targeted agent evaluation that parallels — and in some areas extends — the eval capabilities we analyzed in the Anthropic skill-creator plugin.
Core arguments
1. Targeted evals over benchmark accumulation
The central thesis: more evals do not produce better agents. Each eval is a directional vector that shapes agent behavior over time, so poorly chosen evals push the system in unproductive directions. The article advocates designing evals that directly measure desired production behaviors rather than chasing generic benchmark coverage.
2. Four-source eval pipeline
Daugherty describes four sources for evaluation cases: (1) dogfooding — using the agent internally and cataloging failure modes via LangSmith traces, (2) trace analysis using specialized agents (Polly, Insights) to identify patterns in production data, (3) adapting external benchmarks (Terminal Bench 2.0, BFCL, FRAMES, Harbor), and (4) hand-crafted “artisanal” evals for specific behavioral targets. Cases are then categorized by capability tested — file operations, retrieval, tool use, memory, conversation, summarization, unit tests — rather than by source.
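A minimal sketch of how a case catalog organized this way might be represented. The dataclass, field names, and enum values below are my own illustration, not structures from the article; the point is that capability is the primary grouping axis while source is kept only as provenance.

```python
from dataclasses import dataclass, field
from enum import Enum


class Source(Enum):
    DOGFOOD = "dogfood"
    TRACE_ANALYSIS = "trace_analysis"
    EXTERNAL_BENCHMARK = "external_benchmark"
    ARTISANAL = "artisanal"


class Capability(Enum):
    FILE_OPS = "file_operations"
    RETRIEVAL = "retrieval"
    TOOL_USE = "tool_use"
    MEMORY = "memory"
    CONVERSATION = "conversation"
    SUMMARIZATION = "summarization"
    UNIT_TESTS = "unit_tests"


@dataclass
class EvalCase:
    case_id: str
    prompt: str
    capability: Capability                      # primary axis: what the case tests
    source: Source                              # provenance, kept for auditing rather than grouping
    expected: str = ""                          # reference answer or rubric, if any
    tags: list[str] = field(default_factory=list)
```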
3. Efficiency metrics alongside correctness
The most technically interesting contribution. Beyond binary correctness, they define an “ideal trajectory” — the minimum sequence of steps that produces a correct outcome — and measure the agent against it using three ratios: step ratio (observed / ideal steps), tool call ratio (observed / ideal tool calls), and latency ratio (observed / ideal time). A composite “solve rate” metric normalizes expected steps by observed latency, yielding a single efficiency number. This addresses the real-world problem where two models can both solve a task but one burns five times the tokens doing it.
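A sketch of how these ratios might be computed, assuming you have step, tool-call, and latency figures for both the observed run and a hand-specified ideal trajectory. The composite "solve rate" formula is not reproduced here since this summary does not spell it out; the helper below covers only the three ratios.

```python
from dataclasses import dataclass


@dataclass
class Trajectory:
    steps: int          # agent loop iterations
    tool_calls: int     # tool invocations made
    latency_s: float    # wall-clock seconds
    correct: bool = True


def efficiency_ratios(observed: Trajectory, ideal: Trajectory) -> dict[str, float]:
    """Compare an observed run against the hand-specified ideal trajectory.

    Ratios of 1.0 mean the agent matched the minimum path; higher means waste.
    """
    return {
        "step_ratio": observed.steps / ideal.steps,
        "tool_call_ratio": observed.tool_calls / ideal.tool_calls,
        "latency_ratio": observed.latency_s / ideal.latency_s,
    }


# Example: both runs solve the task, but one burns twice the steps and 5x the time.
ideal = Trajectory(steps=4, tool_calls=3, latency_s=12.0)
observed = Trajectory(steps=8, tool_calls=7, latency_s=60.0)
print(efficiency_ratios(observed, ideal))
# {'step_ratio': 2.0, 'tool_call_ratio': 2.33, 'latency_ratio': 5.0}
```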
4. CI-integrated execution
Evals run via pytest on GitHub Actions. Tag-based filtering lets developers run subsets for targeted experiments rather than the full suite on every commit.
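A hedged sketch of the tag-based filtering pattern using standard pytest markers. The marker names, case IDs, and run_agent_eval runner are placeholders of mine, not the actual Deep Agents suite; only the pytest mechanics (custom marks plus -m selection) are standard.

```python
# test_agent_evals.py -- each eval case is a pytest test tagged by capability.
# run_agent_eval is a hypothetical stand-in for whatever executes the agent
# and returns correctness plus the efficiency ratios described above.
from dataclasses import dataclass

import pytest


@dataclass
class EvalResult:
    correct: bool
    step_ratio: float


def run_agent_eval(case_id: str) -> EvalResult:
    raise NotImplementedError("placeholder: call the agent harness here")


@pytest.mark.retrieval
def test_finds_cited_document():
    result = run_agent_eval("retrieval/cited-doc-001")
    assert result.correct
    assert result.step_ratio <= 1.5   # gate on efficiency, not just correctness


@pytest.mark.tool_use
def test_single_file_write():
    result = run_agent_eval("tool-use/single-write-003")
    assert result.correct
```

With markers registered under `[tool.pytest.ini_options]` in `pyproject.toml`, `pytest -m retrieval` runs just that subset for a targeted experiment, while a scheduled GitHub Actions job can run the full suite.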
Assessment
Strengths:
- The ideal-trajectory / efficiency-ratio framework is concrete and implementable. It gives teams a way to compare models on cost and speed, not just accuracy.
- The four-source pipeline (dogfood, trace analysis, external benchmarks, artisanal) is a practical taxonomy for building eval suites.
- Categorizing evals by capability rather than origin is a smart organizational choice that prevents blind spots.
Bias flags:
- Commercial interest. Written by a LangChain employee about a LangChain product (Deep Agents), using LangChain’s observability platform (LangSmith) as the instrumentation layer. The methodology is sound, but every technical recommendation conveniently routes through LangChain’s product suite.
- Harrison Chase (LangChain CEO) is already in our vault with the “Your Harness, Your Memory” article — the strategic pattern is consistent: publish open-source methodology that happens to require LangChain tooling for full execution.
What it doesn’t cover: No discussion of qualitative diagnosis — why an agent fails, not just that it fails. The eval framework is purely quantitative measurement, with no feedback loop for improving the agent based on eval results.
RDCO mapping
- /improve skill integration: Their efficiency ratios (step, tool call, latency) map directly to the quantitative layer we want to bolt onto /improve's qualitative diagnosis. Our designed loop — diagnose, baseline, apply change, re-run, compare, commit if improved — could use their ideal-trajectory concept as the baseline anchor (see the sketch after this list).
- Skill-creator eval parallel: The Anthropic skill-creator plugin we analyzed does parallel A/B runs, quantitative assertions, benchmark aggregation, and blind comparison. LangChain's approach adds the efficiency-ratio dimension we hadn't considered — measuring not just whether a skill change improves correctness but whether it improves token/step efficiency.
- Chase continuity: This article operationalizes what Chase’s “Your Harness, Your Memory” argued architecturally. Chase said harnesses are permanent and growing; Daugherty shows the eval infrastructure that permanent harnesses require. The product narrative is coherent: open harness → observable via LangSmith → evaluated via this methodology.
- Sanity Check angle: The “every eval is a vector” framing is strong newsletter material. Pair with the skill-creator eval capabilities for a piece on building self-improving AI systems.
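A rough sketch of one pass of that /improve loop with the efficiency ratios bolted on. Every callable here (run_suite, diagnose, apply_change) and the SuiteResult fields are hypothetical placeholders for our own tooling, and the baseline is run before diagnosis so the qualitative layer has failures to read; the commit rule is simplified to "at least as good on both axes."

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class SuiteResult:
    solve_rate: float        # fraction of cases judged correct
    mean_step_ratio: float   # average observed/ideal steps across cases
    failures: list[str]      # failed case ids, fed to qualitative diagnosis


def improve_once(
    skill: str,
    run_suite: Callable[[str], SuiteResult],    # quantitative layer (their eval methodology)
    diagnose: Callable[[str, list[str]], str],  # qualitative layer (/improve today)
    apply_change: Callable[[str, str], str],    # produce a candidate skill revision
) -> str:
    """One pass of: baseline -> diagnose -> apply change -> re-run -> compare -> commit."""
    baseline = run_suite(skill)
    diagnosis = diagnose(skill, baseline.failures)
    candidate = apply_change(skill, diagnosis)
    rerun = run_suite(candidate)

    improved = (
        rerun.solve_rate >= baseline.solve_rate
        and rerun.mean_step_ratio <= baseline.mean_step_ratio
    )
    return candidate if improved else skill     # commit only when not worse on either axis
```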
Related
- 2026-04-12-harrison-chase-harness-blog
- 2026-04-11-garry-tan-thin-harness-fat-skills
- cross-check-agent-architecture