06-reference

langchain evals deep agents

Wed Mar 25 2026 20:00:00 GMT-0400 (Eastern Daylight Time) · reference · source: LangChain Blog · by Mason Daugherty (LangChain)

“How We Build Evals for Deep Agents” — Mason Daugherty

Why this is in the vault

LangChain’s eval methodology for Deep Agents, their open-source agent harness. This matters for RDCO because we’re building an /improve skill that does qualitative diagnosis, and we want to integrate quantitative evals. The article lays out a framework for targeted agent evaluation that parallels — and in some areas extends — the eval capabilities we analyzed in the Anthropic skill-creator plugin.

Core arguments

1. Targeted evals over benchmark accumulation

The central thesis: more evals do not produce better agents. Each eval is a directional vector that shapes agent behavior over time, so poorly chosen evals push the system in unproductive directions. The article advocates designing evals that directly measure desired production behaviors rather than chasing generic benchmark coverage.

2. Four-source eval pipeline

Daugherty describes four sources for evaluation cases: (1) dogfooding — using the agent internally and cataloging failure modes via LangSmith traces, (2) trace analysis using specialized agents (Polly, Insights) to identify patterns in production data, (3) adapting external benchmarks (Terminal Bench 2.0, BFCL, FRAMES, Harbor), and (4) hand-crafted “artisanal” evals for specific behavioral targets. Cases are then categorized by capability tested — file operations, retrieval, tool use, memory, conversation, summarization, unit tests — rather than by source.
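A rough sketch of how that source-plus-capability tagging could be represented, purely for illustration: the `EvalCase` dataclass, the `Source` and `Capability` enums, and all field names are my assumptions, not the article's actual schema.

```python
# Hypothetical representation of eval cases tagged by source and capability.
# Names are assumptions for illustration, not the article's schema.
from dataclasses import dataclass
from enum import Enum


class Source(Enum):
    DOGFOODING = "dogfooding"
    TRACE_ANALYSIS = "trace_analysis"
    EXTERNAL_BENCHMARK = "external_benchmark"
    ARTISANAL = "artisanal"


class Capability(Enum):
    FILE_OPERATIONS = "file_operations"
    RETRIEVAL = "retrieval"
    TOOL_USE = "tool_use"
    MEMORY = "memory"
    CONVERSATION = "conversation"
    SUMMARIZATION = "summarization"
    UNIT_TESTS = "unit_tests"


@dataclass
class EvalCase:
    name: str
    prompt: str
    source: Source          # where the case came from
    capability: Capability  # what it measures; the grouping used for reporting


def by_capability(cases: list[EvalCase]) -> dict[Capability, list[EvalCase]]:
    """Group cases by capability tested rather than by source."""
    grouped: dict[Capability, list[EvalCase]] = {}
    for case in cases:
        grouped.setdefault(case.capability, []).append(case)
    return grouped
```

Grouping by capability rather than by source keeps reporting focused on what the agent can actually do, independent of where a given case originated.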

3. Efficiency metrics alongside correctness

The most technically interesting contribution. Beyond binary correctness, they define an “ideal trajectory” — the minimum sequence of steps that produces a correct outcome — and measure the agent against it using three ratios: step ratio (observed / ideal steps), tool call ratio (observed / ideal tool calls), and latency ratio (observed / ideal time). A composite “solve rate” metric normalizes expected steps by observed latency, yielding a single efficiency number. This addresses the real-world problem where two models can both solve a task but one burns five times the tokens doing it.
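A minimal sketch of the three ratios, assuming a simple trajectory record. The `Trajectory` fields and the example numbers are hypothetical; the composite "solve rate" is not reproduced because its exact formula isn't restated in this note.

```python
# Sketch of observed-vs-ideal efficiency ratios; field names are assumptions.
from dataclasses import dataclass


@dataclass
class Trajectory:
    steps: int        # agent turns taken
    tool_calls: int   # tool invocations made
    latency_s: float  # wall-clock time in seconds


def efficiency_ratios(observed: Trajectory, ideal: Trajectory) -> dict[str, float]:
    """Observed-over-ideal ratios; 1.0 means the agent matched the ideal trajectory."""
    return {
        "step_ratio": observed.steps / ideal.steps,
        "tool_call_ratio": observed.tool_calls / ideal.tool_calls,
        "latency_ratio": observed.latency_s / ideal.latency_s,
    }


# Example: a correct run that took twice as many steps as the ideal trajectory.
observed = Trajectory(steps=12, tool_calls=8, latency_s=95.0)
ideal = Trajectory(steps=6, tool_calls=4, latency_s=40.0)
print(efficiency_ratios(observed, ideal))
# {'step_ratio': 2.0, 'tool_call_ratio': 2.0, 'latency_ratio': 2.375}
```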

4. CI-integrated execution

Evals run via pytest on GitHub Actions. Tag-based filtering lets developers run subsets for targeted experiments rather than the full suite on every commit.
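A hedged illustration of what tag-based filtering with pytest markers might look like; the marker names and the `run_agent` stub are assumptions for illustration, not LangChain's actual suite.

```python
# Illustrative tag-based eval filtering with pytest markers.
# Markers would be registered in pyproject.toml or pytest.ini to avoid warnings.
import pytest


def run_agent(prompt: str) -> str:
    # Stand-in for the real agent harness; in CI this would invoke the Deep Agent.
    return f"stubbed answer for: {prompt}"


@pytest.mark.retrieval
def test_retrieval_case():
    answer = run_agent("What does the design doc say about memory?")
    assert answer  # a real suite would score against a reference trajectory


@pytest.mark.tool_use
def test_tool_use_case():
    answer = run_agent("Create notes.md and summarize it.")
    assert answer
```

A targeted experiment then selects a subset with something like `pytest -m retrieval`, leaving the full suite for broader runs rather than every commit.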

Assessment

Strengths:

Bias flags:

What it doesn’t cover: no discussion of qualitative diagnosis — why an agent fails, not just that it fails. The eval framework is purely quantitative, with no feedback loop for improving the agent based on eval results.

RDCO mapping