“Externalization in LLM Agents” (Zhou et al., arXiv 2604.08224)

Why this is in the vault

Academic validation of the harness thesis. This survey paper traces the same historical progression that practitioners like Garry Tan, Harrison Chase, and Cobus Greyling describe from experience, but formalizes it with a structured taxonomy. The paper’s framing of “externalization” (capabilities moving from inside the model to the runtime around it) is the cleanest academic articulation of why harness engineering has become the dominant concern in building agents.

Paper details

Title: Externalization in LLM Agents: A Unified Review of Memory, Skills, Protocols and Harness Engineering

Key claim: LLM agents are increasingly built less by changing model weights than by reorganizing the runtime around them. Capabilities that earlier systems expected the model to recover internally are now externalized into four categories:

Memory stores — persistent state that outlives a single call
Reusable skills — packaged procedures the agent can invoke
Interaction protocols — standards for how agents communicate with tools, users, and each other
Harness engineering — the surrounding program that orchestrates model calls, manages context, and enforces safety
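
To make the four categories concrete, here is a minimal sketch of how they might fit together in code. It is not taken from the paper: every name in it (MemoryStore, ToolCall, Harness, stub_model, the "shout" skill) is a hypothetical illustration, and the model is a stub rather than a real LLM call.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class MemoryStore:
    """Memory store: persistent state that outlives a single model call."""
    notes: List[str] = field(default_factory=list)

    def remember(self, note: str) -> None:
        self.notes.append(note)

    def recall(self) -> str:
        return "\n".join(self.notes)


@dataclass
class ToolCall:
    """Interaction protocol: a fixed message shape for invoking a skill."""
    skill: str
    argument: str


def stub_model(prompt: str) -> ToolCall:
    """Hypothetical stand-in for an LLM: always requests the 'shout' skill
    on the last line of the prompt (the current user input)."""
    last_line = prompt.splitlines()[-1]
    return ToolCall(skill="shout", argument=last_line)


class Harness:
    """Harness: builds context, calls the model, and enforces a skill allowlist."""

    def __init__(
        self,
        model: Callable[[str], ToolCall],
        skills: Dict[str, Callable[[str], str]],
        memory: MemoryStore,
    ) -> None:
        self.model = model
        self.skills = skills  # reusable skills, keyed by name
        self.memory = memory

    def step(self, user_input: str) -> str:
        # Context management: prepend remembered notes to the new input.
        prompt = self.memory.recall() + "\n" + user_input
        call = self.model(prompt)
        # Safety check: refuse any skill that was not registered.
        if call.skill not in self.skills:
            raise ValueError(f"skill not allowed: {call.skill}")
        result = self.skills[call.skill](call.argument)
        self.memory.remember(f"{user_input} -> {result}")
        return result


if __name__ == "__main__":
    harness = Harness(stub_model, {"shout": str.upper}, MemoryStore())
    print(harness.step("hello"))
    print(harness.memory.notes)
```

In this sketch the model itself never changes; all of the behavior that matters (what is remembered, which skills exist, what message shape is accepted, what is refused) lives in the code around it, which is the point of the externalization framing.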

Core contribution

The paper positions these four categories as interconnected forms of the same underlying trend: externalization. It traces a historical progression:

Weights — early approach: bake capability into model parameters
Context — middle period: feed capability in via prompts and retrieval
Harness — current: build capability into the orchestration layer

It analyzes trade-offs between parametric (internal) and externalized capability, and identifies emerging directions including self-evolving harnesses and shared agent infrastructure.

Assessment

Strengths:

Provides a unified vocabulary across memory, skills, protocols, and harness — useful for Sanity Check content that needs precise terminology
The externalization framing is elegant and maps cleanly to practitioner language
Large author team suggests broad literature coverage

Limitations:

Survey papers by nature lag practitioner reality; the taxonomy may already be incomplete given how fast the harness space moves
No original experiments — this is synthesis, not new evidence

Bias flags: None obvious. Academic survey, no commercial affiliation declared in the author list.

RDCO mapping

Sanity Check utility: Use as the academic spine for “The Harness Era” article. The externalization framing is more precise than “things moved outward” — it names what moved and why.
Vocabulary alignment: The paper’s four-category taxonomy (memory, skills, protocols, harness) maps almost exactly to Garry Tan’s framework and to the RDCO agent architecture.
Cross-reference: Greyling’s three-layer timeline (weights/context/harness) appears to be derived from or inspired by this paper’s historical analysis.