
paper arxiv 2604 08224 agent harness study 2026 04 12

Sat Apr 11 2026 20:00:00 GMT-0400 (Eastern Daylight Time) · reference · source: arXiv preprint · by Chenyu Zhou, Huacan Chai, Wenteng Chen, Zihan Guo, Rong Shan, Yuanyi Song, Tianyi Xu, Yingxuan Yang, Aofan Yu, Weiming Zhang, Congming Zheng, Jiachen Zhu, Zeyu Zheng, Zhuosheng Zhang, Xingyu Lou, Changwang Zhang, Zhihui Fu, Jun Wang, Weiwen Liu, Jianghao Lin, Weinan Zhang

“Externalization in LLM Agents”: Zhou et al. (arXiv 2604.08224)

Why this is in the vault

Academic validation of the harness thesis. This survey traces the same historical progression that practitioners such as Garry Tan, Harrison Chase, and Cobus Greyling describe from experience, but formalizes it with a structured taxonomy. Its framing of “externalization” (capabilities moving from inside the model to the runtime around it) is the cleanest academic articulation of why harness engineering is now the dominant concern.

Paper details

Title: Externalization in LLM Agents: A Unified Review of Memory, Skills, Protocols and Harness Engineering

Key claim: LLM agents are increasingly built less by changing model weights than by reorganizing the runtime around them. Capabilities that earlier systems expected the model to recover internally are now externalized into four categories:

  1. Memory stores — persistent state that outlives a single call
  2. Reusable skills — packaged procedures the agent can invoke
  3. Interaction protocols — standards for how agents communicate with tools, users, and each other
  4. Harness engineering — the surrounding program that orchestrates model calls, manages context, and enforces safety
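To make the four categories concrete, here is a minimal Python sketch of how they might fit together in one runtime. It is an illustration under stated assumptions, not code from the paper: every name (`MemoryStore`, `ToolCall`, `Harness`, `stub_model`, the `shout` skill) is hypothetical, and the "model" is a stub that always requests the same skill.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

# All names below are hypothetical illustrations, not taken from the paper.


@dataclass
class MemoryStore:
    """Memory store: persistent state that outlives a single model call."""
    notes: List[str] = field(default_factory=list)

    def remember(self, note: str) -> None:
        self.notes.append(note)

    def recall(self, limit: int = 3) -> List[str]:
        # Return the most recent notes, oldest first.
        return self.notes[-limit:]


# Reusable skill: a packaged procedure the agent can invoke by name.
Skill = Callable[[str], str]


@dataclass
class ToolCall:
    """Interaction protocol: the agreed shape of a request from the model."""
    skill: str
    argument: str


def stub_model(prompt: str) -> ToolCall:
    """Stand-in for an LLM call; always asks for the 'shout' skill."""
    task = prompt.rsplit("\n", 1)[-1]  # last line of the prompt is the task
    return ToolCall(skill="shout", argument=task)


@dataclass
class Harness:
    """Harness: orchestrates the model call, manages context, enforces safety."""
    model: Callable[[str], ToolCall]
    memory: MemoryStore
    skills: Dict[str, Skill]

    def run(self, task: str) -> str:
        # Manage context: put recalled memory ahead of the new task.
        prompt = "\n".join(self.memory.recall() + [task])
        call = self.model(prompt)
        # Enforce safety: only registered skills may run.
        if call.skill not in self.skills:
            raise PermissionError(f"skill not allowed: {call.skill}")
        result = self.skills[call.skill](call.argument)
        # Persist the outcome so later calls can see it.
        self.memory.remember(f"{task} -> {result}")
        return result


harness = Harness(model=stub_model, memory=MemoryStore(), skills={"shout": str.upper})
```

Calling `harness.run("hello")` routes the task through the stub model, the allow-list check, and the `shout` skill, then records the outcome in memory. In this sketch the model is just one replaceable component; the behavior an operator would tune lives in the code around it, which is the pattern the paper calls externalization.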

Core contribution

The paper positions these four categories as interconnected forms of the same underlying trend, externalization, and traces the historical progression that led to it.

It analyzes trade-offs between parametric (internal) and externalized capability, and identifies emerging directions including self-evolving harnesses and shared agent infrastructure.

Assessment

Strengths:

Limitations:

Bias flags: None obvious. Academic survey, no commercial affiliation declared in the author list.

RDCO mapping