Latent Briefing — Ramp Labs on efficient multi-agent memory via KV cache compaction
Why this is in the vault
Founder flagged this as “good content for framing the agentic enablement and custom harness development.” Directly relevant to the 5-agent automated investing architecture we’re building toward, and to the broader Ray Data Co question of how to make multi-agent workflows cost-effective as they scale. The research is from Ramp (the spend management company), whose engineering blog publishes genuinely rigorous research rather than marketing.
The problem they solve
Multi-agent systems — where an orchestrator decomposes a task and calls worker agents — have a token explosion problem. Each call to a worker requires passing context. Verbose orchestrator reasoning accumulates across calls. The worker only sees what the orchestrator explicitly passes it, often a narrow slice of context that misses important cross-reference information the orchestrator has already discovered.
Existing fixes all have trade-offs:
- LLM summarization: 20-60s latency per step, lossy, summary may not capture what the subtask needs
- RAG / retrieval: requires chunking and embedding, misses cross-chunk dependencies
- Pass everything: expensive, slow, accuracy degrades with irrelevant context
What they built
Latent Briefing: operate directly on the worker model’s KV cache rather than on text. When the orchestrator calls the worker, the worker’s forward pass computes attention scores between the orchestrator’s task prompt and the accumulated trajectory. Those scores identify which parts of the trajectory the worker considers relevant to this specific task. The irrelevant parts are discarded at the representation level before the worker generates its answer.
Three key modifications to the existing Attention Matching (AM) compaction framework:
- Task-guided query vectors. Instead of sampling queries from the context itself, they use queries derived from the orchestrator’s task prompt, so compaction preserves the parts of the trajectory most relevant to this particular worker call, not just the parts that are generally important.
- Shared token selection via global scoring. Instead of each attention head independently selecting its own top-k keys (which blocks GPU batching), they aggregate scores across all layers and heads into a single per-position relevance score. A single shared mask lets them batch the whole selection into one tensor operation.
- MAD-normalized thresholding. Instead of a fixed top-k, they use a statistically derived cutoff based on the median absolute deviation: keep every position that scores above `median + t * MAD`. This is more robust to outliers than top-k, and the threshold parameter t naturally controls aggressiveness across different context lengths. (A minimal sketch of all three modifications follows this list.)
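To make the three modifications concrete, here is a minimal PyTorch sketch of the combined mechanism. This is a reconstruction from the post’s description, not Ramp’s code: all names, shapes, and the aggregation choice (a plain mean over layers, heads, and query tokens) are assumptions.

```python
import torch

def compact_kv_cache(keys, values, task_queries, t=1.0):
    """Task-guided KV cache compaction (hypothetical reconstruction).

    keys, values: [layers, heads, seq_len, head_dim] cached trajectory
    task_queries: [layers, heads, q_len, head_dim] query vectors derived
        from the orchestrator's task prompt (modification 1)
    t: threshold parameter controlling compaction aggressiveness
    """
    # Attention scores between task-prompt queries and trajectory keys
    # (softmax omitted for brevity).
    scores = torch.einsum("lhqd,lhkd->lhqk", task_queries, keys)

    # Modification 2: aggregate over layers, heads, and query tokens into
    # one per-position relevance score, enabling a single shared mask.
    relevance = scores.mean(dim=(0, 1, 2))  # [seq_len]

    # Modification 3: MAD-normalized threshold instead of a fixed top-k.
    median = relevance.median()
    mad = (relevance - median).abs().median()
    keep = relevance > median + t * mad  # boolean mask shared by all heads

    # Discard irrelevant positions at the representation level.
    return keys[..., keep, :], values[..., keep, :]
```

The single shared `keep` mask is the design choice that matters: every head retains the same positions, so the compacted cache stays rectangular and the whole operation batches.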
Results on LongBench v2
126 questions, Claude Sonnet 4 as orchestrator + Qwen3-14B as worker:
- Up to 49% median token savings on medium-length (32k-100k) documents
- 65% reduction in worker model token consumption at best threshold
- +3 percentage points accuracy over baseline at optimal compaction (doesn’t hurt accuracy, slightly improves it)
- ~1.7s median compaction overhead, scaling linearly with trajectory length — small fraction of overall call cost
The original AM framework took 30+ seconds to compact a cache (sequential per-head processing on an A100). Ramp’s batched approach brings it down to ~1.7s median, making it viable for real-time agent workloads.
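A toy contrast of the two scoring strategies, with made-up shapes (the timings are the post’s; this code is not):

```python
import torch

L, H, Q, K, D = 4, 8, 16, 2048, 64  # toy: layers, heads, task tokens, cache length, head dim
queries = torch.randn(L, H, Q, D)
keys = torch.randn(L, H, K, D)

# Original AM pattern: sequential per-head scoring, many small matmuls
# (reportedly 30+ seconds on an A100 at real cache sizes).
scores_loop = torch.stack([
    torch.stack([queries[l, h] @ keys[l, h].T for h in range(H)])
    for l in range(L)
])

# Ramp's variant: one fused op across all layers and heads, the kind of
# change that brings compaction down to ~1.7s median.
scores_fused = torch.einsum("lhqd,lhkd->lhqk", queries, keys)

assert torch.allclose(scores_loop, scores_fused, atol=1e-3)
```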
Threshold behaves differently across regimes
An interesting empirical finding about when to compact aggressively vs. lightly (a toy policy sketch follows the list):
- Longer documents → lighter compaction wins (t=-1.0, 18% compaction). Longer docs have more dispersed information; light compaction preserves broad coverage.
- Harder questions → aggressive compaction wins (t=2.0, 79% compaction). Hard questions cause the orchestrator to explore many hypotheses, generating speculative reasoning that dilutes the worker’s signal. Aggressive compaction acts as a relevance filter.
- Short, easy documents → moderate compaction (t=1.0, 68%). Orchestrator trajectory is already focused.
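If we adopt this, the regimes above suggest a simple starting policy. A hypothetical sketch only: the cutoffs are LongBench v2 observations, not a rule the authors prescribe.

```python
def pick_threshold(doc_tokens: int, question_is_hard: bool) -> float:
    """Hypothetical threshold policy derived from the post's regime findings."""
    if doc_tokens > 100_000:
        return -1.0  # long docs: dispersed info, compact lightly (~18%)
    if question_is_hard:
        return 2.0   # hard questions: filter out speculative reasoning (~79%)
    return 1.0       # short/easy: trajectory already focused (~68%)
```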
The author’s analogy: sometimes you’re building a body of knowledge where details accumulate into something larger (keep them); sometimes you’re sketching and most of what gets written isn’t meant to last (aggressive compaction).
Limitations (author’s own)
- Orchestrator variance. Claude Sonnet 4 is non-deterministic, so it produces different decomposition strategies across runs for the same question. With n=42 per condition, individual results are noisy.
- Single benchmark. Only tested on LongBench v2. Other task types (code gen, multi-document synthesis, math) may have different attention patterns.
Why it matters for Ray Data Co
For the automated investing 5-agent vision: as we grow from a single-process single-agent setup into a Strategy Research / Paper Testing / Execution / Monitor / Reporting split, cross-agent context cost will become a bottleneck. The founder’s concern about margins being eaten by API costs applies here too — if each strategy research cycle requires passing the entire project history between agents, token cost grows quadratically in session length. Latent Briefing is the kind of primitive that would let us scale multi-agent workflows without the cost explosion.
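Rough arithmetic behind the quadratic claim (all numbers invented for illustration): if every call re-passes the full history and the history grows by a fixed amount per call, cumulative tokens scale with the square of the call count.

```python
growth_per_call = 2_000  # assumed: tokens added to the shared history per call
calls = 50               # assumed: agent calls in one strategy research cycle

# Call i re-passes the entire history accumulated so far.
total = sum(growth_per_call * i for i in range(1, calls + 1))
print(f"{total:,} tokens")  # 2,550,000: roughly growth * k^2 / 2, quadratic in k
```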
For the broader Ray Data Co thesis of “building products for agents”: this is a primitive that agent harness vendors will want. If we end up building infrastructure for other people’s agents (MCP servers, harness plugins), efficient cross-agent memory sharing is a real need. Worth tracking as a building block.
Practical caveats before we’d adopt it:
- Requires KV cache access at the model level — only works with open models we run ourselves (Qwen3-14B in their setup), not with hosted APIs like Claude’s current endpoint
- Inference infrastructure requirement (GPU with enough memory for the KV cache) is a meaningful operational cost
- The technique is ~1 month old; no third-party replications yet
Alternatives already in use
The pattern in Claude Code / autoinv today is lightweight bridge notes (see the ../.claude-code/state/working-context.md pattern): Ray maintains a small markdown scratchpad that survives compaction via PreCompact/SessionStart hooks. That’s a primitive version of the same idea (preserve the relevant context across session boundaries) but at the text level, not the KV cache level. It works because we’re orchestrator + human, not orchestrator + worker agents.
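For reference, a hypothetical version of the PreCompact side of that hook. The stdin JSON shape and field names are assumptions about Claude Code’s hook interface, not verified against the docs:

```python
#!/usr/bin/env python3
"""Hypothetical PreCompact hook: stamp the bridge note before compaction so a
SessionStart hook can feed it back in. Field names are assumptions."""
import json
import sys
from datetime import datetime, timezone
from pathlib import Path

NOTE = Path(".claude-code/state/working-context.md")  # the scratchpad path

payload = json.load(sys.stdin)  # assumed: hooks receive context as JSON on stdin
NOTE.parent.mkdir(parents=True, exist_ok=True)
with NOTE.open("a") as f:
    f.write(f"\n<!-- compacted {datetime.now(timezone.utc).isoformat()} "
            f"session={payload.get('session_id', '?')} -->\n")
```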
Related
- ../01-projects/automated-investing/architecture-vision — the 5-agent target this paper informs
- 2026-04-10-akshay-pachaar-agent-harness-anatomy — companion article on agent harness architecture
- ../01-projects/automated-investing/experiments/consolidation-pass — the single-threaded starting point we’re building on
- Paper: https://arxiv.org/abs/2512.24601
Tracked author
../03-contacts — consider adding Ben Geist (@b_geist) and Ramp Labs (@RampLabs) to the CRM when we open that task (#4). Ramp Labs is publishing substantive research, worth following.