Latent Briefing — Ramp Labs on efficient multi-agent memory via KV cache compaction
Why this is in the vault
Founder flagged this as “good content for framing the agentic enablement and custom harness development.” Directly relevant to the 5-agent automated investing architecture we’re building toward, and to the broader Ray Data Co question of how to make multi-agent workflows cost-effective as they scale. The research is from Ramp (the spend management company), whose engineering blog publishes genuinely rigorous research rather than marketing.
The problem they solve
Multi-agent systems — where an orchestrator decomposes a task and calls worker agents — have a token explosion problem. Each call to a worker requires passing context. Verbose orchestrator reasoning accumulates across calls. The worker only sees what the orchestrator explicitly passes it, often a narrow slice of context that misses important cross-reference information the orchestrator has already discovered.
Existing fixes all have trade-offs:
- LLM summarization: 20-60s latency per step, lossy, summary may not capture what the subtask needs
- RAG / retrieval: requires chunking and embedding, misses cross-chunk dependencies
- Pass everything: expensive, slow, accuracy degrades with irrelevant context
What they built
Latent Briefing: operate directly on the worker model’s KV cache rather than on text. When the orchestrator calls the worker, the worker’s forward pass computes attention scores between the orchestrator’s task prompt and the accumulated trajectory. Those scores identify which parts of the trajectory the worker considers relevant to this specific task. The irrelevant parts are discarded at the representation level before the worker generates its answer.
Three key modifications to the existing Attention Matching (AM) compaction framework:
- Task-guided query vectors. Instead of sampling queries from the context itself, they use queries derived from the orchestrator’s task prompt, so compaction preserves the parts of the trajectory most relevant to this particular worker call, not just the parts that are generally important.
- Shared token selection via global scoring. Instead of each attention head independently selecting its own top-k keys (which blocks GPU batching), they aggregate scores across all layers and heads into a single per-position relevance score. A single shared mask lets them batch the whole selection into one tensor operation.
- MAD-normalized thresholding. Instead of a fixed top-k, they use a statistically derived cutoff based on the median absolute deviation: keep every position that scores above `median + t * MAD`. This is more robust to outliers than top-k, and the threshold parameter t naturally controls aggressiveness across different context lengths. (A minimal sketch of all three modifications follows this list.)
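To make the three modifications concrete, here is a minimal PyTorch sketch of the combined mechanism. This is a reconstruction from the post’s description, not Ramp’s code: all names, shapes, and the aggregation choice (a plain mean over layers, heads, and query tokens) are assumptions.

```python
import torch

def compact_kv_cache(keys, values, task_queries, t=1.0):
    """Task-guided KV cache compaction (hypothetical reconstruction).

    keys, values: [layers, heads, seq_len, head_dim] cached trajectory
    task_queries: [layers, heads, q_len, head_dim] query vectors derived
        from the orchestrator's task prompt (modification 1)
    t: threshold parameter controlling compaction aggressiveness
    """
    # Attention scores between task-prompt queries and trajectory keys
    # (softmax omitted for brevity).
    scores = torch.einsum("lhqd,lhkd->lhqk", task_queries, keys)

    # Modification 2: aggregate over layers, heads, and query tokens into
    # one per-position relevance score, enabling a single shared mask.
    relevance = scores.mean(dim=(0, 1, 2))  # [seq_len]

    # Modification 3: MAD-normalized threshold instead of a fixed top-k.
    median = relevance.median()
    mad = (relevance - median).abs().median()
    keep = relevance > median + t * mad  # boolean mask shared by all heads

    # Discard irrelevant positions at the representation level.
    return keys[..., keep, :], values[..., keep, :]
```

The single shared `keep` mask is the design choice that matters: every head retains the same positions, so the compacted cache stays rectangular and the whole operation batches.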
Results on LongBench v2
126 questions, Claude Sonnet 4 as orchestrator + Qwen3-14B as worker:
- Up to 49% median token savings on medium-length (32k-100k) documents
- 65% reduction in worker model token consumption at best threshold
- +3 percentage points accuracy over baseline at optimal compaction (doesn’t hurt accuracy, slightly improves it)
- ~1.7s median compaction overhead, scaling linearly with trajectory length — small fraction of overall call cost
The original AM framework took 30+ seconds to compact a cache (sequential per-head processing on an A100). Ramp’s batched approach brings it down to ~1.7s median, making it viable for real-time agent workloads.
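A toy contrast of the two scoring strategies, with made-up shapes (the timings are the post’s; this code is not):

```python
import torch

L, H, Q, K, D = 4, 8, 16, 2048, 64  # toy: layers, heads, task tokens, cache length, head dim
queries = torch.randn(L, H, Q, D)
keys = torch.randn(L, H, K, D)

# Original AM pattern: sequential per-head scoring, many small matmuls
# (reportedly 30+ seconds on an A100 at real cache sizes).
scores_loop = torch.stack([
    torch.stack([queries[l, h] @ keys[l, h].T for h in range(H)])
    for l in range(L)
])

# Ramp's variant: one fused op across all layers and heads, the kind of
# change that brings compaction down to ~1.7s median.
scores_fused = torch.einsum("lhqd,lhkd->lhqk", queries, keys)

assert torch.allclose(scores_loop, scores_fused, atol=1e-3)
```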
Threshold behaves differently across regimes
An interesting empirical finding about when to compact aggressively vs. lightly (a toy policy sketch follows the list):
- Longer documents → lighter compaction wins (t=-1.0, 18% compaction). Longer docs have more dispersed information; light compaction preserves broad coverage.
- Harder questions → aggressive compaction wins (t=2.0, 79% compaction). Hard questions cause the orchestrator to explore many hypotheses, generating speculative reasoning that dilutes the worker’s signal. Aggressive compaction acts as a relevance filter.
- Short, easy documents → moderate compaction (t=1.0, 68%). Orchestrator trajectory is already focused.
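If we adopt this, the regimes above suggest a simple starting policy. A hypothetical sketch only: the cutoffs are LongBench v2 observations, not a rule the authors prescribe.

```python
def pick_threshold(doc_tokens: int, question_is_hard: bool) -> float:
    """Hypothetical threshold policy derived from the post's regime findings."""
    if doc_tokens > 100_000:
        return -1.0  # long docs: dispersed info, compact lightly (~18%)
    if question_is_hard:
        return 2.0   # hard questions: filter out speculative reasoning (~79%)
    return 1.0       # short/easy: trajectory already focused (~68%)
```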
The author’s analogy: sometimes you’re building a body of knowledge where details accumulate into something larger (keep them); sometimes you’re sketching and most of what gets written isn’t meant to last (aggressive compaction).
Limitations (author’s own)
- Orchestrator variance. Claude Sonnet 4 is non-deterministic, so it produces different decomposition strategies across runs for the same question. With n=42 per condition, individual results are noisy.
- Single benchmark. Only tested on LongBench v2. Other task types (code gen, multi-document synthesis, math) may have different attention patterns.
Why it matters for Ray Data Co
For the automated investing 5-agent vision: as we grow from a single-process single-agent setup into a Strategy Research / Paper Testing / Execution / Monitor / Reporting split, cross-agent context cost will become a bottleneck. The founder’s concern about margins being eaten by API costs applies here too — if each strategy research cycle requires passing the entire project history between agents, token cost grows quadratically in session length. Latent Briefing is the kind of primitive that would let us scale multi-agent workflows without the cost explosion.
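Rough arithmetic behind the quadratic claim (all numbers invented for illustration): if every call re-passes the full history and the history grows by a fixed amount per call, cumulative tokens scale with the square of the call count.

```python
growth_per_call = 2_000  # assumed: tokens added to the shared history per call
calls = 50               # assumed: agent calls in one strategy research cycle

# Call i re-passes the entire history accumulated so far.
total = sum(growth_per_call * i for i in range(1, calls + 1))
print(f"{total:,} tokens")  # 2,550,000: roughly growth * k^2 / 2, quadratic in k
```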
For the broader Ray Data Co thesis of “building products for agents”: this is a primitive that agent harness vendors will want. If we end up building infrastructure for other people’s agents (MCP servers, harness plugins), efficient cross-agent memory sharing is a real need. Worth tracking as a building block.
Practical caveats before we’d adopt it:
- Requires KV cache access at the model level — only works with open models we run ourselves (Qwen3-14B in their setup), not with hosted APIs like Claude’s current endpoint
- Inference infrastructure requirement (GPU with enough memory for the KV cache) is a meaningful operational cost
- The technique is ~1 month old; no third-party replications yet
Alternatives already in use
The pattern in Claude Code / autoinv today is lightweight bridge notes (see the ../.claude-code/state/working-context.md pattern): Ray maintains a small markdown scratchpad that survives compaction via PreCompact/SessionStart hooks. That’s a primitive version of the same idea (preserve the relevant context across session boundaries) but at the text level, not the KV cache level. It works because we’re orchestrator + human, not orchestrator + worker agents.
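For reference, a hypothetical version of the PreCompact side of that hook. The stdin JSON shape and field names are assumptions about Claude Code’s hook interface, not verified against the docs:

```python
#!/usr/bin/env python3
"""Hypothetical PreCompact hook: stamp the bridge note before compaction so a
SessionStart hook can feed it back in. Field names are assumptions."""
import json
import sys
from datetime import datetime, timezone
from pathlib import Path

NOTE = Path(".claude-code/state/working-context.md")  # the scratchpad path

payload = json.load(sys.stdin)  # assumed: hooks receive context as JSON on stdin
NOTE.parent.mkdir(parents=True, exist_ok=True)
with NOTE.open("a") as f:
    f.write(f"\n<!-- compacted {datetime.now(timezone.utc).isoformat()} "
            f"session={payload.get('session_id', '?')} -->\n")
```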
Related
- ../01-projects/automated-investing/architecture-vision — the 5-agent target this paper informs
- 2026-04-10-akshay-pachaar-agent-harness-anatomy — companion article on agent harness architecture
- ../01-projects/automated-investing/experiments/consolidation-pass — the single-threaded starting point we’re building on
- Paper: https://arxiv.org/abs/2512.24601
Tracked author
../03-contacts — consider adding Ben Geist (@b_geist) and Ramp Labs (@RampLabs) to the CRM when we open that task (#4). Ramp Labs is publishing substantive research, worth following.