06-reference

ramp labs latent briefing

Thu Apr 09 2026 20:00:00 GMT-0400 (Eastern Daylight Time) · reference · source: X long-form article by @RampLabs · by Ben Geist (@b_geist) — Ramp Labs research

Latent Briefing — Ramp Labs on efficient multi-agent memory via KV cache compaction

Why this is in the vault

Founder flagged this as “good content for framing the agentic enablement and custom harness development.” Directly relevant to the 5-agent automated investing architecture we’re building toward, and to the broader Ray Data Co question of how to make multi-agent workflows cost-effective as they scale. The research is from Ramp (the spend management company), whose engineering blog publishes genuinely rigorous work rather than marketing.

The problem they solve

Multi-agent systems — where an orchestrator decomposes a task and calls worker agents — have a token explosion problem. Every call to a worker requires passing context, and verbose orchestrator reasoning accumulates across calls. Meanwhile, the worker sees only what the orchestrator explicitly passes it: often a narrow slice of context that misses important cross-reference information the orchestrator has already discovered.

Existing fixes all come with trade-offs.

What they built

Latent Briefing: operate directly on the worker model’s KV cache rather than on text. When the orchestrator calls the worker, the worker’s forward pass computes attention scores between the orchestrator’s task prompt and the accumulated trajectory. Those scores identify which parts of the trajectory the worker considers relevant to this specific task. The irrelevant parts are discarded at the representation level before the worker generates its answer.
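
A minimal sketch of that core step, in PyTorch-style code of our own (function and tensor names are assumptions, not Ramp’s code): score every cached trajectory position by the attention the task-prompt queries pay to it, then prune the KV cache before generation. For simplicity this keeps a fixed top fraction per layer; the paper’s MAD thresholding replaces that, as sketched after the list below.

```python
import torch

def compact_layer_kv(task_q, keys, values, keep_ratio=0.5):
    """Prune one layer's KV cache using attention from task-prompt queries.

    task_q: (heads, q_len, d)  queries derived from the orchestrator's task prompt
    keys:   (heads, seq, d)    cached keys for the accumulated trajectory
    values: (heads, seq, d)    cached values for the accumulated trajectory
    """
    d = task_q.shape[-1]
    # Attention of the task queries over every cached trajectory position.
    attn = torch.softmax(task_q @ keys.transpose(-1, -2) / d**0.5, dim=-1)
    # Per-position relevance: total attention each position receives,
    # summed over heads and task-query positions.
    relevance = attn.sum(dim=(0, 1))                      # (seq,)
    k = max(1, int(keep_ratio * relevance.numel()))
    keep = relevance.topk(k).indices.sort().values        # keep original order
    return keys[:, keep, :], values[:, keep, :]
```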

Three key modifications to the existing Attention Matching (AM) compaction framework (modifications 2 and 3 are sketched in code right after the list):

  1. Task-guided query vectors. Instead of sampling queries from the context itself, they use queries derived from the orchestrator’s task prompt. Compaction therefore preserves the parts of the trajectory most relevant to this particular worker call, not just the parts that are important in general.

  2. Shared token selection via global scoring. Instead of each attention head independently selecting its own top-t keys (which blocks GPU batching), they aggregate scores across all layers and heads into a single per-position relevance score. A single shared mask lets them batch all solves into one tensor operation.

  3. MAD-normalized thresholding. Instead of a fixed top-k, they use a statistically derived cutoff based on the median absolute deviation: keep every position that scores above median + threshold × MAD. This is more robust to outliers than top-k, and the threshold parameter naturally controls aggressiveness across different context lengths.
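
Putting modifications 2 and 3 together, the combined sketch promised above (our reconstruction from the description, not Ramp’s code; the threshold default is an assumption): collapse per-head scores into one global vector, then keep positions above median + threshold × MAD.

```python
import torch

def shared_keep_mask(per_layer_scores, threshold=3.0):
    """One shared keep-mask from per-layer, per-head relevance scores.

    per_layer_scores: list of (heads, seq) tensors, one per layer.
    Returns a boolean (seq,) mask applied identically to every layer and
    head, so compaction runs as one batched tensor op, not per-head solves.
    """
    # Modification 2: global scoring. Collapse layers and heads into a
    # single per-position relevance vector.
    global_score = torch.stack(per_layer_scores).sum(dim=(0, 1))  # (seq,)

    # Modification 3: MAD-normalized cutoff instead of a fixed top-k.
    med = global_score.median()
    mad = (global_score - med).abs().median()
    return global_score > med + threshold * mad
```

A higher threshold keeps fewer positions (aggressive compaction) and a lower one keeps more; that is the knob the regime discussion below is about.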

Results on LongBench v2

Evaluated on 126 questions, with Claude Sonnet 4 as the orchestrator and Qwen3-14B as the worker.

The original AM framework took 30+ seconds to compact a cache (sequential per-head processing on an A100). Ramp’s batched approach brings it down to ~1.7s median, making it viable for real-time agent workloads.

Threshold behaves differently across regimes

The interesting empirical finding is about when to compact aggressively versus lightly: the right threshold depends on what kind of work the context is doing.

The author’s analogy: sometimes you’re building a body of knowledge where details accumulate into something larger (keep them); sometimes you’re sketching, and most of what gets written isn’t meant to last (compact aggressively).

Limitations (author’s own)

Why it matters for Ray Data Co

For the automated investing 5-agent vision: as we grow from a single-process single-agent setup into a Strategy Research / Paper Testing / Execution / Monitor / Reporting split, cross-agent context cost will become a bottleneck. The founder’s concern about margins being eaten by API costs applies here too — if each strategy research cycle requires passing the entire project history between agents, token cost grows quadratically in session length. Latent Briefing is the kind of primitive that would let us scale multi-agent workflows without the cost explosion.
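
A back-of-envelope illustration of that growth (all numbers hypothetical, not from the article):

```python
# Hypothetical numbers, purely illustrative.
tokens_per_step = 2_000   # new tokens added to the shared history per agent call
calls = 50

# If call i re-sends the full history so far: sum_{i=1..N} tokens_per_step * i
full_history = sum(tokens_per_step * i for i in range(1, calls + 1))
# If compaction held each call's context roughly constant instead:
compacted = tokens_per_step * calls

print(full_history)  # 2,550,000 input tokens across 50 calls
print(compacted)     # 100,000
```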

For the broader Ray Data Co thesis of “building products for agents”: this is a primitive that agent harness vendors will want. If we end up building infrastructure for other people’s agents (MCP servers, harness plugins), efficient cross-agent memory sharing is a real need. Worth tracking as a building block.

There are practical caveats to work through before we’d adopt it.

Alternatives already in use

The pattern in Claude Code / autoinv today is lightweight bridge notes (see the ../.claude-code/state/working-context.md pattern): Ray maintains a small markdown scratchpad that survives compaction via PreCompact/SessionStart hooks. That’s a primitive version of the same idea — preserve the relevant context across session boundaries — but at the text level, not the KV cache level. It works because we’re orchestrator + human, not orchestrator + worker agents.
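
For contrast, the whole bridge-note mechanism reduces to something this small (a sketch of our own pattern; the function names are ours, and the hooks simply invoke commands that do the equivalent of these calls, so this is not a Claude Code API):

```python
from pathlib import Path

# Scratchpad that survives compaction (path per the pattern above).
NOTE = Path(".claude-code/state/working-context.md")

def pre_compact(summary: str) -> None:
    """Invoked via the PreCompact hook: persist what must survive the reset."""
    NOTE.parent.mkdir(parents=True, exist_ok=True)
    NOTE.write_text(summary)

def session_start() -> str:
    """Invoked via the SessionStart hook: re-inject the scratchpad."""
    return NOTE.read_text() if NOTE.exists() else ""
```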

Tracked author

../03-contacts — consider adding Ben Geist (@b_geist) and Ramp Labs (@RampLabs) to the CRM when we open that task (#4). Ramp Labs is publishing substantive research, worth following.