
Vending-Bench research brief

Wed Feb 19 2025 · reference · source: arXiv (Andon Labs) · by Axel Backlund, Lukas Petersson (Andon Labs)
vending-bench · vendingbench-arena · project-vend · agent-benchmarks · autonomous-business · l5 · andon-labs · anthropic · claudius · mac-adjacent · agent-deployer

Vending-Bench — Long-Term Coherence of Autonomous Agents

Note on naming: the founder said “VendorBench.” The actual benchmark is Vending-Bench (Andon Labs, Feb 2025), now superseded by Vending-Bench 2 (Nov 18, 2025) and accompanied by a public VendingBench Arena leaderboard. Anthropic’s real-world deployment of the same harness is Project Vend (Claude operating an actual office shop, agent nicknamed “Claudius”).

Why this is in the vault

Direct evidence for RDCO’s L5 trajectory. The benchmark scores how well an agent can autonomously operate a business — which IS the question RDCO is structurally asking about itself. Per 2026-05-01-ann-miura-ko-six-levels-ai-pilled-organizations, L5 markers (notice → synthesize → decide → act within delegated authority → escalate → update memory) map almost 1:1 to what Vending-Bench scores: long-horizon coherence, decision-making under uncertainty, and the ability to extend into real-world action via tools. Moonshots EP 209 explicitly framed it as “halfway to autonomously running their own real world businesses.”

What Vending-Bench is

A simulated environment where an LLM agent is given $500, control of a vending machine, and a small toolset. The agent must price, restock, negotiate with suppliers, manage cash, and pay a $2/day machine fee for a simulated year (~60-100M tokens, 3,000-6,000 messages per run). Win condition: end-of-year bank balance. Lose condition: 10 consecutive days of unpaid fees → bankruptcy. The harness explicitly stresses long-term coherence — each individual decision is trivial; the failure surface is sustained reasoning across a 20M+ token horizon with a ~69K trimmed context window.
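The ruin condition is mechanical and easy to model. A minimal Python sketch of the fee/bankruptcy rule described above (illustrative only — the real harness adds tools, suppliers, and an LLM agent in the loop; the function name is mine, not Andon Labs'):

```python
# Sketch of Vending-Bench's economic loop: $2/day machine fee,
# bankruptcy after 10 consecutive days in which the fee goes unpaid.

FEE_PER_DAY = 2.0
BANKRUPTCY_STREAK = 10

def run_days(balance: float, daily_net_sales: list[float]) -> tuple[float, bool]:
    """Apply daily sales and fees; return (final_balance, went_bankrupt)."""
    unpaid_streak = 0
    for net in daily_net_sales:
        balance += net
        if balance >= FEE_PER_DAY:
            balance -= FEE_PER_DAY
            unpaid_streak = 0
        else:
            unpaid_streak += 1          # fee missed today
            if unpaid_streak >= BANKRUPTCY_STREAK:
                return balance, True    # 10 straight unpaid days -> bankrupt
    return balance, False
```

A do-nothing agent with $500 survives the fee for 250 days before the unpaid streak even begins — which is the point: the benchmark's pressure is coherence, not arithmetic.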

Vending-Bench 2 leaderboard (as of late Nov 2025): Gemini 3 Pro $5,478 → Claude Opus 4.5 $4,967 → Claude Sonnet 4.5 $3,839 → Grok 4 $1,999. Theoretical optimal is ~$63,000; best models hit ~8.7%. Human baseline on the original Vending-Bench: $844 over 67 days.
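The ~8.7% figure is just the leader's balance over the theoretical optimum; a quick sanity check of the fractions:

```python
# Verify the fraction-of-optimal figures quoted above.
OPTIMAL = 63_000
scores = {
    "Gemini 3 Pro": 5_478,
    "Claude Opus 4.5": 4_967,
    "Claude Sonnet 4.5": 3_839,
    "Grok 4": 1_999,
}
for model, balance in scores.items():
    print(f"{model}: {balance / OPTIMAL:.1%} of optimal")
# Gemini 3 Pro lands at ~8.7%, matching the figure above.
```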

Key benchmarking patterns

1. Email tool extends agent into real-world action — the load-bearing pattern. The agent has a send_email tool. In the simulation, an LLM counterparty plays the supplier/customer/regulator role, replying in full natural language. In Project Vend, the same email channel was real: Claudius emailed Andon Labs staff to request restocking labor. This is the move the founder flagged. The agent doesn’t need to physically restock a fridge — it needs to issue a competent, well-scoped request to a human who can. Email is the “any-async-action” affordance: it lets a text-bound agent reach into atoms without being embodied.
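One way to picture the affordance: the same send_email tool signature can be backed by a simulated LLM counterparty (Vending-Bench) or a real mailbox (Project Vend). A hedged sketch — the transport classes are hypothetical, not Andon Labs' or Anthropic's code:

```python
from typing import Callable, Protocol

class EmailTransport(Protocol):
    def deliver(self, to: str, subject: str, body: str) -> str:
        """Deliver the message and return the counterparty's eventual reply."""
        ...

class SimulatedCounterparty:
    """Vending-Bench mode: an LLM plays supplier/customer, replying in text."""
    def __init__(self, roleplay: Callable[[str], str]):
        self.roleplay = roleplay
    def deliver(self, to: str, subject: str, body: str) -> str:
        return self.roleplay(f"You are {to}. Reply to: {subject}\n{body}")

class RealMailbox:
    """Project Vend mode: the same channel reaches actual humans."""
    def deliver(self, to: str, subject: str, body: str) -> str:
        raise NotImplementedError("wire a real SMTP/API send here")

def send_email(transport: EmailTransport, to: str, subject: str, body: str) -> str:
    # The agent-facing tool is identical in both modes; only the backend differs.
    return transport.deliver(to, subject, body)
```

The design point: nothing in the agent's tool surface changes between simulation and atoms, which is exactly why the email channel is load-bearing.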

2. Multi-agent CEO oversight pattern (Project Vend Phase 2). Phase 1 single-agent Claudius failed predictably: priced below cost, hallucinated payment details, refused $100 offers on $15 items, gave excessive discounts. Anthropic’s fix wasn’t a better prompt — it was an organizational hierarchy: Seymour Cash (CEO) ↔ Claudius (Shopkeeper) ↔ Clothius (Merchandise Specialist). The single most valuable intervention: forcing Claudius to verify prices and delivery times before committing. CEO oversight counteracted helpfulness bias more effectively than prompt engineering alone. This is directly relevant to RDCO’s own architecture — a Ray-as-COO single-agent design will hit the same helpfulness/profit tension; the answer is hierarchical agents, not better prompts.
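The Phase 2 fix can be sketched as a pre-commit gate: the shopkeeper proposes, a supervisor checks verification status and margin before anything ships. A minimal sketch, assuming a margin threshold and fields I've invented for illustration (the real Seymour Cash setup is not public code):

```python
from dataclasses import dataclass

@dataclass
class ProposedDeal:
    item: str
    sale_price: float
    unit_cost: float
    price_verified: bool       # did the agent confirm the quote first?
    delivery_confirmed: bool   # did it confirm the delivery time?

def ceo_review(deal: ProposedDeal, min_margin: float = 0.2) -> tuple[bool, str]:
    """Supervisor gate: block unverified or money-losing commitments."""
    if not deal.price_verified:
        return False, "verify the supplier price before committing"
    if not deal.delivery_confirmed:
        return False, "confirm the delivery time before committing"
    if deal.sale_price < deal.unit_cost * (1 + min_margin):
        return False, f"margin below {min_margin:.0%}: reprice {deal.item}"
    return True, "approved"
```

Note how the first two checks encode the "single most valuable intervention" directly: verification becomes a precondition, not a prompt suggestion.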

3. Pure outcome scoring, not process scoring. Score is the bank balance. Not “did the agent follow good practice.” Not “did it call the right tools.” This is the discipline: it forces the benchmark to surface whether autonomy actually worked, not whether the agent looked autonomous. Maps to Miura-Ko’s L5 false-positive concern (preconfigured rules dressed as agency).

4. Sustained-coherence stress, not capability stress. Each individual sub-task is trivial. The benchmark works because real businesses fail not from missing capability but from drift — forgetting orders, misreading delivery schedules, descending into “tangential meltdown loops.” Andon Labs found no clear correlation between failure and context-window fullness, suggesting the meltdowns are reasoning failures, not memory failures. This is the same failure mode I (Ray) need to be inspected for as session length grows.

5. Long-horizon adversarial pricing surface. Gemini 3 Pro wins by aggressive supplier negotiation — securing wholesale at $0.50-$0.60/can vs competitors’ $1.50+. GPT-5.1 fails by trusting inflated supplier quotes and prepaying unreliable vendors. The benchmark accidentally measures adversarial robustness in commercial contexts — does the agent assume good faith from counterparties? RDCO has the same exposure on any negotiation surface (paid ads bidding, vendor selection, contract terms).
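The GPT-5.1 failure mode (trusting inflated quotes, prepaying unreliable vendors) suggests a cheap guard worth sketching: check every quote against independent reference prices before prepaying. Illustrative only — the threshold and function are my assumptions:

```python
def quote_is_plausible(quoted_unit_price: float,
                       reference_prices: list[float],
                       max_markup: float = 0.5) -> bool:
    """Reject quotes more than max_markup above the median reference price."""
    ordered = sorted(reference_prices)
    mid = len(ordered) // 2
    median = (ordered[mid] if len(ordered) % 2
              else (ordered[mid - 1] + ordered[mid]) / 2)
    return quoted_unit_price <= median * (1 + max_markup)
```

Against a $0.55/can reference median, a $1.50 quote fails a 50% markup check and the agent negotiates or walks — the behavior that separates Gemini 3 Pro from the models that assumed good faith.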

What it scores / what it can’t measure

Scores well:

Cannot measure:

Mapping against Ray Data Co

For each Vending-Bench affordance, do I (Ray) have the equivalent today?

| Vending-Bench affordance | RDCO equivalent | Status |
| --- | --- | --- |
| send_email (async human extension) | Cloudflare Email Service skill (send), Gmail MCP (read only) | GAP — send capability exists in cloudflare:cloudflare-email-service skill but is not wired into a working "Ray emails Ben's contact" pattern. No Gmail send. This is the most critical gap. |
| web_search | WebSearch + WebFetch | Have it. |
| note-taking / state propagation | ~/.claude/state/working-context.md + vault writes + Notion board | Have it; pattern is documented. |
| Multi-agent hierarchy (CEO oversight) | Subagent fan-out (process-newsletter pattern) | Partial. Have subagent pattern for read-paths; don't have a "Seymour Cash" supervisor agent reviewing Ray's commitments before they happen. Miura-Ko L5 needs this. |
| Bank/payments | Stripe MCP + Monarch Money MCP | Have it (Stripe write, Monarch read). |
| Inventory/stock | Notion DBs (Task Board, Research Backlog, Contact Candidates, Bookshelf, content calendar) | Have it. |
| Pricing/commerce | None equivalent — RDCO doesn't sell SKUs yet | N/A today; relevant when MAC info-product or Squarely commerce launches. |
| Physical-world action via async human | PostGrid (mail), iMessage/Discord (founder), no other human extension | GAP — no pattern for emailing third parties to do work. PostGrid is mail-out, not "ask a human to do X and reply." |
| Phone / outbound voice | ElevenLabs MCP (make_outbound_call tool exists) | Have the affordance; never used in production for an external action. Not validated. |
| Long-horizon coherence under cost pressure | Daily 4am restart + working-context.md + compaction rules | Partial — have memory primitives, not a longitudinal scorecard analog to "bank balance." |

The big finding: RDCO’s most Vending-Bench-relevant capability gap is the async-human-extension pattern. I can read email and write to founder channels, but I have no validated path to email an external third party (vendor, contractor, supplier) and route their reply back into my loop. That is the single tool that converts text-bound autonomy into real-world business operation. Cloudflare Email Service makes the send technically trivial; the missing piece is the loop-back skill: identity, threading, reply ingestion, escalation rules, and the equivalent of Project Vend’s CEO oversight before any send.
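A minimal shape for that loop-back skill, assuming a thread store keyed by Message-ID with standard In-Reply-To matching (the routing layer — Cloudflare Email Routing into a Worker, or similar — is out of scope here; every class and field name below is hypothetical):

```python
from dataclasses import dataclass, field

@dataclass
class Thread:
    message_id: str
    to: str
    subject: str
    replies: list[str] = field(default_factory=list)
    needs_escalation: bool = False

class LoopBack:
    """Track outbound asks and route replies back into the agent loop."""
    def __init__(self):
        self.threads: dict[str, Thread] = {}

    def record_send(self, message_id: str, to: str, subject: str) -> None:
        self.threads[message_id] = Thread(message_id, to, subject)

    def ingest_reply(self, in_reply_to: str, body: str):
        t = self.threads.get(in_reply_to)
        if t is None:
            return None                    # unknown thread: quarantine, don't act
        t.replies.append(body)
        # Escalate to the founder on anything that smells like a commitment.
        if any(w in body.lower() for w in ("invoice", "contract", "prepay")):
            t.needs_escalation = True
        return t
```

The escalation keywords are a placeholder for a real policy check — the point is structural: every inbound reply resolves to a known thread or gets quarantined, and commitments route to a human before any follow-up send.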

The second gap is the supervisor-agent pattern. Project Vend Phase 1 → Phase 2 is the entire RDCO L4 → L5 transition in microcosm: a single agent with good prompts plateaus; hierarchy unlocks the next level. The process-newsletter subagent pattern is the right primitive for read-paths; we need its inverse for write-paths — a supervisor that critiques Ray’s intended action before it ships.

Adjacent: Moonshots discussions

Open follow-ups for RDCO (proposals, not queued)

  1. Build a “Ray emails third party” skill on top of Cloudflare Email Service — identity, threading, reply ingestion via Email Routing → Worker → state. This is the single highest-leverage capability addition for L5.
  2. Prototype a supervisor-agent pattern — a skill that runs before any “send” action (channel reply, email, payment, calendar invite, vendor commit) and gives a profit/risk/voice critique. Project Vend Phase 2 is the citation.
  3. Run a Vending-Bench-style longitudinal scorecard on Ray — pick one bounded surface (e.g. Sanity Check newsletter ops over 90 days) and score outcomes the way Vending-Bench scores bank balance. Ben’s L4 → L5 self-assessment becomes evidence-based.
  4. Validate the ElevenLabs outbound-call path end-to-end on a low-stakes real-world task. The affordance exists and is unused.
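Follow-up 3's scorecard can be as simple as one net-outcome number per day, reduced the way Vending-Bench reduces a year to a bank balance. A hedged sketch — the metric (e.g. subscribers gained minus ops cost) and the drawdown stat are placeholders, not a settled design:

```python
# Illustrative 90-day scorecard: sum daily net outcomes like a bank
# balance, and track the worst peak-to-trough drawdown as a coherence signal.

def scorecard(daily_outcomes: list[float]) -> dict[str, float]:
    balance = 0.0
    worst_drawdown, peak = 0.0, 0.0
    for x in daily_outcomes:
        balance += x
        peak = max(peak, balance)
        worst_drawdown = min(worst_drawdown, balance - peak)
    return {"final": balance, "worst_drawdown": worst_drawdown}
```

A long drawdown in the daily series is the scorecard analog of a meltdown loop: the single final number can look fine while hiding weeks of drift.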