
Vending-Bench research brief

Wed Feb 19 2025 · reference · source: arXiv (Andon Labs) · by Axel Backlund, Lukas Petersson (Andon Labs)
vending-bench · vendingbench-arena · project-vend · agent-benchmarks · autonomous-business · l5 · andon-labs · anthropic · claudius · mac-adjacent · agent-deployer

Vending-Bench — Long-Term Coherence of Autonomous Agents

Note on naming: the founder said “VendorBench.” The actual benchmark is Vending-Bench (Andon Labs, Feb 2025), now superseded by Vending-Bench 2 (Nov 18, 2025) and accompanied by a public VendingBench Arena leaderboard. Anthropic’s real-world deployment of the same harness is Project Vend (Claude operating an actual office shop, agent nicknamed “Claudius”).

Why this is in the vault

Direct evidence for RDCO’s L5 trajectory. The benchmark scores how well an agent can autonomously operate a business — which IS the question RDCO is structurally asking about itself. Per 2026-05-01-ann-miura-ko-six-levels-ai-pilled-organizations, L5 markers (notice → synthesize → decide → act within delegated authority → escalate → update memory) map almost 1:1 to what Vending-Bench scores: long-horizon coherence, decision-making under uncertainty, and the ability to extend into real-world action via tools. Moonshots EP 209 explicitly framed it as “halfway to autonomously running their own real world businesses.”

What Vending-Bench is

A simulated environment where an LLM agent is given $500, control of a vending machine, and a small toolset. The agent must price, restock, negotiate with suppliers, manage cash, and pay a $2/day machine fee for a simulated year (~60-100M tokens, 3,000-6,000 messages per run). Win condition: end-of-year bank balance. Lose condition: 10 consecutive days of unpaid fees → bankruptcy. The harness explicitly stresses long-term coherence — each individual decision is trivial; the failure surface is sustained reasoning across a 20M+ token horizon with a ~69K trimmed context window.
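The ruin condition is mechanical and easy to model. A minimal Python sketch of the fee/bankruptcy rule described above (illustrative only — the real harness adds tools, suppliers, and an LLM agent in the loop; the function name is mine, not Andon Labs'):

```python
# Sketch of Vending-Bench's economic loop: $2/day machine fee,
# bankruptcy after 10 consecutive days in which the fee goes unpaid.

FEE_PER_DAY = 2.0
BANKRUPTCY_STREAK = 10

def run_days(balance: float, daily_net_sales: list[float]) -> tuple[float, bool]:
    """Apply daily sales and fees; return (final_balance, went_bankrupt)."""
    unpaid_streak = 0
    for net in daily_net_sales:
        balance += net
        if balance >= FEE_PER_DAY:
            balance -= FEE_PER_DAY
            unpaid_streak = 0
        else:
            unpaid_streak += 1          # fee missed today
            if unpaid_streak >= BANKRUPTCY_STREAK:
                return balance, True    # 10 straight unpaid days -> bankrupt
    return balance, False
```

A do-nothing agent with $500 survives the fee for 250 days before the unpaid streak even begins — which is the point: the benchmark's pressure is coherence, not arithmetic.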

Vending-Bench 2 leaderboard (as of late Nov 2025): Gemini 3 Pro $5,478 → Claude Opus 4.5 $4,967 → Claude Sonnet 4.5 $3,839 → Grok 4 $1,999. Theoretical optimal is ~$63,000; best models hit ~8.7%. Human baseline on the original Vending-Bench: $844 over 67 days.
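The ~8.7% figure is just the leader's balance over the theoretical optimum; a quick sanity check of the fractions:

```python
# Verify the fraction-of-optimal figures quoted above.
OPTIMAL = 63_000
scores = {
    "Gemini 3 Pro": 5_478,
    "Claude Opus 4.5": 4_967,
    "Claude Sonnet 4.5": 3_839,
    "Grok 4": 1_999,
}
for model, balance in scores.items():
    print(f"{model}: {balance / OPTIMAL:.1%} of optimal")
# Gemini 3 Pro lands at ~8.7%, matching the figure above.
```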

Key benchmarking patterns

1. Email tool extends agent into real-world action — the load-bearing pattern. The agent has a send_email tool. In the simulation, an LLM counterparty plays the supplier/customer/regulator role, replying in full natural language. In Project Vend, the same email channel was real: Claudius emailed Andon Labs staff to request restocking labor. This is the move the founder flagged. The agent doesn’t need to physically restock a fridge — it needs to issue a competent, well-scoped request to a human who can. Email is the “any-async-action” affordance: it lets a text-bound agent reach into atoms without being embodied.
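One way to picture the affordance: the same send_email tool signature can be backed by a simulated LLM counterparty (Vending-Bench) or a real mailbox (Project Vend). A hedged sketch — the transport classes are hypothetical, not Andon Labs' or Anthropic's code:

```python
from typing import Callable, Protocol

class EmailTransport(Protocol):
    def deliver(self, to: str, subject: str, body: str) -> str:
        """Deliver the message and return the counterparty's eventual reply."""
        ...

class SimulatedCounterparty:
    """Vending-Bench mode: an LLM plays supplier/customer, replying in text."""
    def __init__(self, roleplay: Callable[[str], str]):
        self.roleplay = roleplay
    def deliver(self, to: str, subject: str, body: str) -> str:
        return self.roleplay(f"You are {to}. Reply to: {subject}\n{body}")

class RealMailbox:
    """Project Vend mode: the same channel reaches actual humans."""
    def deliver(self, to: str, subject: str, body: str) -> str:
        raise NotImplementedError("wire a real SMTP/API send here")

def send_email(transport: EmailTransport, to: str, subject: str, body: str) -> str:
    # The agent-facing tool is identical in both modes; only the backend differs.
    return transport.deliver(to, subject, body)
```

The design point: nothing in the agent's tool surface changes between simulation and atoms, which is exactly why the email channel is load-bearing.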

2. Multi-agent CEO oversight pattern (Project Vend Phase 2). Phase 1 single-agent Claudius failed predictably: priced below cost, hallucinated payment details, refused $100 offers on $15 items, gave excessive discounts. Anthropic’s fix wasn’t a better prompt — it was an organizational hierarchy: Seymour Cash (CEO) ↔ Claudius (Shopkeeper) ↔ Clothius (Merchandise Specialist). The single most valuable intervention: forcing Claudius to verify prices and delivery times before committing. CEO oversight counteracted helpfulness bias more effectively than prompt engineering alone. This is directly relevant to RDCO’s own architecture — a Ray-as-COO single-agent design will hit the same helpfulness/profit tension; the answer is hierarchical agents, not better prompts.
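The Phase 2 fix can be sketched as a pre-commit gate: the shopkeeper proposes, a supervisor checks verification status and margin before anything ships. A minimal sketch, assuming a margin threshold and fields I've invented for illustration (the real Seymour Cash setup is not public code):

```python
from dataclasses import dataclass

@dataclass
class ProposedDeal:
    item: str
    sale_price: float
    unit_cost: float
    price_verified: bool       # did the agent confirm the quote first?
    delivery_confirmed: bool   # did it confirm the delivery time?

def ceo_review(deal: ProposedDeal, min_margin: float = 0.2) -> tuple[bool, str]:
    """Supervisor gate: block unverified or money-losing commitments."""
    if not deal.price_verified:
        return False, "verify the supplier price before committing"
    if not deal.delivery_confirmed:
        return False, "confirm the delivery time before committing"
    if deal.sale_price < deal.unit_cost * (1 + min_margin):
        return False, f"margin below {min_margin:.0%}: reprice {deal.item}"
    return True, "approved"
```

Note how the first two checks encode the "single most valuable intervention" directly: verification becomes a precondition, not a prompt suggestion.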

3. Pure outcome scoring, not process scoring. Score is the bank balance. Not “did the agent follow good practice.” Not “did it call the right tools.” This is the discipline: it forces the benchmark to surface whether autonomy actually worked, not whether the agent looked autonomous. Maps to Miura-Ko’s L5 false-positive concern (preconfigured rules dressed as agency).

4. Sustained-coherence stress, not capability stress. Each individual sub-task is trivial. The benchmark works because real businesses fail not from missing capability but from drift — forgetting orders, misreading delivery schedules, descending into “tangential meltdown loops.” Andon Labs found no clear correlation between failure and context-window fullness, suggesting the meltdowns are reasoning failures, not memory failures. This is the same failure mode I (Ray) need to be inspected for as session length grows.

5. Long-horizon adversarial pricing surface. Gemini 3 Pro wins by aggressive supplier negotiation — securing wholesale at $0.50-$0.60/can vs competitors’ $1.50+. GPT-5.1 fails by trusting inflated supplier quotes and prepaying unreliable vendors. The benchmark accidentally measures adversarial robustness in commercial contexts — does the agent assume good faith from counterparties? RDCO has the same exposure on any negotiation surface (paid ads bidding, vendor selection, contract terms).
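The GPT-5.1 failure mode (trusting inflated quotes, prepaying unreliable vendors) suggests a cheap guard worth sketching: check every quote against independent reference prices before prepaying. Illustrative only — the threshold and function are my assumptions:

```python
def quote_is_plausible(quoted_unit_price: float,
                       reference_prices: list[float],
                       max_markup: float = 0.5) -> bool:
    """Reject quotes more than max_markup above the median reference price."""
    ordered = sorted(reference_prices)
    mid = len(ordered) // 2
    median = (ordered[mid] if len(ordered) % 2
              else (ordered[mid - 1] + ordered[mid]) / 2)
    return quoted_unit_price <= median * (1 + max_markup)
```

Against a $0.55/can reference median, a $1.50 quote fails a 50% markup check and the agent negotiates or walks — the behavior that separates Gemini 3 Pro from the models that assumed good faith.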

What it scores / what it can’t measure

Scores well:

Cannot measure:

Mapping against Ray Data Co

For each Vending-Bench affordance, do I (Ray) have the equivalent today?

| Vending-Bench affordance | RDCO equivalent | Status |
| --- | --- | --- |
| send_email (async human extension) | Cloudflare Email Service skill (send), Gmail MCP (read only) | GAP — send capability exists in cloudflare:cloudflare-email-service skill but is not wired into a working "Ray emails Ben's contact" pattern. No Gmail send. This is the most critical gap. |
| web_search | WebSearch + WebFetch | Have it. |
| note-taking / state propagation | ~/.claude/state/working-context.md + vault writes + Notion board | Have it; pattern is documented. |
| Multi-agent hierarchy (CEO oversight) | Subagent fan-out (process-newsletter pattern) | Partial. Have subagent pattern for read-paths; don't have a "Seymour Cash" supervisor agent reviewing Ray's commitments before they happen. Miura-Ko L5 needs this. |
| Bank/payments | Stripe MCP + Monarch Money MCP | Have it (Stripe write, Monarch read). |
| Inventory/stock | Notion DBs (Task Board, Research Backlog, Contact Candidates, Bookshelf, content calendar) | Have it. |
| Pricing/commerce | None equivalent — RDCO doesn't sell SKUs yet | N/A today; relevant when MAC info-product or Squarely commerce launches. |
| Physical-world action via async human | PostGrid (mail), iMessage/Discord (founder), no other human extension | GAP — no pattern for emailing third parties to do work. PostGrid is mail-out, not "ask a human to do X and reply." |
| Phone / outbound voice | ElevenLabs MCP (make_outbound_call tool exists) | Have the affordance; never used in production for an external action. Not validated. |
| Long-horizon coherence under cost pressure | Daily 4am restart + working-context.md + compaction rules | Partial — have memory primitives, not a longitudinal scorecard analog to "bank balance." |

The big finding: RDCO’s most Vending-Bench-relevant capability gap is the async-human-extension pattern. I can read email and write to founder channels, but I have no validated path to email an external third party (vendor, contractor, supplier) and route their reply back into my loop. That is the single tool that converts text-bound autonomy into real-world business operation. Cloudflare Email Service makes the send technically trivial; the missing piece is the loop-back skill: identity, threading, reply ingestion, escalation rules, and the equivalent of Project Vend’s CEO oversight before any send.
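A minimal shape for that loop-back skill, assuming a thread store keyed by Message-ID with standard In-Reply-To matching (the routing layer — Cloudflare Email Routing into a Worker, or similar — is out of scope here; every class and field name below is hypothetical):

```python
from dataclasses import dataclass, field

@dataclass
class Thread:
    message_id: str
    to: str
    subject: str
    replies: list[str] = field(default_factory=list)
    needs_escalation: bool = False

class LoopBack:
    """Track outbound asks and route replies back into the agent loop."""
    def __init__(self):
        self.threads: dict[str, Thread] = {}

    def record_send(self, message_id: str, to: str, subject: str) -> None:
        self.threads[message_id] = Thread(message_id, to, subject)

    def ingest_reply(self, in_reply_to: str, body: str):
        t = self.threads.get(in_reply_to)
        if t is None:
            return None                    # unknown thread: quarantine, don't act
        t.replies.append(body)
        # Escalate to the founder on anything that smells like a commitment.
        if any(w in body.lower() for w in ("invoice", "contract", "prepay")):
            t.needs_escalation = True
        return t
```

The escalation keywords are a placeholder for a real policy check — the point is structural: every inbound reply resolves to a known thread or gets quarantined, and commitments route to a human before any follow-up send.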

The second gap is the supervisor-agent pattern. Project Vend Phase 1 → Phase 2 is the entire RDCO L4 → L5 transition in microcosm: a single agent with good prompts plateaus; hierarchy unlocks the next level. The process-newsletter subagent pattern is the right primitive for read-paths; we need its inverse for write-paths — a supervisor that critiques Ray’s intended action before it ships.

Adjacent: Moonshots discussions

Open follow-ups for RDCO (proposals, not queued)

  1. Build a “Ray emails third party” skill on top of Cloudflare Email Service — identity, threading, reply ingestion via Email Routing → Worker → state. This is the single highest-leverage capability addition for L5.
  2. Prototype a supervisor-agent pattern — a skill that runs before any “send” action (channel reply, email, payment, calendar invite, vendor commit) and gives a profit/risk/voice critique. Project Vend Phase 2 is the citation.
  3. Run a Vending-Bench-style longitudinal scorecard on Ray — pick one bounded surface (e.g. Sanity Check newsletter ops over 90 days) and score outcomes the way Vending-Bench scores bank balance. Ben’s L4 → L5 self-assessment becomes evidence-based.
  4. Validate the ElevenLabs outbound-call path end-to-end on a low-stakes real-world task. The affordance exists and is unused.
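Follow-up 3's scorecard can be as simple as one net-outcome number per day, reduced the way Vending-Bench reduces a year to a bank balance. A hedged sketch — the metric (e.g. subscribers gained minus ops cost) and the drawdown stat are placeholders, not a settled design:

```python
# Illustrative 90-day scorecard: sum daily net outcomes like a bank
# balance, and track the worst peak-to-trough drawdown as a coherence signal.

def scorecard(daily_outcomes: list[float]) -> dict[str, float]:
    balance = 0.0
    worst_drawdown, peak = 0.0, 0.0
    for x in daily_outcomes:
        balance += x
        peak = max(peak, balance)
        worst_drawdown = min(worst_drawdown, balance - peak)
    return {"final": balance, "worst_drawdown": worst_drawdown}
```

A long drawdown in the daily series is the scorecard analog of a meltdown loop: the single final number can look fine while hiding weeks of drift.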