agentic team architecture

Sat Apr 18 2026 20:00:00 GMT-0400 (Eastern Daylight Time) · tooling-decision · status: approved-design

Ray Data Co Agentic Team Architecture (v1)

Why this exists

Ray (the always-on Claude Code agent on the Mac mini) is RDCO’s operator. When Ray hangs — stuck on a permission prompt, a tool that won’t return, a pane that fell out of conversation focus — the founder has to remote in via Tailscale, attach to tmux, and unstick it by hand. That works for a technical operator and fails for everyone else. The single-agent setup is also single-fault: if Ray’s process is the supervisor of itself, there is no recursion stop. This document captures the v1 multi-agent supervision architecture decided 2026-04-19 over Discord. The goal is a team that recovers itself from common failure modes without the founder in the loop, while keeping a clean recursion floor (launchd) and minimizing the surface area where flat-shared state can collide.

The team

Three named Claude Code agents, each in its own tmux pane on the Mac mini, each with its own bash watchdog and launchd job underneath.

Escalation order on a hang:

  1. Bash watchdog attempts deterministic recovery (safe-pattern dismissal, send Enter, send Esc).
  2. If recovery fails, watchdog pages the peer Claude (Ray hangs → page Felix; Felix hangs → page Felix Jr; Felix Jr hangs → page Felix).
  3. If peer can’t recover within a bounded window, page the founder on Discord.
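
The escalation order above can be sketched as a small bash function. This is a minimal sketch, not the actual watchdog: the agent identifiers (ray, felix, felix-jr), the placeholder functions, and the 600-second default window are all assumptions made for illustration, since the note only says "a bounded window".

```shell
#!/usr/bin/env bash
# Sketch of the escalation ladder. Every name below is hypothetical.

# Peer routing, copied from the escalation order in this note.
peer_for() {
  case "$1" in
    ray)      echo "felix" ;;
    felix)    echo "felix-jr" ;;
    felix-jr) echo "felix" ;;
    *)        return 1 ;;
  esac
}

# Placeholders: a real watchdog would replace these bodies.
try_deterministic_recovery() { return 1; }  # step 1: safe-pattern dismissal, Enter, Esc
agent_recovered()            { return 1; }  # e.g. the heartbeat is fresh again
page_peer()                  { echo "page peer: $1 about $2"; }
page_founder()               { echo "page founder about $1"; }

# escalate AGENT [WINDOW_SECONDS] [POLL_SECONDS]
# Returns 0 if the agent recovered, 2 if the founder had to be paged.
escalate() {
  local agent="$1" window_s="${2:-600}" poll_s="${3:-30}"
  local peer waited=0

  # Step 1: deterministic recovery by the watchdog itself.
  try_deterministic_recovery "$agent" && return 0

  # Step 2: page the peer; with no known peer, go straight to the founder.
  peer="$(peer_for "$agent")" || { page_founder "$agent"; return 2; }
  page_peer "$peer" "$agent"

  # Give the peer a bounded window before escalating further.
  while [ "$waited" -lt "$window_s" ]; do
    agent_recovered "$agent" && return 0
    sleep "$poll_s"
    waited=$(( waited + poll_s ))
  done

  # Step 3: the peer did not recover the agent in time.
  page_founder "$agent"
  return 2
}
```

Keeping the routing in one `peer_for` function means adding Felix Jr later is a one-line change to the table rather than a change to the ladder itself.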

The 3-tier supervision pattern

launchd (Apple's process supervisor — recursion stops here)
   └── bash watchdog (~80 lines, polls every 2 min)
         └── Claude Code agent (Ray | Felix | Felix Jr)

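Why three tiers and not more. The temptation is to have Claude watch Claude watch Claude. That is recursion without a floor, and a flaky model becomes a flaky watchdog. Instead, each agent is watched by a small deterministic bash script, and that script is supervised by launchd, where the recursion stops.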

Heartbeat mechanism: a PostToolUse hook in each agent’s settings.json writes the current epoch to ~/.claude/state/heartbeat-<agent>.txt on every tool call. Watchdog reads file mtime. If stale beyond threshold, agent is presumed stuck.
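
Both sides of the heartbeat can be sketched in a few lines of bash. The file layout follows the note; the function names, the `HEARTBEAT_DIR` override, and the 600-second threshold in the usage line are assumptions, since the note does not state the threshold. A PostToolUse hook command could call `write_heartbeat`, and the watchdog could call `is_stale` on each poll.

```shell
#!/usr/bin/env bash
# Hypothetical helpers. The path layout follows the note; the rest is assumed.

HEARTBEAT_DIR="${HEARTBEAT_DIR:-$HOME/.claude/state}"

heartbeat_file() { echo "$HEARTBEAT_DIR/heartbeat-$1.txt"; }

# Hook side: write the current epoch, which also refreshes the file mtime.
write_heartbeat() {
  mkdir -p "$HEARTBEAT_DIR"
  date +%s > "$(heartbeat_file "$1")"
}

# File mtime in epoch seconds. Tries GNU stat first, then BSD/macOS stat.
file_mtime() {
  stat -c %Y "$1" 2>/dev/null || stat -f %m "$1"
}

# Watchdog side: is_stale AGENT THRESHOLD_SECONDS [NOW_EPOCH]
# Succeeds (exit 0) when the heartbeat is missing or older than the threshold.
is_stale() {
  local agent="$1" threshold_s="$2" now="${3:-$(date +%s)}"
  local file mtime
  file="$(heartbeat_file "$agent")"
  [ -f "$file" ] || return 0                 # no heartbeat yet: treat as stale
  mtime="$(file_mtime "$file")" || return 0  # unreadable mtime: treat as stale
  [ $(( now - mtime )) -gt "$threshold_s" ]
}

# Usage inside a poll loop (threshold value is an assumption):
#   if is_stale ray 600; then echo "ray presumed stuck"; fi
```

One caveat with a heartbeat driven by tool calls: an agent that is legitimately idle, or inside one long-running tool call, also stops writing, so the threshold has to be longer than the longest expected quiet period.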

Hardening boundaries (where flat breaks)

The default is flat: all three agents share ~/.claude/skills/, ~/rdco-vault/, ~/.claude/state/graph.duckdb, and the MCP server fleet. Differentiation happens in three places:

Build sequence

Three build tiers (distinct from the three supervision tiers above), sequenced so that each step’s failure mode is contained.

Tier 1 — this week, after AWS audit lands. Bash watchdog for Ray’s existing pane. PostToolUse heartbeat hook in Ray’s settings.json. Safe-pattern dismissal allowlist (the known permission prompts that are always-safe to Enter through). Escalation logic that pages the founder on Discord (Felix doesn’t exist yet). Launchd plist that supervises the watchdog. Target: Ray recovers from the most common 80% of hangs without founder involvement.
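
A safe-pattern dismissal check for Tier 1 might look like the sketch below. It assumes the allowlist is a text file with one extended regular expression per line; the function names and that file format are invented for illustration. The tmux calls (`capture-pane -p` and `send-keys`) are standard tmux commands.

```shell
#!/usr/bin/env bash
# Sketch only. The allowlist format (one ERE per line) is an assumption.

# is_safe_prompt ALLOWLIST_FILE PANE_TEXT
# Succeeds when one of the last few non-blank pane lines matches a pattern.
is_safe_prompt() {
  local allowlist="$1" pane_text="$2" tail_text pat
  [ -s "$allowlist" ] || return 1   # missing or empty allowlist: never dismiss

  # Only look at the bottom of the pane so old scrollback cannot trigger a match.
  tail_text="$(printf '%s\n' "$pane_text" | grep -v '^[[:space:]]*$' | tail -n 5)"

  while IFS= read -r pat; do
    [ -n "$pat" ] || continue       # a blank pattern would match everything
    printf '%s\n' "$tail_text" | grep -Eq -- "$pat" && return 0
  done < "$allowlist"
  return 1
}

# dismiss_if_safe TMUX_TARGET ALLOWLIST_FILE
# Sends Enter only when the visible prompt is on the allowlist.
dismiss_if_safe() {
  local target="$1" allowlist="$2" text
  text="$(tmux capture-pane -p -t "$target")" || return 1
  if is_safe_prompt "$allowlist" "$text"; then
    tmux send-keys -t "$target" Enter
  else
    return 1
  fi
}
```

The check fails closed: anything not explicitly allowlisted is left alone and falls through to the next escalation step, so an unfamiliar prompt is never answered blindly.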

Tier 2 — next week. Felix tmux pane. Separate Claude Code config dir (~/.claude-felix/) to keep state hermetic. Narrow CLAUDE.md and SOUL.md (diagnostic-specialist posture, Discord-only). Launchd job. /diagnose-primary skill that knows how to read Ray’s heartbeat, tmux capture, and recent vault diagnostics. Separate 1Password vault item with no financial scope. Update Ray’s watchdog to page Felix instead of the founder as the second step.

Tier 3 — next week, after Felix is stable for ~3 days. Felix Jr — mirror of Felix with the watching-Felix responsibility. At this point the team is closed: every agent has a peer that can recover it.

Open questions for future iteration

This architecture is the harness applied to the harness operator — see the harness-thesis cluster (2026-04-19-garry-tan-build-the-car-jepsen-response) and the state-ownership architecture (rdco-state-ownership-architecture) for the principles being instantiated here. The visualization-and-recall side of the founder’s mental model for this team is in 2026-04-19-jaynit-neuroscience-of-visualization — naming the agents (Ray / Felix / Felix Jr) is itself a recall-and-handle hack consistent with that note.