Ray Data Co Agentic Team Architecture (v1)

Why this exists

Ray (the always-on Claude Code agent on the Mac mini) is RDCO’s operator. When Ray hangs (stuck on a permission prompt, a tool that won’t return, or a pane that fell out of conversation focus), the founder has to remote in via Tailscale, attach to tmux, and unstick it by hand. That works for a technical operator and fails for everyone else. The single-agent setup is also single-fault: if Ray’s process is its own supervisor, nothing outside it can notice the hang, and there is no recursion stop.

This document captures the v1 multi-agent supervision architecture decided 2026-04-19 over Discord. The goal is a team that recovers itself from common failure modes without the founder in the loop, while keeping a clean recursion floor (launchd) and minimizing the surface area where flat-shared state can collide.

The team

Three named Claude Code agents, each in its own tmux pane on the Mac mini, each with its own bash watchdog and launchd job underneath.

Ray — primary operator. iMessage + Discord. Owns scheduled-jobs.txt and all financial-system credentials. This is the current production agent; the team is being built around it.
Felix — IT support specialist. Discord-only, responds when @tagged or paged by Ray’s bash watchdog. Watches Ray and any future specialty agents. Cannot send iMessage. Cannot touch financial credentials. Can read tmux state, query Notion/vault/graph, and run safe diagnostic skills.
Felix Jr — second-line backup IT support. Discord-only, mutual peer to Felix. Watches Felix (so Felix isn’t watching itself). Same restricted permission surface as Felix. Exists so that when Felix is the one stuck, there is a peer to escalate to before paging the founder.

Escalation order on a hang:

Bash watchdog attempts deterministic recovery (safe-pattern dismissal, send Enter, send Esc).
If recovery fails, watchdog pages the peer Claude (Ray hangs → page Felix; Felix hangs → page Felix Jr; Felix Jr hangs → page Felix).
If peer can’t recover within a bounded window, page the founder on Discord.
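
The escalation order above can be sketched as a small routing function. This is an illustration in Python, not the watchdog itself (the source describes that as bash). The lowercase agent identifiers, the function name escalation_target, and the 600-second peer window are assumptions; the source only says "a bounded window".

```python
from typing import Optional

# Peer map from the escalation order above. The lowercase identifiers
# ("ray", "felix", "felix-jr") are naming assumptions for this sketch.
PEER = {"ray": "felix", "felix": "felix-jr", "felix-jr": "felix"}

# Assumption: the source only says "a bounded window"; 600 s is a placeholder.
PEER_WINDOW_S = 600.0


def escalation_target(
    agent: str,
    watchdog_recovered: bool,
    seconds_since_peer_page: Optional[float] = None,
    peer_window_s: float = PEER_WINDOW_S,
) -> Optional[str]:
    """Return who acts next: None (recovered), a peer agent, or "founder"."""
    if watchdog_recovered:
        return None  # step 1 worked, nothing to escalate
    if seconds_since_peer_page is None:
        return PEER[agent]  # step 2: the peer has not been paged yet
    if seconds_since_peer_page >= peer_window_s:
        return "founder"  # step 3: the peer had its window and the hang persists
    return PEER[agent]  # still inside the window, the peer keeps ownership
```

For example, escalation_target("ray", False) routes to "felix", and the same call with seconds_since_peer_page=700 routes to "founder". Keeping the peer map as data makes the mutual Felix / Felix Jr pairing easy to audit at a glance.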

The 3-tier supervision pattern

launchd (Apple's process supervisor — recursion stops here)
   └── bash watchdog (~80 lines, polls every 2 min)
         └── Claude Code agent (Ray | Felix | Felix Jr)

Why three tiers and not more. The temptation is to have Claude watch Claude watch Claude. That is recursion without a floor, and a flaky model becomes a flaky watchdog. Instead:

launchd is dumb, deterministic, and Apple-supported. It restarts the bash watchdog if it dies. It does not understand “stuck” — only “exited.”
Bash watchdog is ~80 lines, no model, no judgment. It reads ~/.claude/state/heartbeat-<agent>.txt, checks staleness, runs an allowlist of safe deterministic recoveries (send Enter to dismiss a known-safe permission prompt, kill+respawn the pane on a hard hang), and if those fail, it sends a Discord ping to the peer Claude with the diagnostic packet. The bash layer is the only thing that decides “Claude is stuck” — Claude never decides that about itself.
Peer Claude is intelligent recovery. Once paged with the diagnostic packet, the peer can read tmux state, query the vault for prior similar incidents, run /diagnose-primary, and either fix or escalate to the founder.
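
The "allowlist of safe deterministic recoveries" in the bash layer can be illustrated with the sketch below. It is written in Python for readability and is not the actual watchdog. The prompt pattern is a placeholder (the source does not list the known-safe prompts), and the names SAFE_PATTERNS, pick_recovery_key, and tmux_send_keys_cmd are hypothetical. The only external commands assumed are tmux capture-pane and tmux send-keys.

```python
import re
from typing import List, Optional

# Hypothetical allowlist: (pattern expected in captured pane text, tmux key).
# The real known-safe prompts are not given in the source.
SAFE_PATTERNS = [
    (re.compile(r"Press Enter to continue", re.IGNORECASE), "Enter"),
]


def pick_recovery_key(pane_text: str) -> Optional[str]:
    """Return the key to send if the pane shows an allowlisted prompt.

    Anything not on the allowlist returns None: the watchdog makes no
    judgment call and should escalate instead.
    """
    for pattern, key in SAFE_PATTERNS:
        if pattern.search(pane_text):
            return key
    return None


def tmux_capture_cmd(target: str) -> List[str]:
    """Argument list that prints the pane contents to stdout."""
    return ["tmux", "capture-pane", "-p", "-t", target]


def tmux_send_keys_cmd(target: str, key: str) -> List[str]:
    """Argument list that sends one key (for example Enter) to the pane."""
    return ["tmux", "send-keys", "-t", target, key]
```

The design point this tries to capture is that the deterministic layer matches text and nothing else. A prompt the allowlist does not recognise is never dismissed; it falls through to the peer page.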

Heartbeat mechanism: a PostToolUse hook in each agent’s settings.json writes the current epoch to ~/.claude/state/heartbeat-<agent>.txt on every tool call. The watchdog checks the file’s mtime on each poll; if the mtime is older than the staleness threshold, the agent is presumed stuck.
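
A minimal sketch of the staleness check, in Python for illustration (the source assigns this job to the bash watchdog). The 600-second threshold is an assumption, since the source does not state one, and treating a missing heartbeat file as stuck is also an assumption. The function names are hypothetical.

```python
import os
import time
from pathlib import Path
from typing import Optional

STATE_DIR = Path.home() / ".claude" / "state"


def heartbeat_path(agent: str) -> Path:
    """Heartbeat file for one agent, following the naming in the text."""
    return STATE_DIR / f"heartbeat-{agent}.txt"


def heartbeat_age_s(path: Path, now: Optional[float] = None) -> Optional[float]:
    """Seconds since the heartbeat file was last modified, or None if absent."""
    try:
        mtime = path.stat().st_mtime
    except FileNotFoundError:
        return None
    current = time.time() if now is None else now
    return current - mtime


def is_stuck(path: Path, threshold_s: float = 600.0, now: Optional[float] = None) -> bool:
    """True if the heartbeat is missing or older than the threshold.

    Assumptions: 600 s is a placeholder threshold, and a missing file
    counts as stuck. Neither value comes from the source.
    """
    age = heartbeat_age_s(path, now)
    return age is None or age > threshold_s
```

One consequence worth keeping in mind: an agent that is idle, or that is inside a single long tool call, writes no heartbeat either, so under this check it looks the same as a hung one. How the threshold accounts for that is left open by the source.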

Hardening boundaries (where flat breaks)

The default is flat: all three agents share ~/.claude/skills/, ~/rdco-vault/, ~/.claude/state/graph.duckdb, and the MCP server fleet. Differentiation and hardening happen in six places:

Differentiated SOUL.md + CLAUDE.md per agent. This is the role-definition surface. Felix’s SOUL is “diagnostic IT specialist, terse, escalate fast.” Ray’s SOUL is the existing COO posture. Felix Jr mirrors Felix.
Channel ownership. Only Ray writes to iMessage. Only Ray sends scheduled outbound on Discord. Felix and Felix Jr are reactive: @-tag or watchdog-page only.
Cron ownership. Only Ray modifies scheduled-jobs.txt. Felix and Felix Jr can read it (to diagnose missed jobs) but cannot write.
Financial-system credentials. Only Ray has Monarch / AWS billing / Cloudflare billing credentials, accessed via 1Password CLI. Felix and Felix Jr have a separate, narrower 1Password vault item that excludes financial scopes. This is the one place the multi-agent setup explicitly buys safety: a stuck or hijacked Felix cannot drain a bank account.
MCP auth tokens are per-agent. Same MCP servers, different tokens, so the audit trail can attribute every tool call to the right agent.
Vault writes for audit trail. All three agents write to ~/rdco-vault/05-meetings/diagnostics/<date>-<agent>-<topic>.md whenever they make a recovery decision. This gives us a reviewable record of what happened during each incident.
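
The boundaries above can be summarised as a capability table, sketched here in Python. This is only a reading aid: the source enforces these boundaries through separate credentials, tokens, and config rather than an in-process lookup. The capability names, the lowercase agent identifiers, and the ISO date format in the diagnostic filename are assumptions.

```python
from datetime import date
from pathlib import Path

# Hypothetical capability names derived from the boundaries listed above.
_RESTRICTED = frozenset({"discord.reply", "jobs.read", "vault.write_diagnostic"})

CAPABILITIES = {
    "ray": _RESTRICTED | {
        "imessage.send",           # only Ray writes to iMessage
        "discord.scheduled_send",  # only Ray sends scheduled outbound
        "jobs.write",              # only Ray modifies scheduled-jobs.txt
        "finance.credentials",     # only Ray holds financial credentials
    },
    "felix": _RESTRICTED,
    "felix-jr": _RESTRICTED,  # mirrors Felix
}


def is_allowed(agent: str, capability: str) -> bool:
    """True if the agent holds the capability; unknown agents hold none."""
    return capability in CAPABILITIES.get(agent, frozenset())


def diagnostic_path(agent: str, topic: str, day: date) -> Path:
    """Audit-trail file path: <date>-<agent>-<topic>.md.

    Assumption: <date> is ISO formatted (YYYY-MM-DD).
    """
    name = f"{day.isoformat()}-{agent}-{topic}.md"
    return Path.home() / "rdco-vault" / "05-meetings" / "diagnostics" / name
```

Under these assumptions, is_allowed("felix", "finance.credentials") is False, and diagnostic_path("felix", "ray-hang", date(2026, 4, 19)) ends in 2026-04-19-felix-ray-hang.md.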

Build sequence

Three build tiers (rollout stages, not the supervision tiers above), sequenced so each step’s failure mode is contained.

Tier 1 — this week, after AWS audit lands. Bash watchdog for Ray’s existing pane. PostToolUse heartbeat hook in Ray’s settings.json. Safe-pattern dismissal allowlist (the known permission prompts that are always-safe to Enter through). Escalation logic that pages the founder on Discord (Felix doesn’t exist yet). Launchd plist that supervises the watchdog. Target: Ray recovers from the most common 80% of hangs without founder involvement.

Tier 2 — next week. Felix tmux pane. Separate Claude Code config dir (~/.claude-felix/) to keep state hermetic. Narrow CLAUDE.md and SOUL.md (diagnostic-specialist posture, Discord-only). Launchd job. /diagnose-primary skill that knows how to read Ray’s heartbeat, tmux capture, and recent vault diagnostics. Separate 1Password vault item with no financial scope. Update Ray’s watchdog to page Felix instead of the founder as the second step.

Tier 3 — next week, after Felix is stable for ~3 days. Felix Jr — mirror of Felix with the watching-Felix responsibility. At this point the team is closed: every agent has a peer that can recover it.

Open questions for future iteration

SMS escalation for the friend’s variant of this setup (they don’t run iMessage; need Twilio).
Tailscale-tunneled web view of tmux so the founder can dismiss prompts from his phone without SSH’ing in. Currently deferred — the multi-agent recovery should reduce demand for this.
Multi-agent skill-conflict resolution. If Ray and Felix both try to write the same vault file in the same minute, what wins? File locking? Last-write-wins with diff in the diagnostic log? Probably moot in practice (Felix is reactive only) but needs a documented policy before Tier 3.
Audit trail review cadence. Diagnostics will pile up in 05-meetings/diagnostics/. Weekly review during Sunday planning, monthly summary into the tooling log? TBD.
SOUL.md / CLAUDE.md template. We’re about to write three of these. A shared template with role-specific overrides (and a doc explaining the surface) would prevent drift. Likely a follow-up tooling doc once Felix exists.

This architecture is the harness applied to the harness operator — see the harness-thesis cluster (2026-04-19-garry-tan-build-the-car-jepsen-response) and the state-ownership architecture (rdco-state-ownership-architecture) for the principles being instantiated here. The visualization-and-recall side of the founder’s mental model for this team is in 2026-04-19-jaynit-neuroscience-of-visualization — naming the agents (Ray / Felix / Felix Jr) is itself a recall-and-handle hack consistent with that note.