“How to really stop your agents from making the same mistakes” — Garry Tan
Why this is in the vault
Tan published the load-bearing engineering blueprint for the skillify pattern — the 10-step checklist that turns every agent failure into a structural fix that can’t recur. RDCO has ~22 skills written in the loose SKILL.md form but is operating at steps 1–2 of his 10-step gating. He has also released the open-source quality-gate engine (gbrain) that runs the full checklist as gbrain doctor. This is directional validation AND a concrete shopping list of upgrades to RDCO’s harness.
Sponsorship
Self-promotion only — Tan promotes his own open-source gbrain quality-gate engine (github.com/garrytan/gbrain) and the companion gstack repo throughout the article. No third-party paid placement. Bias is the obvious YC/Tan one: the 10-step checklist is also the sales surface for the gbrain doctor tool. Treat the technical claims as load-bearing (they’re independently testable) and the “you should run gbrain” CTAs as marketing — same pattern as a vendor whitepaper.
The two-tweet load (article + quote-tweet)
Tweet 1 (2026-04-22 09:01 UTC, 217k impressions, 2,302 bookmarks): the long-form X article “How to really stop your agents from making the same mistakes”. Pulled in full via xmcp.
Tweet 2 (2026-04-22 15:55 UTC, 17k impressions, 204 bookmarks): quote-tweet of #1 — “this cycle has replaced 50% of my agentic coding… I do something with OpenClaw, then I say SKILLIFY IT.”
Tan’s central argument
LangChain raised $160M and shipped sophisticated testing primitives (LangSmith — trajectory evals, trace-to-dataset pipelines, LLM-as-judge, regression suites) but never told you what to test, in what order, or when you’re done. Pieces aren’t a practice. Most AI agent reliability is “vibes-based” — prompt tweaks, bigger system messages, “please don’t hallucinate” incantations that decay the moment context gets complex.
The fix is “skillify” — every failure becomes a permanent structural fix as a skill with tests that run every day, forever.
The two failures Tan documents
Failure 1 — Calendar lookup. Asked about an old business trip, the agent called the live calendar API → blocked. Tried email search → noisy. Tried the calendar API again → blocked. Five minutes later it finally searched the local knowledge base and found the answer instantly; it had been sitting in 3,146 indexed local calendar files the whole time. The agent did latent-space reasoning when a three-line script would’ve returned the answer.
Failure 2 — Timezone math. Agent said “next meeting in 28 minutes.” Reality: 88 minutes. Did UTC→PT conversion in its head, off by exactly an hour. A context-now.mjs script already existed that returns the right answer in 50ms. Agent just didn’t run it.
The shape of both bugs: deterministic work done in latent space. “The model was doing mental math when a script had the answer.”
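The article names context-now.mjs but never shows it. A minimal sketch of what such a deterministic time tool might look like (the Intl usage and the minutesUntil helper are assumptions, not Tan’s code):

```js
// context-now.mjs — hypothetical sketch; the article names the script but
// doesn't show its contents. The runtime does the UTC→PT math, so the answer
// can't drift the way latent-space arithmetic did.
export function minutesUntil(meetingIso, from = new Date()) {
  // Pure arithmetic: same inputs, same output, every time.
  return Math.round((new Date(meetingIso) - from) / 60_000);
}

const now = new Date();
const pt = new Intl.DateTimeFormat('en-US', {
  timeZone: 'America/Los_Angeles', // DST handling lives in Intl, not the model
  dateStyle: 'medium',
  timeStyle: 'long',
}).format(now);

console.log(JSON.stringify({ utc: now.toISOString(), pt }));
```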
The 10-step skillify checklist (the load-bearing artifact)
Every failure that gets promoted to a skill must clear all 10:
1. SKILL.md — the contract: name, triggers, rules
2. Deterministic code — scripts/*.mjs (no LLM for what code can do)
3. Unit tests — vitest
4. Integration tests — live endpoints
5. LLM evals — quality + correctness via LLM-as-judge
6. Resolver trigger — entry in AGENTS.md
7. Resolver eval — verify the trigger actually routes (both deterministic table presence AND LLM routing behavior)
8. Check-resolvable + DRY audit — find dark skills, find overlapping triggers
9. E2E smoke test
10. Brain filing rules — every skill that writes to the KB knows where things go
Tan’s exact language: “A feature that doesn’t pass all ten is not a skill. It’s just code that happens to work today.”
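Steps 2–3 in miniature: a hedged vitest sketch against the hypothetical minutesUntil helper above (Tan’s actual suites aren’t published in the article). The point is that the documented 60-minute drift becomes a permanent regression test:

```js
// context-now.test.mjs — hypothetical vitest suite for the sketch above.
import { describe, it, expect } from 'vitest';
import { minutesUntil } from './context-now.mjs';

describe('minutesUntil', () => {
  it('computes the gap arithmetically', () => {
    const from = new Date('2026-04-22T09:00:00Z');
    expect(minutesUntil('2026-04-22T10:28:00Z', from)).toBe(88);
  });

  it('does not drift across a UTC→PT conversion', () => {
    // 88 real minutes must never come back as 28 (the documented failure).
    const from = new Date('2026-04-22T16:00:00-07:00');
    expect(minutesUntil('2026-04-22T17:28:00-07:00', from)).toBe(88);
  });
});
```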
The latent vs deterministic distinction
This is the conceptual heart. Tan separates work into:
- Latent — requires judgment, model needed, varies on input
- Deterministic — same input, same output, every time, no model needed
The skill enforces using deterministic tools for deterministic work. The model’s intelligence is used to WRITE the deterministic script (latent → produces deterministic). Then the script is the constraint that prevents the model from being stupid on subsequent invocations.
“The agent used judgment (latent) to write calendar-recall.mjs. Now the skill forces the agent to run that script instead of reasoning about calendar data. The model’s intelligence created the constraint that prevents the model from being stupid.”
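What the deterministic half might look like: a hypothetical sketch of calendar-recall.mjs, assuming a flat directory of indexed text files (the article only says the answer sat in ~3,146 of them):

```js
// calendar-recall.mjs — hypothetical sketch of the deterministic lookup the
// agent should have run. Path and file format are assumptions.
import { readdir, readFile } from 'node:fs/promises';
import path from 'node:path';

export async function recall(query, dir = 'kb/calendar') {
  const needle = query.toLowerCase();
  const hits = [];
  for (const file of await readdir(dir)) {
    const text = await readFile(path.join(dir, file), 'utf8');
    if (text.toLowerCase().includes(needle)) hits.push(file);
  }
  return hits; // same files for the same query, every time
}
```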
Resolver evals — the layer most people miss
A resolver = routing table for context. When task type X appears, load skill Y. Tan has 50+ resolver test cases like:
```js
{ intent: 'check my signatures',     expectedSkill: 'executive-assistant' }
{ intent: 'who is Pedro Franceschi', expectedSkill: 'brain-ops' }
{ intent: 'what time is my meeting', expectedSkill: 'context-now' }
{ intent: 'find my 2016 trip',       expectedSkill: 'calendar-recall' }
```
Two failure modes the resolver eval catches:
- False negative: skill should fire but doesn’t (trigger description wrong/missing)
- False positive: wrong skill fires (two triggers overlap)
The eval covers both deterministic structural tests (does the AGENTS.md table contain the right mapping?) and LLM routing tests (given this intent, does the model actually pick the right skill?) — both layers matter.
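A hedged sketch of how those two layers could sit in one vitest file; parseResolverTable and routeIntent are assumed helpers, not GBrain APIs:

```js
// resolver.eval.test.mjs — hypothetical sketch of the two resolver-eval layers.
import { readFile } from 'node:fs/promises';
import { describe, it, expect } from 'vitest';
import { parseResolverTable } from './resolver.mjs'; // assumed: AGENTS.md → Map(trigger → skill)
import { routeIntent } from './llm.mjs';             // assumed: asks the model to pick a skill

const CASES = [
  { intent: 'what time is my meeting', expectedSkill: 'context-now' },
  { intent: 'find my 2016 trip', expectedSkill: 'calendar-recall' },
];

describe('resolver: deterministic layer', () => {
  it('AGENTS.md contains each mapping', async () => {
    const table = parseResolverTable(await readFile('AGENTS.md', 'utf8'));
    for (const { expectedSkill } of CASES) {
      expect([...table.values()]).toContain(expectedSkill); // false negative = missing entry
    }
  });
});

describe('resolver: LLM routing layer', () => {
  it.each(CASES)('routes "$intent" to $expectedSkill', async ({ intent, expectedSkill }) => {
    expect(await routeIntent(intent)).toBe(expectedSkill); // false positive = wrong skill wins
  });
});
```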
Check-resolvable + DRY audit (step 8)
Tan ran this against his own 40+ skills and found 6 unreachable skills (15% of capabilities dark) — scripts that did useful work but had no resolver path. A flight tracker nobody could invoke. A content-ideas generator only triggerable by cron. A citation fixer not in the resolver table.
Three things check-resolvable checks:
- Every SKILL.md has a corresponding resolver entry
- Every script referenced by a skill is actually callable
- No two skills have overlapping trigger descriptions
Plus DRY audit: detects skills that overlap in capability (4 calendar skills with zero overlap is fine; 5 calendar skills with overlapping triggers fails).
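A minimal sketch of the dark-skill half of check-resolvable, assuming a skills/*/SKILL.md layout and a plain-text AGENTS.md resolver table (GBrain’s real implementation is in the repo):

```js
// check-resolvable.mjs — hypothetical sketch of the step-8 audit.
import { readdir, readFile } from 'node:fs/promises';

const skills = await readdir('skills');            // every skill directory on disk
const agents = await readFile('AGENTS.md', 'utf8');

// Dark skills: a SKILL.md exists but no resolver row ever routes to it.
const dark = skills.filter((name) => !agents.includes(name));
if (dark.length) {
  console.error(`unreachable skills (${dark.length}/${skills.length}):`, dark);
  process.exit(1);
}
console.log('all skills resolvable');
```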
The “fucking shit” eval heuristic
The most honest eval-coverage signal Tan offers: “search your conversation history for when you said ‘fucking shit’ or ‘wtf.’ Those are the test cases you’re missing.”
GBrain — the open-source engine
Tan released two repos:
- gstack (github.com/garrytan/gstack) — Claude Code speedup
- gbrain (github.com/garrytan/gbrain) — knowledge engine + quality gates
gbrain doctor runs the 10-step checklist. gbrain doctor --fix auto-repairs DRY violations, replaces duplicated blocks with convention references, all guarded by git working-tree checks.
Hermes Agent comparison (the explicit competitive frame)
Hermes Agent (NousResearch) handles skill creation well: skill_manage tool for self-modification, progressive disclosure (load skill index, pull SKILL.md only when selected), bounded MEMORY.md (capped 2,200 chars), conditional activation (skills auto-hide when required tools missing).
Hermes does NOT do the verification layer: no unit tests, no resolver evals, no check-resolvable, no DRY audit, no daily health check. Tan’s claim: “You need both.” Hermes for creation, GBrain for verification.
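A sketch of the progressive-disclosure + conditional-activation idea as the article describes it; illustrative only, not Hermes code:

```js
// Keep only a cheap skill index in context; pull the full SKILL.md on selection.
import { readFile } from 'node:fs/promises';

const index = [
  { name: 'calendar-recall', triggers: 'past trips, old meetings', requires: ['fs'] },
  { name: 'flight-tracker',  triggers: 'live flight status',       requires: ['web'] },
];

export function visibleSkills(availableTools) {
  // Conditional activation: auto-hide skills whose required tools are missing.
  return index.filter((s) => s.requires.every((t) => availableTools.includes(t)));
}

export async function loadSkill(name) {
  return readFile(`skills/${name}/SKILL.md`, 'utf8'); // full contract only when selected
}
```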
Mapping against Ray Data Co
This is a checklist of concrete upgrades RDCO needs. Where we are vs where Tan is:
| Step | RDCO state | Tan state |
|---|---|---|
| 1. SKILL.md | ✅ ~22 skills | ✅ 40+ |
| 2. Deterministic code | ⚠️ partial (audit-newsletter-outputs.py is the model) | ✅ standard |
| 3. Unit tests | ❌ none | ✅ 179 tests / 5 suites |
| 4. Integration tests | ❌ none | ✅ live endpoints |
| 5. LLM evals | ❌ none | ✅ 35+ evals daily |
| 6. Resolver trigger | ❌ no AGENTS.md | ✅ resolver table |
| 7. Resolver eval | ❌ none | ✅ 50+ cases |
| 8. Check-resolvable + DRY audit | ❌ none | ✅ gbrain doctor |
| 9. E2E smoke test | ❌ none | ✅ standard |
| 10. Brain filing rules | ⚠️ partial (vault folder conventions exist but aren’t programmatically enforced) | ✅ enforced |
Two strategic implications:
- Adopt GBrain or build the equivalent. Tan has open-sourced the workout plan. The fastest path to the missing 8 steps of the checklist is to read his code and either (a) adopt it directly, (b) port the concepts to RDCO’s architecture, or (c) build our own with the same philosophy. Option (a) is meaningfully faster but creates a dependency on Tan’s project trajectory. A decision worth making explicitly.
- The just-shipped MAC framework + annotation layer is the data-quality parallel of Tan’s skill-quality framework. Both argue that the mechanical artifact is half the practice; the audit/verification/annotation layer that turns it into a learning system is the other half. MAC : data models :: GBrain : skills. This is the framing for any forthcoming Sanity Check piece on the meta-pattern.
Two tactical adoptions worth making this week:
- Skillify as a one-word verb. Adopt it in working vocabulary, MAC Masterclass copy, and blog content; it’s faster to invoke than “extract this into a skill.” Tan’s example: “hot damn it worked. can you remember this as a webhook skill and skillify it” → one message → SKILL.md + tests + resolver entry + DRY audit, all generated.
- “fucking shit” / “wtf” eval mining. Trivial automation: parse session transcripts in ~/.claude/projects/-Users-ray/*.jsonl for those phrases (and “ugh”, “no”, “wrong”); each hit → candidate test case for an /improve cycle. Sketch below.
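A sketch of that mining pass, assuming the local jsonl transcript schema (one JSON record per line with a type field):

```js
// frustration-mine.mjs — sketch of the eval-mining pass. The transcript
// schema is an assumption about the local format.
import { readdir, readFile } from 'node:fs/promises';
import path from 'node:path';

const DIR = path.join(process.env.HOME, '.claude/projects/-Users-ray');
// "no" is included per the note above, but it's noisy; tune as needed.
const MARKERS = /\b(fucking shit|wtf|ugh|no|wrong)\b/i;

for (const file of (await readdir(DIR)).filter((f) => f.endsWith('.jsonl'))) {
  const lines = (await readFile(path.join(DIR, file), 'utf8')).split('\n');
  lines.forEach((line, i) => {
    let rec;
    try { rec = JSON.parse(line); } catch { return; } // skip malformed lines
    if (rec.type === 'user' && MARKERS.test(JSON.stringify(rec))) {
      // Each hit is a candidate test case for an /improve cycle.
      console.log(`${file}:${i + 1}`);
    }
  });
}
```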
Limitations
- The article is a manifesto + how-to; Tan is selling GBrain. The “50% of my agentic coding” claim is unaudited self-report.
- We can’t independently verify the test counts (179 unit tests, 35 LLM evals daily, 50+ resolver evals) without running the GBrain repo.
- Tan is YC CEO with a heavily-promotional channel — directional signal is real, precise numbers are marketing copy.
Related
- 2026-04-11-garry-tan-thin-harness-fat-skills — the original thesis piece this workflow operationalizes
- ../04-tooling/xmr-charts/mrr-bridge-and-annotation-layer — the data-quality parallel (MAC : data :: GBrain : skills)
- 2026-04-15-thariq-claude-code-session-management-1m-context — Anthropic guidance on harness/skills architecture
- 2026-04-12-harness-thesis-dissent — counter-positions worth holding in tension
- ../06-reference/concepts/operational-definitions — the criterion+test+decision-rule structure underlying both skillify and MAC