bookshelf source material architecture gap

Wed Apr 29 2026 20:00:00 GMT-0400 (Eastern Daylight Time) · concept · source: founder iMessage thinking + Dami-Defi X post (trigger only) + Ray synthesis · by Ben Wilson (founder, gap identification); Ray (architecture proposal)

The bookshelf gap: RDCO has synthesis but no source-material retrieval

Why this exists

Founder shared a thin X post claiming “I fed 12 free MIT AI textbooks into Claude, it rebuilt my entire research system” and re-framed it as a meta-pattern question:

The LLMs are big, they have a lot of knowledge baked into them. You said you know Wheeler already, it was part of the training corpus, but that doesn’t mean you have perfect recall and won’t need to reference his books or ground the thinking in a citation. Then we have our personal knowledge base where we write down our takeaways from the content diet. That’s good for forming our own opinions on top of the work we do.

Do we have a good way of maintaining the searchable source material? We can download the file, but how do we find the right bits. It cannot be completely baked into the model and even our best synthesis into the knowledge base will be lossy. Where is our “bookshelf” if we need to go back to the source to rediscover the details for a specific situation?

This concept page captures the gap and proposes the architecture to close it.

The current state

RDCO has three retrieval layers today:

  1. Vault synthesis layer (~/rdco-vault/): our takeaways, frameworks, and assessment notes. A lossy compression of source material. 1,300+ docs.
  2. QMD semantic search — vector + lexical search over the vault. Searches our synthesis, NOT source material.
  3. YouTube transcripts (~/rdco-vault/06-reference/transcripts/) — full raw transcripts saved alongside assessment notes. Partial bookshelf — only YouTube content.

What we DON’T have:

The architecture proposal

Filesystem structure

~/rdco-vault/07-source-material/
├── books/
│   ├── wheeler-judgment-under-uncertainty/
│   │   ├── source.pdf
│   │   ├── extracted.txt
│   │   └── metadata.yaml
│   ├── tufte-visual-display-of-quantitative-information/
│   │   └── ...
├── articles/
│   ├── 2026-04-30-jonathan-siddharth-turing-superintelligence-loop/
│   │   ├── source.html
│   │   ├── extracted.md
│   │   └── metadata.yaml
├── transcripts/
│   └── (existing YouTube transcripts moved or symlinked here)
├── newsletter-bodies/
│   ├── stratechery/
│   │   └── 2026-04-30-amazon-earnings-trainium.md
│   └── seattledataguy/
│       └── ...
└── papers/
    └── arxiv-1706.03762-attention-is-all-you-need/
        ├── source.pdf
        ├── extracted.txt
        └── metadata.yaml
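
The top-level layout above could be scaffolded with a few lines of Python. This is a minimal sketch, not an existing RDCO script: the function name `scaffold_bookshelf` and the `SHELVES` constant are hypothetical, and only the directory names come from the tree.

```python
from pathlib import Path

# Shelf names copied from the tree above; everything else here is illustrative.
SHELVES = ("books", "articles", "transcripts", "newsletter-bodies", "papers")


def scaffold_bookshelf(vault_root: Path) -> Path:
    """Create 07-source-material/ and its shelves. Safe to re-run."""
    base = vault_root / "07-source-material"
    for shelf in SHELVES:
        # exist_ok=True makes the call idempotent, so nothing is overwritten.
        (base / shelf).mkdir(parents=True, exist_ok=True)
    return base
```

Calling `scaffold_bookshelf(Path.home() / "rdco-vault")` would create the empty shelves only; per-source directories are left to the ingest step.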

Per-source metadata.yaml carries title, author, source URL, date acquired, copyright disclaimer (private reference only, not for redistribution), and any extraction-quality notes (OCR vs native PDF text, missing pages, etc.).
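
One way the metadata record might look in code, as a sketch under stated assumptions: the field names (`source_url`, `date_acquired`, `copyright_notice`, `extraction_notes`) are hypothetical spellings of the fields listed above, and the YAML is rendered by hand with JSON-quoted strings (which are valid YAML scalars) to avoid assuming a YAML library is installed.

```python
import json
from dataclasses import dataclass, field


@dataclass
class SourceMetadata:
    # Field names are illustrative; the page only lists the concepts.
    title: str
    author: str
    source_url: str
    date_acquired: str  # ISO date string, e.g. "2026-04-30"
    copyright_notice: str = "Private reference only, not for redistribution."
    extraction_notes: list[str] = field(default_factory=list)


def _scalar(value: str) -> str:
    """Quote a string as JSON, which YAML accepts as a double-quoted scalar."""
    return json.dumps(value, ensure_ascii=False)


def to_yaml(meta: SourceMetadata) -> str:
    """Render the record as the text of a metadata.yaml file."""
    lines = [
        f"title: {_scalar(meta.title)}",
        f"author: {_scalar(meta.author)}",
        f"source_url: {_scalar(meta.source_url)}",
        f"date_acquired: {_scalar(meta.date_acquired)}",
        f"copyright_notice: {_scalar(meta.copyright_notice)}",
    ]
    if meta.extraction_notes:
        lines.append("extraction_notes:")
        lines += [f"  - {_scalar(note)}" for note in meta.extraction_notes]
    else:
        lines.append("extraction_notes: []")
    return "\n".join(lines) + "\n"
```

Defaulting the copyright notice keeps the private-reference disclaimer on every record unless someone deliberately overrides it.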

QMD indexing

A SECOND QMD collection — source-material — distinct from the existing rdco-vault collection. Reason for separation:

The existing mcp__qmd__query tool already supports a collection parameter; source retrieval passes collection: "source-material".

Skills

/save-to-bookshelf <url-or-path> — new skill. Workflow:

  1. Detect input type (URL, local PDF, local text file)
  2. Fetch / read source
  3. Extract text (pdftotext for PDFs, pandoc for HTML, direct read for txt/md)
  4. Compute content hash for deduplication
  5. Generate slug + create directory under 07-source-material/<type>/<slug>/
  6. Save raw source + extracted text + metadata.yaml
  7. Trigger QMD ingest into source-material collection
  8. Return: path + slug + page-count or word-count
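
Steps 1, 4, and 5 of the workflow above are pure functions and can be sketched without touching the network or QMD. Everything named here (`detect_input_type`, `content_hash`, `slugify`, `target_dir`) is hypothetical; SHA-256 is an assumed choice of hash, and fetching, text extraction, and QMD ingest are deliberately left out.

```python
import hashlib
import re
from pathlib import Path
from urllib.parse import urlparse


def detect_input_type(source: str) -> str:
    """Step 1: classify the input as 'url', 'pdf', or 'text'."""
    if urlparse(source).scheme in ("http", "https"):
        return "url"
    suffix = Path(source).suffix.lower()
    if suffix == ".pdf":
        return "pdf"
    if suffix in (".txt", ".md"):
        return "text"
    raise ValueError(f"unsupported input: {source}")


def content_hash(data: bytes) -> str:
    """Step 4: hash the raw bytes so a re-saved source can be detected."""
    return hashlib.sha256(data).hexdigest()


def slugify(title: str) -> str:
    """Step 5: lowercase, then collapse non-alphanumeric runs into hyphens."""
    return re.sub(r"[^a-z0-9]+", "-", title.lower()).strip("-")


def target_dir(vault_root: Path, shelf: str, slug: str) -> Path:
    """Step 5: the directory under 07-source-material/<type>/<slug>/."""
    return vault_root / "07-source-material" / shelf / slug
```

Hashing the raw bytes rather than the extracted text means two extractions of the same PDF deduplicate, while a re-downloaded HTML page with changed markup would not; which behavior is wanted is an open design choice.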

/cite-from-bookshelf <claim-or-query> — new skill (or extension to existing query patterns). Workflow:

  1. Query the source-material QMD collection with the claim/query
  2. Return the top 3-5 passages, each with source title, page or timestamp, source URL, and an exact quote (≤15 words per the copyright rule, with a longer paraphrase for context)
  3. Format each result so it can be embedded as a citation in a vault note or external content
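
The quote cap in step 2 is easy to enforce mechanically. The sketch below assumes a passage has already come back from the source-material collection; the `Passage` shape, `short_quote`, and `format_citation` are hypothetical names, and the output format is one plausible layout rather than an agreed convention.

```python
from dataclasses import dataclass

MAX_QUOTE_WORDS = 15  # the copyright rule stated in step 2


@dataclass
class Passage:
    # Hypothetical result shape; QMD's real response format is not assumed.
    title: str
    locator: str  # page number or timestamp
    source_url: str
    text: str


def short_quote(text: str, limit: int = MAX_QUOTE_WORDS) -> str:
    """Keep at most `limit` leading words; mark truncation with ' ...'."""
    words = text.split()
    if len(words) <= limit:
        return " ".join(words)
    return " ".join(words[:limit]) + " ..."


def format_citation(p: Passage) -> str:
    """Render one passage as an embeddable citation line."""
    return f'"{short_quote(p.text)}" ({p.title}, {p.locator}, {p.source_url})'
```

The longer paraphrase for context is left to the caller, since it requires judgment that a truncation function cannot supply.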

Existing skills to extend:

Retrieval pattern

When Ray makes a strong claim that benefits from grounding:

  1. Query source-material collection first via /cite-from-bookshelf
  2. If a relevant passage exists, embed as <cite source="..."> with quote + page
  3. If no passage exists, mark the claim as “uncited from training memory; consider adding source to bookshelf if this becomes load-bearing”
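
The three-step discipline above reduces to a single branch. In this sketch the `lookup` callable stands in for /cite-from-bookshelf and is injected so the logic can run without QMD; the dict keys (`source`, `quote`, `page`) and the function name `ground_claim` are assumptions for illustration.

```python
import html
from typing import Callable, Optional

UNCITED_NOTE = (
    "uncited from training memory; consider adding source to bookshelf "
    "if this becomes load-bearing"
)


def ground_claim(claim: str, lookup: Callable[[str], Optional[dict]]) -> str:
    """Attach a <cite> when the bookshelf has a passage, else flag the claim."""
    hit = lookup(claim)
    if hit is None:
        # Step 3: no passage found, so mark the claim as uncited.
        return f"{claim} [{UNCITED_NOTE}]"
    # Step 2: embed the quote and page; escape the attribute value.
    source = html.escape(hit["source"], quote=True)
    return f'{claim} <cite source="{source}">"{hit["quote"]}" (p. {hit["page"]})</cite>'
```

Passing the lookup in as a parameter keeps the cite-or-flag rule testable on its own, independent of how retrieval is eventually wired up.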

This is the “bookshelf” the founder asked about — not just storage, but a retrieval discipline that closes the loop between source and synthesis.

Connection to quality-gate-as-brain (morning’s re-frame)

From the 2026-04-30-quality-gate-as-brain-org-boundaries-agentic-companies concept page: in an agent-native company architecture, the quality gate is the brain. Tools, sensors, policy, and learning all serve it.

The bookshelf is the canonical source-material instrumentation for the quality gate. Without it:

This isn’t a nice-to-have. It’s a Layer 1 (sensors+data) gap that has been partially addressed (YouTube transcripts) but not systematized.

Cost estimate

Open curation questions for founder

These are founder-judgment calls, not Ray-execution items:

  1. Domains worth canonical-shelving:

    • Data engineering / data quality (MAC’s parent discipline)
    • Decision theory + judgment under uncertainty (Wheeler, Kahneman, Tetlock — informs MAC + Sanity Check)
    • Systems thinking (Meadows, Senge)
    • ML / AI fundamentals (MIT OCW textbooks, Goodfellow/Bengio Deep Learning)
    • Accounting / finance (RDCO operating discipline + future client-reporting offering)
    • Business strategy + competitive moats (Porter, Moore, Christensen — Sanity Check editorial source)
    • Operating playbooks (Mitohealth-style company-design content, Garry Tan / YC content)
  2. Specific titles per domain: founder picks. Ray can identify candidates per domain on request, but the “is this canon for RDCO” call is founder-only.

  3. Backfill order: start with the domains most active in the current SC editorial pipeline (likely data engineering + decision theory + agent-native company design), then expand from there.

  4. Paywalled / pirated handling: Bookshelf is for legal personal-use copies (free OCW PDFs, purchased Kindle/PDF, public domain, fair-use article archives). Ray will refuse to download / scrape paywalled content. Founder handles his own purchases; Ray ingests once acquired.

What to NOT do

Recommendation

Scaffold the architecture this afternoon (low-risk, reversible). Surface the curation decisions to founder as a separate decision-needed item — what domains, what titles, what order. Don’t pre-fill the bookshelf without founder direction.