The bookshelf gap — RDCO has synthesis, doesn’t have source-material retrieval
Why this exists
Founder shared a thin X post claiming “I fed 12 free MIT AI textbooks into Claude, it rebuilt my entire research system” and re-framed it as a meta-pattern question:
The LLMs are big, they have a lot of knowledge baked into them. You said you know Wheeler already, it was part of the training corpus, but that doesn’t mean you have perfect recall and won’t need to reference his books or ground the thinking in a citation. Then we have our personal knowledge base where we write down our takeaways from the content diet. That’s good for forming our own opinions on top of the work we do.
Do we have a good way of maintaining searchable source material? We can download the file, but how do we find the right bits? It cannot be completely baked into the model, and even our best synthesis into the knowledge base will be lossy. Where is our “bookshelf” if we need to go back to the source to rediscover the details for a specific situation?
This concept page captures the gap and proposes the architecture to close it.
The current state
RDCO has three retrieval layers today:
- Vault synthesis layer (~/rdco-vault/) — our takeaways, frameworks, assessment notes. Lossy compression of source material. ~1300+ docs.
- QMD semantic search — vector + lexical search over the vault. Searches our synthesis, NOT source material.
- YouTube transcripts (~/rdco-vault/06-reference/transcripts/) — full raw transcripts saved alongside assessment notes. Partial bookshelf — only YouTube content.
What we DON’T have:
- Books / PDFs — no canonical home. Some live ad-hoc in /tmp, ~/Downloads, or scattered in the vault.
- Articles — we file assessment notes (paraphrase) but rarely the full text. Re-finding a specific passage means re-fetching the article (often paywalled or 404’d by then).
- Newsletter bodies — paraphrased into assessment notes, raw bodies discarded after Gmail fetch.
- Web archives — WebFetch is ad-hoc, no persistence, can’t be re-queried later.
- Source-material semantic search — no separate QMD collection for source corpus, so passage-level retrieval over books/articles is impossible.
The architecture proposal
Filesystem structure
~/rdco-vault/07-source-material/
├── books/
│ ├── wheeler-judgment-under-uncertainty/
│ │ ├── source.pdf
│ │ ├── extracted.txt
│ │ └── metadata.yaml
│ ├── tufte-visual-display-of-quantitative-information/
│ │ └── ...
├── articles/
│ ├── 2026-04-30-jonathan-siddharth-turing-superintelligence-loop/
│ │ ├── source.html
│ │ ├── extracted.md
│ │ └── metadata.yaml
├── transcripts/
│ └── (existing YouTube transcripts moved or symlinked here)
├── newsletter-bodies/
│ ├── stratechery/
│ │ └── 2026-04-30-amazon-earnings-trainium.md
│ └── seattledataguy/
│ └── ...
└── papers/
    └── arxiv-1706.03762-attention-is-all-you-need/
├── source.pdf
├── extracted.txt
└── metadata.yaml
Per-source metadata.yaml carries title, author, source URL, date acquired, copyright disclaimer (private reference only, not for redistribution), and any extraction-quality notes (OCR vs native PDF text, missing pages, etc.).
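A minimal sketch of what a per-source metadata.yaml could look like — field names here are illustrative, not a fixed schema; only the fields listed above (title, author, source URL, date acquired, copyright disclaimer, extraction-quality notes) are required by this proposal:

```yaml
title: Judgment under Uncertainty
author: Wheeler
source_url: <acquisition URL>        # where the file was obtained
date_acquired: 2026-04-30
copyright: private reference only, not for redistribution
extraction:
  method: pdftotext                  # or OCR for scanned pages
  quality_notes: native PDF text, no missing pages
```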
QMD indexing
A SECOND QMD collection — source-material — distinct from the existing rdco-vault collection. Reason for separation:
- Source material is dramatically larger (textbooks alone could be millions of tokens)
- Mixing collections dilutes search relevance for synthesis queries (the more common case)
- Source-material queries are intentional (“ground this claim”) vs vault-synthesis queries (“what do we already know about this topic”)
The existing mcp__qmd__query tool already supports a collection parameter — pass collection: "source-material" for source retrieval.
Skills
/save-to-bookshelf <url-or-path> — new skill. Workflow:
- Detect input type (URL, local PDF, local text file)
- Fetch / read source
- Extract text (pdftotext for PDFs, pandoc for HTML, direct read for txt/md)
- Compute content hash for deduplication
- Generate slug + create directory under 07-source-material/<type>/<slug>/
- Save raw source + extracted text + metadata.yaml
- Trigger QMD ingest into the source-material collection
- Return: path + slug + page-count or word-count
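The mechanical steps of this workflow (hashing, slug generation, directory layout) could be sketched as follows — a minimal sketch, with fetching and QMD ingest left out and all function names illustrative:

```python
import hashlib
import re
from pathlib import Path

# Assumed bookshelf root, matching the proposed filesystem structure.
BOOKSHELF = Path.home() / "rdco-vault" / "07-source-material"

def content_hash(data: bytes) -> str:
    """SHA-256 of the raw source, used to skip duplicate ingests."""
    return hashlib.sha256(data).hexdigest()

def slugify(title: str) -> str:
    """Lowercase, hyphen-separated slug for the per-source directory."""
    return re.sub(r"[^a-z0-9]+", "-", title.lower()).strip("-")

def save_to_bookshelf(kind: str, title: str, raw: bytes, extracted: str) -> Path:
    """Create 07-source-material/<type>/<slug>/ and save raw + extracted text."""
    dest = BOOKSHELF / kind / slugify(title)
    dest.mkdir(parents=True, exist_ok=True)
    (dest / "source.bin").write_bytes(raw)      # source.pdf / source.html in practice
    (dest / "extracted.txt").write_text(extracted)
    (dest / "metadata.yaml").write_text(
        f"title: {title}\ncontent_hash: {content_hash(raw)}\n"
    )
    return dest
```

The content hash makes dedup cheap: re-saving the same PDF under a different filename lands on the same hash and can be skipped before any QMD ingest happens.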
/cite-from-bookshelf <claim-or-query> — new skill (or extension to existing query patterns). Workflow:
- Query the source-material QMD collection with the claim/query
- Return top 3-5 passages with source title, page/timestamp, source URL, and exact quote (≤15 words per copyright rule, longer paraphrase for context)
- Format ready to embed as a citation in a vault note or external content
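The ≤15-word exact-quote rule in the return format could be enforced with a small helper — an illustrative sketch, not part of any existing skill, and the citation layout is an assumption:

```python
def clip_quote(text: str, max_words: int = 15) -> str:
    """Enforce the <=15-word exact-quote rule; longer passages get paraphrased separately."""
    words = text.split()
    if len(words) <= max_words:
        return text
    return " ".join(words[:max_words]) + " …"

def format_citation(title: str, page: int, url: str, passage: str) -> str:
    """Render a passage as a citation line ready to embed in a vault note."""
    return f'"{clip_quote(passage)}" ({title}, p. {page}, {url})'
```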
Existing skills to extend:
- /process-newsletter — after writing the assessment note, also save the raw body to 07-source-material/newsletter-bodies/<sender>/
- /process-youtube — already saves transcripts; just move/symlink them to 07-source-material/transcripts/
- /process-inbox — when filing source-material-class items (book PDFs, full articles), route them to the bookshelf instead of just 06-reference/
Retrieval pattern
When Ray makes a strong claim that benefits from grounding:
- Query the source-material collection first via /cite-from-bookshelf
- If a relevant passage exists, embed it as <cite source="..."> with quote + page
- If no passage exists, mark the claim as “uncited from training memory; consider adding source to bookshelf if this becomes load-bearing”
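The three-step pattern above could be sketched as a small dispatcher — here qmd_query is a stand-in for the real mcp__qmd__query call, and the passage shape ({"url", "quote", "page"}) is an assumption:

```python
UNCITED_MARK = ("uncited from training memory; consider adding source "
                "to bookshelf if this becomes load-bearing")

def ground_claim(claim: str, qmd_query) -> str:
    """Try the source-material collection first; fall back to an explicit uncited marker."""
    passages = qmd_query(claim, collection="source-material")
    if passages:
        best = passages[0]  # assumed shape: {"url": ..., "quote": ..., "page": ...}
        url, quote, page = best["url"], best["quote"], best["page"]
        return f'<cite source="{url}">{quote} (p. {page})</cite>'
    return f"{claim} [{UNCITED_MARK}]"
```

The point of the explicit marker is that an ungrounded claim stays visibly ungrounded instead of silently passing as cited.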
This is the “bookshelf” the founder asked about — not just storage, but a retrieval discipline that closes the loop between source and synthesis.
Connection to quality-gate-as-brain (morning’s re-frame)
From the 2026-04-30-quality-gate-as-brain-org-boundaries-agentic-companies concept page: in an agent-native company architecture, the quality gate is the brain. Tools, sensors, policy, and learning all serve it.
The bookshelf is the canonical source-material instrumentation for the quality gate. Without it:
- Ray’s strong claims are ungrounded (“Wheeler argues X” — but where? what page?)
- Founder can’t sanity-check Ray’s recall against the original
- The eval/quality gate has no canonical input it can verify against
- Synthesis-layer-only architecture is structurally lossy
This isn’t a nice-to-have. It’s a Layer 1 (sensors+data) gap that has been partially addressed (YouTube transcripts) but not systematized.
Cost estimate
- Build the architecture: ~2-3 hours of work (folder scaffold, /save-to-bookshelf skill, QMD collection setup, two skill extensions)
- Backfill curation: the harder part. Picking the canonical 50-100 source texts that should populate the starter bookshelf is a founder-judgment call, not a Ray-execution call.
- Storage: trivial. 50-100 books + a few thousand articles = ~10GB. Mac mini has it.
- QMD ingest: existing pipeline; just point at the new collection.
Open curation questions for founder
These are not Ray-execution; they’re founder-judgment:
- Domains worth canonical-shelving:
  - Data engineering / data quality (MAC’s parent discipline)
  - Decision theory + judgment under uncertainty (Wheeler, Kahneman, Tetlock — informs MAC + Sanity Check)
  - Systems thinking (Meadows, Senge)
  - ML / AI fundamentals (MIT OCW textbooks, Goodfellow/Bengio Deep Learning)
  - Accounting / finance (RDCO operating discipline + future client-reporting offering)
  - Business strategy + competitive moats (Porter, Moore, Christensen — Sanity Check editorial source)
  - Operating playbooks (Mitohealth-style company-design content, Garry Tan / YC content)
- Specific titles per domain: founder picks. Ray can identify candidates per domain on request, but the “is this canon for RDCO” call is founder-only.
- Backfill order: start with the domains most active in the current SC editorial pipeline (likely data engineering + decision theory + agent-native company design), expand from there.
- Paywalled / pirated handling: the bookshelf is for legal personal-use copies (free OCW PDFs, purchased Kindle/PDF, public domain, fair-use article archives). Ray will refuse to download / scrape paywalled content. Founder handles his own purchases; Ray ingests once acquired.
What to NOT do
- Don’t try to ingest “everything” — over-broad bookshelf dilutes retrieval quality. Curate ruthlessly.
- Don’t redistribute source material publicly. Bookshelf is private reference, not a content product.
- Don’t skip the metadata layer. A book without a source URL + acquisition date can’t support a clean citation chain.
- Don’t merge the source-material and rdco-vault QMD collections. The whole point is that synthesis search and source search are different queries with different relevance models.
Recommendation
Scaffold the architecture this afternoon (low-risk, reversible). Surface the curation decisions to founder as a separate decision-needed item — what domains, what titles, what order. Don’t pre-fill the bookshelf without founder direction.
Related
- 2026-04-30-quality-gate-as-brain-org-boundaries-agentic-companies — bookshelf is the instrumentation layer the quality gate reads from
- 2026-04-30-mitohealth-founder-5-layer-agent-native-company-loop — Layer 1 (sensors+data) is the bookshelf’s home
- ../01-projects/data-quality-framework — MAC framework grounding sources should live in the bookshelf when scaffolded
- ../.claude/skills/process-youtube/SKILL.md — existing partial bookshelf precedent (YouTube transcripts saved raw)
- ../.claude/skills/process-newsletter/SKILL.md — extension target (save newsletter bodies to bookshelf)
- Notion Research Backlog: queue “bookshelf curation — domain + title backfill plan” as decision item if founder green-lights the scaffold