Goal

Ingest email newsletters from a whitelisted set of senders into the vault as structured assessment notes, with bias and sponsor flagging. Separate from Substack-specific tooling — this works on any newsletter that lands in ben@raydata.co regardless of publisher (Substack, Ghost, Mailchimp, self-hosted).

Companion skill: ~/.claude/skills/process-newsletter/SKILL.md.

Status

Prototype complete as of 2026-04-11 using SeattleDataGuy’s 9-email history as the proving ground (8 backfilled, 1 was pre-existing).
Skill file drafted. Whitelist locked in.
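Remaining K senders are not fully backfilled: Not Boring and Practical Data Modeling are partially filed, Stratechery has batch tasks created, and the rest are pending founder go-ahead for the next batch.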

Whitelist — locked in 2026-04-11

K = keep, backfill history + ongoing watch

Sender	Newsletter	Typical format	Notes
`email@stratechery.com`	Stratechery (Ben Thompson)	thought-leadership	60 actual in inbox (2026-04-12 discovery; 3 batch tasks created). Original “201+ in history” estimate was archive-based, NOT inbox-bounded.
`seattledataguy@substack.com`	SeattleDataGuy (Ben Rogojan)	hybrid	✅ backfill complete (8 articles + 1 pre-existing)
`notboring@substack.com`	Not Boring (Packy McCormick)	hybrid (long-form + Friday optimism curation)	28 actual in inbox (2026-04-30 discovery; 1 batch task created, 26 already filed). Original “201+ in history” estimate was archive-based.
`practicaldatamodeling@substack.com`	Practical Data Modeling (Joe Reis)	thought-leadership, series-based	32 actual in inbox (2026-04-30 discovery; 1 new batch task created, 23 already filed). Original “~20+” was closer but still off. ⚠️ paid sub lapsed 2026-04-24 — flagged to founder for resubscribe decision.
`analyticsengineeringroundup@substack.com`	Analytics Engineering Roundup	curation	~20+ in history (estimate — not yet discovery-scanned; assume lower per inbox-bounded pattern below)
`hello@every.to`	Every	multi-author thought-leadership	201+ in history (estimate — not yet discovery-scanned; assume MUCH lower per inbox-bounded pattern below)
`hello@ship30for30.com`	Ship30for30 (Start Writing Online)	writing craft / marketing	30+/180d
`writewithai@substack.com`	Write With AI	writing with AI tools	8+/180d
`michaeldean9@substack.com`	Essay Architecture (Michael Dean)	essay writing craft	curation-heavy
`ark@arkinvest.com`	ARK Invest (Cathie Wood)	investment commentary	weekly stock commentary
`newsletter@commoncog.com`	Commoncog (Cedric Chin)	thought-leadership, series-based	~201+ in history estimate — actual is inbox-bounded (founder re-subscribed 2026-04-15, so inbox has from-that-date forward). Operator’s field manual; tacit knowledge, expertise, sensemaking. Highly relevant to RDCO agent-deployer positioning.

⚠️ Count-before-budget rule (added 2026-04-30)

The original “N+ in history” estimates in the K table were Substack-archive-based, NOT Gmail-inbox-bounded. Real inbox counts for the high-volume senders have come back substantially smaller (Stratechery: 60 vs 201+ estimate; Not Boring: 28 vs 201+ estimate). PDM was the exception, coming in above its estimate (32 vs ~20+). The gap arises because Gmail only holds messages from the subscription date forward, not the sender’s full archive.

Rule for remaining un-scanned senders (Every, Commoncog, Analytics Engineering Roundup, ARK Invest, Write With AI, Ship30for30, Essay Architecture): run discovery with --dry-run semantics first — count messages in inbox BEFORE planning batch sizes. If count is small (<20), skip the batch-task overhead and process inline via Mode 3 (Backfill, legacy small-sender path). Don’t allocate 10+ batch tasks for a sender that only has 25 messages in the inbox.

Implication: the total backfill work is significantly smaller than the README originally implied. Prioritize the per-sender deep-fetch quality (sponsor detection, RDCO mapping discipline) over volume planning.
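The count-before-budget rule can be sketched as a two-step check: page through the inbox to get a real count, then pick the processing path. This is a minimal sketch, not the skill's actual implementation. The names `count_messages`, `plan_backfill`, and `INLINE_THRESHOLD` are hypothetical, and the `list_page` callable is assumed to return a dict shaped like a Gmail `users.messages.list` response (`messages`, `nextPageToken`).

```python
from typing import Callable, Optional

# "<20" comes from the rule above; the constant name is hypothetical.
INLINE_THRESHOLD = 20


def count_messages(list_page: Callable[[Optional[str]], dict]) -> int:
    """Count messages by walking every page rather than trusting an estimate.

    list_page takes a page token (None for the first page) and returns a
    dict with an optional "messages" list and an optional "nextPageToken".
    """
    total = 0
    page_token = None
    while True:
        page = list_page(page_token)
        total += len(page.get("messages", []))
        page_token = page.get("nextPageToken")
        if not page_token:
            return total


def plan_backfill(count: int) -> str:
    """Return "inline" for small senders (Mode 3), "batch" otherwise."""
    return "inline" if count < INLINE_THRESHOLD else "batch"


# With google-api-python-client (an assumption about the client in use),
# list_page could be:
#   lambda tok: service.users().messages().list(
#       userId="me", q="from:hello@every.to", pageToken=tok).execute()
```

Passing the page fetcher in as a callable keeps the counting logic testable without network access, and makes the dry-run property explicit: nothing is read beyond message IDs.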

F = follow-forward only, no backfill (watch from now onward)

Sender	Newsletter	Typical format
`theinnermostloop@substack.com`	Innermost Loop (Alex Wissner-Gross)	thought-leadership
`dataengineeringcentral@substack.com`	Data Engineering Central	thought-leadership
`dataengineeringweekly@substack.com`	Data Engineering Weekly (Ananth Packkildurai)	curation
`technically@substack.com`	Technically	thought-leadership
`news@alphasignal.ai`	AlphaSignal	curation (AI/ML news)
`lon@dataelixir.com`	Data Elixir	curation (data science/ML news) — founder will resubscribe to ben@raydata.co; currently hits personal inbox
`semistructured@substack.com`	Semi-Structured (Jonathan Natkins)	thought-leadership (data infrastructure for AI agents) — added 2026-04-12

Known sender-specific gotchas (learned from SDG backfill)

SDG — SeattleDataGuy (fully backfilled)

Sponsor pattern: Estuary (disclosed adviser relationship) shows up in the top-of-article preamble ~60% of the time. SDG also switches to explicit self-consulting CTAs (“today’s article is sponsored by me, the Seattle Data Guy!”), sometimes at the top, sometimes mid-article, sometimes at the bottom. The skill should flag both forms and not key detection off placement.
Curation section (“Articles Worth Reading”) is roughly 50/50 self-cross-promo vs third-party. He frequently links his own prior posts in the curation slot. The skill should label links by comparing the linked domain/author to the newsletter sender.
Series structure: SDG runs multi-month series (2026 “data pipelines” series, currently 6 articles in). Later articles reference earlier vocabulary. Backfill order matters — go oldest-first so cross-references make sense.
Guest posts exist — Olga Berezovsky hosted on Jan 23. Guest posts come with cross-promotion for the guest’s own newsletter in the top slot (different from the Estuary pattern).
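The link-labeling idea for the curation section (compare the linked domain to the newsletter sender) might look like the sketch below. It covers only the domain comparison, not author matching. The function name `label_link`, the label strings, and the example domain are illustrative assumptions, not values taken from the skill file.

```python
from urllib.parse import urlparse


def label_link(url: str, own_domains: set[str]) -> str:
    """Label a curation link as self-cross-promo or third-party by hostname."""
    host = (urlparse(url).hostname or "").lower()
    if host.startswith("www."):
        host = host[4:]
    for domain in own_domains:
        # Match the domain itself or any subdomain of it.
        if host == domain or host.endswith("." + domain):
            return "self-cross-promo"
    return "third-party"


# Hypothetical domain set for SDG; the real publication domains would
# need to be confirmed before use.
sdg_domains = {"seattledataguy.substack.com"}
print(label_link("https://seattledataguy.substack.com/p/some-post", sdg_domains))
print(label_link("https://example.com/article", sdg_domains))
```

Links in newsletter emails are often wrapped in tracking redirects, so a real version would probably need to resolve the final URL before comparing hostnames.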

PDM — Practical Data Modeling (Joe Reis) — partial backfill, paid sub lapsed 2026-04-24

Substack expiry sequence (3-email cadence): “ending in a week” → “ending tomorrow” → “subscription ended.” All three are pure billing/admin emails with zero extractable content. Skip on subject-line pattern alone in Mode 4 watch — don’t read the body. The 2026-04-30 PDM batch [1-5] confirmed this template.
Pulse-survey launch vs results mislabel risk: A “New pulse survey just dropped” email is the LAUNCH (CTA to take the survey), NOT the results. Joe Reis’s pattern is to present findings at a keynote (e.g., Stockholm May 7 2026) and publish a results-with-dataset post afterward. When discovery pre-triages a “pulse survey” subject, default the assumption to “launch CTA” and look for the follow-up results post 2-4 weeks later. The results post is the high-signal artifact; the launch is borderline-skip with one extractable data point at most.
Series structure: Long-running MMA chapter series (Ch 1-16 already filed, Feb-Apr 2026). Cross-references rely on chapter order — process oldest-first when backfilling.
Paid-tier risk on results posts: Reis ships results-with-dataset posts as paid-tier content. Watch the window after the May 7 Stockholm keynote: if plain_text returns a paywall stub, the founder needs to resubscribe (lapsed 2026-04-24) before that batch can process the substantive body.
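The subject-line skip for the Substack expiry sequence could be sketched as a single case-insensitive pattern over the three quoted phrases. The exact subject lines are not recorded here, so the regex matches only those phrases; `is_billing_admin` is a hypothetical helper name.

```python
import re

# The three phrases come from the expiry sequence described above.
EXPIRY_SUBJECT = re.compile(
    r"ending in a week|ending tomorrow|subscription ended",
    re.IGNORECASE,
)


def is_billing_admin(subject: str) -> bool:
    """True if the subject looks like a Substack expiry notice (skip without reading the body)."""
    return EXPIRY_SUBJECT.search(subject) is not None
```

Because an essay title could in principle contain one of these phrases, a cautious version would apply this check only to whitelisted Substack senders and log each skip so a false positive is recoverable.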

Not Boring (Packy McCormick) — partial backfill, ongoing watch

Hybrid cadence: Mon long-form essay + Fri “Weekly Dose of Optimism” curation. Both formats live in the same inbox; classify per-issue, not per-sender.
Co-written essays = founder pitches. ~4 of 11 historical long-form essays follow the “Packy hands the keyboard to a portfolio-adjacent founder” pattern (Venezuela, Great Blue Frontier, etc.). Treat the company list and “what’s missing in the platform map” framing as the co-author’s go-to-market collateral, not neutral analysis. The underlying framework is usually still useful; the company endorsements are not. Frontmatter should byline both authors AND surface the structural bias in a Sponsorship section even when no explicit sponsor block exists. The Apr 23 2026 Great Blue Frontier essay (Will O’Brien / Ulysses Maritime) is the canonical exemplar — Packy explicitly disclaimed Not Boring Capital is NOT an investor, but the entire essay is a Series-A-adjacent positioning piece.
Two layers of sponsor on co-written essays: (a) explicit paid sponsor block (Framer, SVB, etc.) — clean disclosure, low bias; (b) structural sponsor — the co-author’s company. Always disclose both, separately.
WDoO portfolio-disclosure ambiguity: Packy doesn’t always disclose his angel positions in the curation slot. Absence of “Not Boring Capital is an investor” caveat ≠ no relationship. When a curation item is enthusiastic about a specific company (Apr 24 2026 Medra was the trigger), do a quick public-investor-list verification before treating the framing as neutral. Note the verification result in the assessment (“verified Not Boring Capital not in public Series A list” or “could not verify, treat as messenger”).
Sister-author cross-promo in WDoO: Kevin Kwok essays appear in the curation slot (“first Kevin Kwok essay in nearly a year”) — friendly-network amplification, not strict third-party. Flag but don’t disqualify; the analysis is usually substantive.
Series structure in WDoO: themes recur week-to-week (geothermal Quaise → Fervo S-1; AI-for-bio bottleneck → Medra throughput answer). Cross-link adjacent WDoO entries.
Inbox vs archive count gap: Original “201+ in history” estimate was Substack-archive-based. Real Gmail inbox has ~28 messages from-subscription-date forward. Confirmed pattern across multiple senders.
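The two-layer sponsor disclosure for co-written essays can be modeled as a small record that always emits both layers, even when one is empty. This is a sketch under assumptions: the `Sponsorship` class, its field names, and the output format are hypothetical, and the actual frontmatter schema is defined in the skill file.

```python
from dataclasses import dataclass, field


@dataclass
class Sponsorship:
    # (a) explicit paid sponsor blocks, e.g. Framer or SVB
    explicit: list[str] = field(default_factory=list)
    # (b) structural sponsor: the co-author's company
    structural: list[str] = field(default_factory=list)

    def summary(self) -> str:
        """Render both layers separately; an empty layer is stated, not omitted."""
        return "\n".join([
            "explicit: " + (", ".join(self.explicit) or "none"),
            "structural: " + (", ".join(self.structural) or "none"),
        ])


# Illustration only: the explicit layer is left empty here, which is not
# a claim about what the actual issue contained.
print(Sponsorship(structural=["Ulysses Maritime"]).summary())
```

Rendering "none" for an empty layer keeps the structural bias visible in the note even when no paid sponsor block exists.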

Patterns to watch for in other senders (not yet backfilled)

Stratechery: known to have “This Week in Stratechery” weekly digest + interview-format issues + daily updates. Likely a mix of formats per sender. Needs format-detection per-issue, not per-sender.
Every: multi-author publication, so each issue may have a different byline. Attribute each issue to its own author rather than to the publication.
ARK Invest: pure investment commentary. Track for bias since it’s a fund promoting its own positions.
Ship30for30 / Essay Architecture: writing craft — useful meta, but heavy on marketing-style CTAs. Sponsor detection will be noisy.
Data Elixir: curation-only once it arrives. Watch for bias toward the tools/platforms Lon is sponsored by.

Skill design — key decisions

Skill over cron for now. The /process-newsletter watch invocation is user-triggered, not scheduled, until we have confidence it doesn’t burn context or generate noise.
Always-flag, never-filter. Sponsors and bias get disclosed in frontmatter and body, never used to skip an article. The reader needs to see the angle.
Deep-fetch cautiously. Max 2 link follows per curation issue, only for third-party links clearly relevant to RDCO topics. No paywall traversal.
Vault-path discipline. All notes under 06-reference/<YYYY-MM-DD>-<sender-slug>-<topic-slug>.md. No exceptions. The filename convention is what lets the skill detect “already-filed” without duplicating work.
Tracked authors feed Task #4. When a guest post or curation link surfaces a new author worth following (Dylan Anderson from the Mar 25 SDG issue, Olga Berezovsky from Jan 23), they go to the CRM candidate list, not directly into a contact file.
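The vault-path convention and the “already-filed” check it enables could be sketched as follows. The helper names (`slugify`, `note_path`, `already_filed`) are hypothetical, and the check assumes at most one filed note per sender per date, which may not hold for senders that publish several issues a day.

```python
import re
from datetime import date
from pathlib import Path


def slugify(text: str) -> str:
    """Lowercase and collapse every run of non-alphanumerics into one hyphen."""
    return re.sub(r"[^a-z0-9]+", "-", text.lower()).strip("-")


def note_path(vault: Path, sent: date, sender_slug: str, topic: str) -> Path:
    """Build 06-reference/<YYYY-MM-DD>-<sender-slug>-<topic-slug>.md."""
    name = f"{sent.isoformat()}-{sender_slug}-{slugify(topic)}.md"
    return vault / "06-reference" / name


def already_filed(vault: Path, sent: date, sender_slug: str) -> bool:
    """True if any note exists for this sender on this date, whatever the topic slug."""
    prefix = f"{sent.isoformat()}-{sender_slug}-"
    return any((vault / "06-reference").glob(prefix + "*.md"))
```

Matching on the date-plus-sender prefix rather than the full filename means a reworded topic slug does not produce a duplicate note for the same issue.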

Lessons from the SDG prototype

What worked well:

Oldest-first order let me notice the series structure and forward-link properly.
Per-article “Mapping against Ray Data Co” section forced me to justify why I was filing. Several articles passed the mapping bar weakly — which is a useful signal that the discipline is tight, not loose.
Sponsor disclosure as a top-level frontmatter field means future filtering/reporting is trivial.
The “This one matters most” framing on the Mar 25 “You Will Know Nothing” article surfaced a real strategic tension for RDCO that I don’t think I’d have articulated without being forced to file the note.

What to improve for the next sender:

Writing 8 articles in one session is context-expensive. Next time: batch fetches, write notes in parallel where possible, cap at 4 articles per session.
Build a sender-slug lookup (just a dict mapping email → folder prefix) so filename generation is deterministic. SDG slug = seattle-data-guy, Stratechery slug = stratechery-ben-thompson, etc.
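Consider whether the 06-reference/ folder will get unwieldy as the backfill grows. The original upper bound was ~10 K-tier senders × ~200 messages each = ~2000 files; inbox-bounded counts (see the count-before-budget rule) put the real total well below that. May still need a per-author subfolder once we’re past ~50 total.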
The Gmail API’s resultSizeEstimate caps at ~200, so “how big is the backfill” requires actual pagination. Not a blocker; just know it.
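The sender-slug lookup proposed above might be as small as the sketch below. Only the two slugs named in this README are filled in; the rest would be added as each sender is onboarded. The names `SENDER_SLUGS` and `sender_slug` are hypothetical.

```python
# Email address -> filename prefix. Only the slugs stated in this README
# are listed; other senders are intentionally left out rather than guessed.
SENDER_SLUGS = {
    "seattledataguy@substack.com": "seattle-data-guy",
    "email@stratechery.com": "stratechery-ben-thompson",
}


def sender_slug(address: str) -> str:
    """Return the registered slug for a sender, or raise if none is registered."""
    key = address.strip().lower()
    try:
        return SENDER_SLUGS[key]
    except KeyError:
        raise KeyError(
            f"no slug registered for sender {key!r}; add one to SENDER_SLUGS"
        ) from None
```

Raising on an unknown sender, instead of deriving a slug from the address, keeps filenames deterministic: a sender cannot end up with two different prefixes across sessions.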

Next actions (pending founder approval)

Backfill the next K sender: recommend Stratechery (60 messages in the inbox per the 2026-04-12 discovery, not the 201+ archive-based estimate; highest expected signal density; single author for a consistent assessment voice).
Set up the watch cron loop once we’ve done 2-3 senders and trust the pattern.
Revisit the 06-reference/ folder structure question at ~50 files.
Handle Data Elixir once founder re-subscribes.

../../06-reference/2026-01-05-seattle-data-guy-data-pipeline-patterns — first backfilled article
../../06-reference/2026-03-25-seattle-data-guy-know-nothing-and-be-happy — the most strategically relevant of the eight
Task #38 in the task list — “Build /process-newsletter skill (multi-source)” — in_progress
Task #4 — “Add Twitter users to CRM workflow” — where newly-surfaced tracked authors feed

Process Newsletter — project README