Process Newsletter — project README
Goal
Ingest email newsletters from a whitelisted set of senders into the vault as structured assessment notes, with bias and sponsor flagging. Separate from Substack-specific tooling — this works on any newsletter that lands in ben@raydata.co regardless of publisher (Substack, Ghost, Mailchimp, self-hosted).
Companion skill: ~/.claude/skills/process-newsletter/SKILL.md.
Status
- Prototype complete as of 2026-04-11 using SeattleDataGuy’s 9-email history as the proving ground (8 backfilled, 1 was pre-existing).
- Skill file drafted. Whitelist locked in.
- Remaining K senders not yet backfilled (pending founder go-ahead for the next batch).
Whitelist — locked in 2026-04-11
K = keep, backfill history + ongoing watch
| Sender | Newsletter | Typical format | Notes |
|---|---|---|---|
| email@stratechery.com | Stratechery (Ben Thompson) | thought-leadership | 60 actual in inbox (2026-04-12 discovery; 3 batch tasks created). Original “201+ in history” estimate was archive-based, NOT inbox-bounded. |
| seattledataguy@substack.com | SeattleDataGuy (Ben Rogojan) | hybrid | ✅ backfill complete (8 articles + 1 pre-existing) |
| notboring@substack.com | Not Boring (Packy McCormick) | hybrid (long-form + Friday optimism curation) | 28 actual in inbox (2026-04-30 discovery; 1 batch task created, 26 already filed). Original “201+ in history” estimate was archive-based. |
| practicaldatamodeling@substack.com | Practical Data Modeling (Joe Reis) | thought-leadership, series-based | 32 actual in inbox (2026-04-30 discovery; 1 new batch task created, 23 already filed). Original “~20+” was closer but still off. ⚠️ Paid sub lapsed 2026-04-24; flagged to founder for resubscribe decision. |
| analyticsengineeringroundup@substack.com | Analytics Engineering Roundup | curation | ~20+ in history (estimate; not yet discovery-scanned, assume lower per inbox-bounded pattern below) |
| hello@every.to | Every | multi-author thought-leadership | 201+ in history (estimate; not yet discovery-scanned, assume MUCH lower per inbox-bounded pattern below) |
| hello@ship30for30.com | Ship30for30 (Start Writing Online) | writing craft / marketing | 30+/180d |
| writewithai@substack.com | Write With AI | writing with AI tools | 8+/180d |
| michaeldean9@substack.com | Essay Architecture (Michael Dean) | essay writing craft | curation-heavy |
| ark@arkinvest.com | ARK Invest (Cathie Wood) | investment commentary | weekly stock commentary |
| newsletter@commoncog.com | Commoncog (Cedric Chin) | thought-leadership, series-based | ~201+ in history (estimate); actual is inbox-bounded (founder re-subscribed 2026-04-15, so the inbox only has messages from that date forward). Operator’s field manual; tacit knowledge, expertise, sensemaking. Highly relevant to RDCO agent-deployer positioning. |
⚠️ Count-before-budget rule (added 2026-04-30)
The original “N+ in history” estimates in the K table were Substack-archive-based, NOT Gmail-inbox-bounded. For the large estimates, real inbox counts have come back substantially smaller (Stratechery: 60 vs 201+ estimate; Not Boring: 28 vs 201+ estimate). PDM was the exception: 32 actual vs a ~20+ estimate, still off but in the other direction. The gap exists because Gmail only holds messages from the subscription date forward, not the sender’s full archive.
Rule for remaining un-scanned senders (Every, Commoncog, Analytics Engineering Roundup, ARK Invest, Write With AI, Ship30for30, Essay Architecture): run discovery with --dry-run semantics first — count messages in inbox BEFORE planning batch sizes. If count is small (<20), skip the batch-task overhead and process inline via Mode 3 (Backfill, legacy small-sender path). Don’t allocate 10+ batch tasks for a sender that only has 25 messages in the inbox.
Implication: the total backfill work is significantly smaller than the README originally implied. Prioritize the per-sender deep-fetch quality (sponsor detection, RDCO mapping discipline) over volume planning.
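The count-before-budget rule can be sketched as a small planning function. This is a minimal illustration, not the skill's actual logic: the `<20` inline threshold comes from the rule above, while `BATCH_SIZE`, `BackfillPlan`, and `plan_backfill` are hypothetical names and values introduced only for this example.

```python
from dataclasses import dataclass

INLINE_THRESHOLD = 20  # from the rule above: fewer than 20 messages -> process inline
BATCH_SIZE = 20        # hypothetical; this README does not fix a batch size


@dataclass
class BackfillPlan:
    sender: str
    message_count: int
    mode: str         # "inline" (Mode 3) or "batch"
    batch_tasks: int  # 0 when processing inline


def plan_backfill(sender: str, message_count: int,
                  batch_size: int = BATCH_SIZE) -> BackfillPlan:
    """Choose a backfill mode from the actual inbox count, never the archive estimate."""
    if message_count < 0:
        raise ValueError("message_count must be non-negative")
    if message_count < INLINE_THRESHOLD:
        return BackfillPlan(sender, message_count, "inline", 0)
    # Ceiling division: a partial final batch still needs its own task.
    tasks = -(-message_count // batch_size)
    return BackfillPlan(sender, message_count, "batch", tasks)
```

Under these assumed values, a sender with 25 inbox messages gets 2 batch tasks rather than the 10+ an archive-based estimate of 201+ would suggest.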
F = follow-forward only, no backfill (watch from now onward)
| Sender | Newsletter | Typical format |
|---|---|---|
| theinnermostloop@substack.com | Innermost Loop (Alex Wissner-Gross) | thought-leadership |
| dataengineeringcentral@substack.com | Data Engineering Central | thought-leadership |
| dataengineeringweekly@substack.com | Data Engineering Weekly (Ananth Packkildurai) | curation |
| technically@substack.com | Technically | thought-leadership |
| news@alphasignal.ai | AlphaSignal | curation (AI/ML news) |
| lon@dataelixir.com | Data Elixir | curation (data science/ML news); founder will resubscribe to ben@raydata.co, currently hits personal inbox |
| semistructured@substack.com | Semi-Structured (Jonathan Natkins) | thought-leadership (data infrastructure for AI agents); added 2026-04-12 |
Known sender-specific gotchas (learned from SDG backfill)
SDG — SeattleDataGuy (fully backfilled)
- Sponsor pattern: Estuary (disclosed adviser relationship) shows up in the top-of-article pre-amble ~60% of the time. SDG also switches to explicit self-consulting CTAs (“today’s article is sponsored by me, the Seattle Data Guy!”) — sometimes at top, sometimes mid, sometimes at bottom. Skill should flag both forms and not key detection off placement.
- Curation section (“Articles Worth Reading”) is roughly 50/50 self-cross-promo vs third-party. He frequently links his own prior posts in the curation slot. The skill should label links by comparing the linked domain/author to the newsletter sender.
- Series structure: SDG runs multi-month series (2026 “data pipelines” series, currently 6 articles in). Later articles reference earlier vocabulary. Backfill order matters — go oldest-first so cross-references make sense.
- Guest posts exist — Olga Berezovsky hosted on Jan 23. Guest posts come with cross-promotion for the guest’s own newsletter in the top slot (different from the Estuary pattern).
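The link-labeling idea for the curation slot could look roughly like the sketch below. It only covers the domain half of the comparison (author matching would need per-sender metadata), and `label_curation_link` plus the `sender_domains` set are hypothetical names; the domains a given sender publishes under are an assumption the caller has to supply.

```python
from urllib.parse import urlparse


def _host(url: str) -> str:
    """Lower-cased hostname with a leading 'www.' removed; '' if none can be parsed."""
    host = (urlparse(url).hostname or "").lower()
    return host[4:] if host.startswith("www.") else host


def label_curation_link(url: str, sender_domains: set[str]) -> str:
    """Label a curation link by comparing its host to the sender's own domains."""
    host = _host(url)
    if not host:
        return "unknown"
    for domain in sender_domains:
        # Match the domain itself or any subdomain of it.
        if host == domain or host.endswith("." + domain):
            return "self-promo"
    return "third-party"
```

Because the label depends only on the link target, it stays independent of where in the issue the link appears, which matches the advice above not to key detection off placement.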
PDM — Practical Data Modeling (Joe Reis) — partial backfill, paid sub lapsed 2026-04-24
- Substack expiry sequence (3-email cadence): “ending in a week” → “ending tomorrow” → “subscription ended.” All three are pure billing/admin emails with zero extractable content. Skip on subject-line pattern alone in Mode 4 watch — don’t read the body. The 2026-04-30 PDM batch [1-5] confirmed this template.
- Pulse-survey launch vs results mislabel risk: A “New pulse survey just dropped” email is the LAUNCH (CTA to take the survey), NOT the results. Joe Reis’s pattern is to present findings at a keynote (e.g., Stockholm May 7 2026) and publish a results-with-dataset post afterward. When discovery pre-triages a “pulse survey” subject, default the assumption to “launch CTA” and look for the follow-up results post 2-4 weeks later. The results post is the high-signal artifact; the launch is borderline-skip with one extractable data point at most.
- Series structure: Long-running MMA chapter series (Ch 1-16 already filed, Feb-Apr 2026). Cross-references rely on chapter order — process oldest-first when backfilling.
- Paid-tier risk on results posts: Reis ships results-with-dataset posts as paid-tier content. Watch the post-Stockholm May 7+ window: if `plain_text` returns a paywall stub, the founder needs to resubscribe (lapsed 2026-04-24) before that batch can process the substantive body.
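The two subject-line rules above (skip the expiry sequence, assume a pulse-survey email is the launch) can be sketched as a pre-triage step. The phrases in the regex are the ones quoted in this README; the exact wording of real Substack subject lines is an assumption, and `pretriage` and its return labels are hypothetical.

```python
import re

# Phrases quoted in the expiry-sequence note; real subject wording may differ.
EXPIRY_SUBJECT = re.compile(
    r"ending in a week|ending tomorrow|subscription (?:has )?ended",
    re.IGNORECASE,
)


def pretriage(subject: str) -> str:
    """Classify a message from its subject line alone, without reading the body."""
    if EXPIRY_SUBJECT.search(subject):
        return "skip-billing"
    if "pulse survey" in subject.lower():
        # Default assumption: launch CTA, not results. Look for the results post later.
        return "assume-launch-cta"
    return "process"
```

A phrase match this loose could misfire on an essay whose title happens to contain “ending tomorrow”, so a real implementation would probably also check the sender or a billing-specific header before skipping.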
Not Boring (Packy McCormick) — partial backfill, ongoing watch
- Hybrid cadence: Mon long-form essay + Fri “Weekly Dose of Optimism” curation. Both formats live in the same inbox; classify per-issue, not per-sender.
- Co-written essays = founder pitches. ~4 of 11 historical long-form essays follow the “Packy hands the keyboard to a portfolio-adjacent founder” pattern (Venezuela, Great Blue Frontier, etc.). Treat the company list and “what’s missing in the platform map” framing as the co-author’s go-to-market collateral, not neutral analysis. The underlying framework is usually still useful; the company endorsements are not. Frontmatter should byline both authors AND surface the structural bias in a Sponsorship section even when no explicit sponsor block exists. The Apr 23 2026 Great Blue Frontier essay (Will O’Brien / Ulysses Maritime) is the canonical exemplar — Packy explicitly disclaimed Not Boring Capital is NOT an investor, but the entire essay is a Series-A-adjacent positioning piece.
- Two layers of sponsor on co-written essays: (a) explicit paid sponsor block (Framer, SVB, etc.) — clean disclosure, low bias; (b) structural sponsor — the co-author’s company. Always disclose both, separately.
- WDoO portfolio-disclosure ambiguity: Packy doesn’t always disclose his angel positions in the curation slot. Absence of “Not Boring Capital is an investor” caveat ≠ no relationship. When a curation item is enthusiastic about a specific company (Apr 24 2026 Medra was the trigger), do a quick public-investor-list verification before treating the framing as neutral. Note the verification result in the assessment (“verified Not Boring Capital not in public Series A list” or “could not verify, treat as messenger”).
- Sister-author cross-promo in WDoO: Kevin Kwok essays appear in the curation slot (“first Kevin Kwok essay in nearly a year”) — friendly-network amplification, not strict third-party. Flag but don’t disqualify; the analysis is usually substantive.
- Series structure in WDoO: themes recur week-to-week (geothermal Quaise → Fervo S-1; AI-for-bio bottleneck → Medra throughput answer). Cross-link adjacent WDoO entries.
- Inbox vs archive count gap: Original “201+ in history” estimate was Substack-archive-based. Real Gmail inbox has ~28 messages from-subscription-date forward. Confirmed pattern across multiple senders.
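One way to keep the two sponsor layers separate in frontmatter is sketched below. The README does not specify the frontmatter schema, so the `Sponsorship` class and the field names `explicit`, `structural`, and `verification` are all hypothetical.

```python
import json
from dataclasses import dataclass, field


@dataclass
class Sponsorship:
    # Layer (a): explicit paid sponsor blocks named in the issue.
    explicit: list[str] = field(default_factory=list)
    # Layer (b): structural sponsors, e.g. a co-author's own company.
    structural: list[str] = field(default_factory=list)
    # Free-text result of any investor-list check; empty if none was run.
    verification: str = ""

    def to_frontmatter(self) -> list[str]:
        """Render both layers separately; JSON values are valid YAML flow syntax."""
        return [
            "sponsorship:",
            f"  explicit: {json.dumps(self.explicit)}",
            f"  structural: {json.dumps(self.structural)}",
            f"  verification: {json.dumps(self.verification)}",
        ]
```

Emitting both keys even when a list is empty means an issue with no explicit sponsor block still carries a visible, queryable Sponsorship section, which is what the co-written-essay rule above asks for.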
Patterns to watch for in other senders (not yet backfilled)
- Stratechery: known to have “This Week in Stratechery” weekly digest + interview-format issues + daily updates. Likely a mix of formats per sender. Needs format-detection per-issue, not per-sender.
- Every: multi-author publication, each issue may have a different byline. Treat each issue as its own author.
- ARK Invest: pure investment commentary. Track for bias since it’s a fund promoting its own positions.
- Ship30for30 / Essay Architecture: writing craft — useful meta, but heavy on marketing-style CTAs. Sponsor detection will be noisy.
- Data Elixir: curation-only once it arrives. Watch for bias toward the tools/platforms Lon is sponsored by.
Skill design — key decisions
- Skill over cron for now. The `/process-newsletter watch` invocation is user-triggered, not scheduled, until we have confidence it doesn’t burn context or generate noise.
- Always-flag, never-filter. Sponsors and bias get disclosed in frontmatter and body, never used to skip an article. The reader needs to see the angle.
- Deep-fetch cautiously. Max 2 link follows per curation issue, only for third-party links clearly relevant to RDCO topics. No paywall traversal.
- Vault-path discipline. All notes go under `06-reference/<YYYY-MM-DD>-<sender-slug>-<topic-slug>.md`. No exceptions. The filename convention is what lets the skill detect “already-filed” without duplicating work.
- Tracked authors feed Task #4. When a guest post or curation link surfaces a new author worth following (Dylan Anderson from the Mar 25 SDG issue, Olga Berezovsky from Jan 23), they go to the CRM candidate list, not directly into a contact file.
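The vault-path convention and the “already-filed” check could be sketched as follows. The two slugs are the ones named in this README; `SENDER_SLUGS`, `slugify`, `note_path`, and `already_filed` are hypothetical helpers, and matching on date plus sender slug assumes a sender publishes at most one issue per day.

```python
import re
from datetime import date
from pathlib import Path

# Slugs named in this README; other senders would need their own entries.
SENDER_SLUGS = {
    "seattledataguy@substack.com": "seattle-data-guy",
    "email@stratechery.com": "stratechery-ben-thompson",
}


def slugify(text: str) -> str:
    """Lower-case and collapse every run of non-alphanumerics into one hyphen."""
    return re.sub(r"[^a-z0-9]+", "-", text.lower()).strip("-")


def note_path(vault: Path, sent: date, sender: str, topic: str) -> Path:
    """Build 06-reference/<YYYY-MM-DD>-<sender-slug>-<topic-slug>.md."""
    slug = SENDER_SLUGS[sender]  # KeyError on purpose: unknown senders are not filed
    return vault / "06-reference" / f"{sent.isoformat()}-{slug}-{slugify(topic)}.md"


def already_filed(vault: Path, sent: date, sender: str) -> bool:
    """True if any note exists for this sender on this date, whatever its topic slug."""
    prefix = f"{sent.isoformat()}-{SENDER_SLUGS[sender]}-"
    return any((vault / "06-reference").glob(prefix + "*.md"))
```

Checking on the date-and-sender prefix rather than the full filename keeps the duplicate check stable even if the topic slug is worded differently on a second run.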
Lessons from the SDG prototype
What worked well:
- Oldest-first order let me notice the series structure and forward-link properly.
- Per-article “Mapping against Ray Data Co” section forced me to justify why I was filing. Several articles passed the mapping bar weakly — which is a useful signal that the discipline is tight, not loose.
- Sponsor disclosure as a top-level frontmatter field means future filtering/reporting is trivial.
- The “This one matters most” framing on the Mar 25 “You Will Know Nothing” article surfaced a real strategic tension for RDCO that I don’t think I’d have articulated without being forced to file the note.
What to improve for the next sender:
- Writing 8 articles in one session is context-expensive. Next time: batch fetches, write notes in parallel where possible, cap at 4 articles per session.
- Build a sender-slug lookup (just a dict mapping email → folder prefix) so filename generation is deterministic. SDG slug = `seattle-data-guy`, Stratechery slug = `stratechery-ben-thompson`, etc.
- Consider whether the `06-reference/` folder will get unwieldy as the backfill grows. At ~10 K senders × ~200 messages each, that would be ~2000 files, though the count-before-budget rule above suggests the real total is far smaller. May need a per-author subfolder once we’re past ~50 total.
- The Gmail API’s `resultSizeEstimate` caps at ~200, so “how big is the backfill” requires actual pagination. Not a blocker; just know it.
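Counting by pagination instead of trusting `resultSizeEstimate` could be sketched like this. `count_messages` and `gmail_fetch_page` are hypothetical helpers; the second assumes a `service` object built with google-api-python-client and uses its `users().messages().list(...)` call, which returns `messages` and `nextPageToken`.

```python
from typing import Callable, Optional

# One page of results: (message ids, next page token or None).
Page = tuple[list[str], Optional[str]]


def count_messages(fetch_page: Callable[[Optional[str]], Page]) -> int:
    """Walk every page and count ids instead of relying on a size estimate."""
    total = 0
    token: Optional[str] = None
    while True:
        ids, token = fetch_page(token)
        total += len(ids)
        if not token:
            return total


def gmail_fetch_page(service, query: str) -> Callable[[Optional[str]], Page]:
    """Adapter for a Gmail API client; `service` is assumed to be built elsewhere."""
    def fetch(token: Optional[str]) -> Page:
        resp = service.users().messages().list(
            userId="me", q=query, pageToken=token, maxResults=500
        ).execute()
        return [m["id"] for m in resp.get("messages", [])], resp.get("nextPageToken")
    return fetch
```

Keeping the page fetcher as a plain callable means the counting loop can be exercised with a fake before any real inbox is touched, for example `count_messages(gmail_fetch_page(service, "from:email@stratechery.com"))` once credentials exist.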
Next actions (pending founder approval)
- Backfill the next K sender. Recommend Stratechery (60 messages in the inbox per the 2026-04-12 discovery, not the archive-based 201+ estimate; highest expected signal density; single author for a consistent assessment voice).
- Set up the `watch` cron loop once we’ve done 2-3 senders and trust the pattern.
- Revisit the `06-reference/` folder structure question at ~50 files.
- Handle Data Elixir once founder re-subscribes.
Related
- ../../06-reference/2026-01-05-seattle-data-guy-data-pipeline-patterns — first backfilled article
- ../../06-reference/2026-03-25-seattle-data-guy-know-nothing-and-be-happy — the most strategically relevant of the eight
- Task #38 in the task list — “Build /process-newsletter skill (multi-source)” — in_progress
- Task #4 — “Add Twitter users to CRM workflow” — where newly-surfaced tracked authors feed