
improve process newsletter

Sat Apr 11 2026 20:00:00 GMT-0400 (Eastern Daylight Time) · skill-improvement

/improve — process-newsletter

Diarized Profile

SKILL: process-newsletter
RUNS SINCE LAST IMPROVE: 409 emails, 9 senders, ~244 vault notes filed
WHAT WORKS WELL:
  - Sponsor detection caught real relationships: Estuary/SDG, cossay/Not Boring, ARK self-promotion, Monologue/Every
  - Oldest-first ordering preserved series context (SDG pipeline series, Stratechery narrative arcs)
  - Discovery + batch decomposition worked — 127 Ship30 emails cleanly split into 7 batches, all autonomous-runnable
  - Skip rates were appropriate: Ship30 89% skip (correct — sales funnel), SDG 0% (correct — all substantive), Stratechery ~2% (1 sick day)
  - Assessment note length is in range: ~183-485 words per note, mostly hitting the 150-300 target with some longer hybrid notes justified
  - RDCO mapping section consistently present and specific (SDG pipeline → autoinv audit, Ship30 prompting → Sanity Check production, AE Roundup Iceberg → lakehouse advisory)
  - Sub-agent parallel processing (max 3 concurrent) kept parent context lean during batch runs
  - Frontmatter schema is consistent across all 9 senders — date, type, source, author, format, sponsored, tags all present
WHAT'S MEDIOCRE:
  - M1: Curation self-cross-promo detection was inconsistent across batches for hybrid newsletters. SDG "Articles Worth Reading" was sometimes labeled as self-cross-promo, sometimes not — depended on whether the sub-agent happened to compare the linked domain to the sender. No explicit instruction to always check.
  - M2: Ship30 triage was correct in aggregate (89% skip), but early batches filed some "writing-tagged but actually sales" emails before the pattern was recognized. The skill says "always-flag, never-filter" for sponsors, but offers no guidance for the case where an entire email is a sales-funnel wrapper with a content veneer.
  - M3: Every's multi-author detection worked but author extraction from HTML body was inconsistent — some notes credited "Every" as author instead of the actual byline (Dan Shipper, Nathan Baschez, etc.). The skill says "real author" in frontmatter but doesn't explain where to find it for multi-author pubs.
  - M4: The "max 2 deep-fetches per curation issue" rule was never tested — no curation newsletter actually triggered a deep-fetch during the full 409-email backfill. The rule exists in theory but has no operational validation.
  - M5: RDCO mapping section quality varies — best notes (SDG pipeline, Ship30 prompting) make specific connections to our work; weaker notes (some ARK investment commentary, some Not Boring essays) say generic things like "relevant to RDCO's AI approach" without specifying what approach or why.
WHAT FAILS:
  - No outright failures — all 244 notes filed correctly, no data loss, no misclassification at the sender level
GAP:
  - Skill claims "max 2 deep-fetches per curation issue" but this was never exercised. AE Roundup is pure curation and processed 6 issues without a single deep-fetch. Either the relevance threshold is too conservative, or the newsletter blurbs already summarize linked content well enough that deep-fetching adds nothing.
  - Skill doesn't address the "sales funnel newsletter" pattern at all. Ship30 is the first sender where the majority of emails should be skipped — the skill only talks about skipping already-filed duplicates, not content-level triage.
  - Skill doesn't specify how to extract author from multi-author publications (Every, potential future additions).

Proposed Changes

Change 1 (LOW RISK) — Add explicit curation self-cross-promo check

Pattern: M1 — inconsistent self-cross-promo detection in curation sections

Old text (Step 3, curation-section self-promo bullet):

- **Curation-section self-promo** — a "curation" link pointing to the same author's prior content. Label as self-cross-promo, not real curation.

New text:

- **Curation-section self-promo** — a "curation" link pointing to the same author's prior content or same publication domain. For every link in a curation section, compare the linked domain and author to the newsletter sender. If they match (e.g. SDG linking to seattledataguy.substack.com), label explicitly as `[self-cross-promo]` in the curation notes. This check must happen for every curation link, not just suspicious ones.

Status: APPLIED — low risk, tightens existing instruction.
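
As a sanity check on what "compare the linked domain and author to the sender" means operationally, here is a minimal sketch. The function name and parameters are hypothetical, not part of the skill file, and assume the sub-agent already has the sender's domain and byline on hand:

```python
from urllib.parse import urlparse

def is_self_cross_promo(link_url: str, sender_domain: str,
                        sender_author: str, link_author: str = "") -> bool:
    """True when a curation link points back at the sender's own domain or byline."""
    linked = urlparse(link_url).netloc.lower().removeprefix("www.")
    if linked == sender_domain.lower().removeprefix("www."):
        return True
    # Domain differs: fall back to comparing the credited author, if known.
    return bool(link_author) and link_author.strip().lower() == sender_author.strip().lower()
```

e.g. an SDG curation link to `seattledataguy.substack.com` returns `True`; an external link with no matching byline returns `False`.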


Change 2 (LOW RISK) — Add content-level triage for sales-funnel senders

Pattern: M2 — no guidance for skipping emails that are pure sales with a content wrapper

Location: After “Process one message” Step 2 (Classify), add new Step 2.5

New text:

### Step 2.5 — Triage: skip or process?

After classifying the format, assess whether the email has extractable intellectual content:

- **Process** if the email teaches a technique, makes an argument, presents data, or curates external links with commentary.
- **Skip** if the email is primarily: a sales CTA (bootcamp launch, discount reminder, replay link), an event logistics email, a drip-sequence nurture email with motivational framing but no extractable technique, or a duplicate/repackage of content already filed.

When skipping, log the message ID and skip reason in the batch summary. Do not create a vault note for skipped emails.

This is distinct from sponsor detection — a sponsored article with real content gets processed and flagged. A sales email with no content gets skipped entirely.

Status: APPLIED — low risk, codifies the triage pattern that emerged naturally during Ship30 processing.
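
The skip criteria above amount to a pre-filter the sub-agent applies before drafting a note. A rough sketch, with entirely hypothetical signal lists (the real call is judgment, not keyword matching):

```python
# Hard-sell phrases typical of funnel emails (illustrative, not exhaustive).
SKIP_SIGNALS = ("last chance", "discount", "enroll now", "replay",
                "doors close", "seats left")
# Markers that suggest extractable intellectual content.
SUBSTANTIVE_SIGNALS = ("how to", "framework", "data", "analysis", "technique")

def triage(subject: str, body: str) -> str:
    """Return 'skip' or 'process' per Step 2.5's criteria (crude keyword proxy)."""
    text = f"{subject} {body}".lower()
    sell_hits = sum(1 for s in SKIP_SIGNALS if s in text)
    substantive = any(s in text for s in SUBSTANTIVE_SIGNALS)
    # Multiple hard-sell signals and nothing substantive: funnel wrapper, skip it.
    return "skip" if sell_hits >= 2 and not substantive else "process"
```

A sponsored essay with real content still returns `"process"` here, matching the distinction from sponsor detection.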


Change 3 (LOW RISK) — Add multi-author publication guidance

Pattern: M3 — author extraction inconsistent for Every

Location: Step 5 frontmatter template, after author: field

New text (added as a note below the frontmatter block):

**Multi-author publications** (e.g. Every): extract the actual byline author from the email body or subject line, not the publication name. Look for patterns like "by Dan Shipper" in the subject, a byline in the first 200 chars of the body, or an `X-Author` header. If no individual author is identifiable, use `<Publication> (staff)`.

Status: APPLIED — low risk, prevents a known mis-attribution pattern.
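
The byline extraction the note describes can be approximated as below; the regex and fallback string are illustrative assumptions, not the skill's literal implementation:

```python
import re

def extract_author(subject: str, body_text: str, publication: str) -> str:
    """Pull a byline from the subject or the top of the body; else '<Publication> (staff)'."""
    # "by Dan Shipper" style, checked in the subject and the first 200 chars of the body.
    for source in (subject, body_text[:200]):
        m = re.search(r"\bby ([A-Z][\w.'-]+(?: [A-Z][\w.'-]+)+)", source)
        if m:
            return m.group(1)
    return f"{publication} (staff)"
```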


Change 4 (LOW RISK) — Strengthen RDCO mapping quality bar

Pattern: M5 — some mapping sections are generic (“relevant to RDCO’s AI approach”)

Location: Step 5, body section for “Mapping against Ray Data Co”

Old text:

- `## Mapping against Ray Data Co` — the load-bearing section. Where does this reinforce existing discipline, surface a gap, or contradict something we already believe? No mapping = no reason to file.

New text:

- `## Mapping against Ray Data Co` — the load-bearing section. Where does this reinforce existing discipline, surface a gap, or contradict something we already believe? **Specificity test:** the mapping must name a concrete RDCO artifact (a script, a vault note, a project, a newsletter issue, a decision) or a specific open question. "Relevant to our AI approach" fails. "Connects to the autoinv bias-audit gap from [[note]]" passes. No mapping = no reason to file.

Status: APPLIED — low risk, adds a concrete quality test to an existing instruction.
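
One mechanical proxy for the specificity test: does the mapping section contain at least one concrete artifact reference, such as a `[[wikilink]]` or a file name? A sketch (the pattern is an assumption; a human still makes the final call on whether the connection is real):

```python
import re

def mapping_is_specific(mapping_text: str) -> bool:
    """Crude pass/fail: require a [[wikilink]] or a script/note filename."""
    # Matches [[any internal link]] or something like autoinv_audit.py / note.md.
    return bool(re.search(r"\[\[[^\]]+\]\]|\b[\w-]+\.(?:py|md|sql)\b", mapping_text))
```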


Change 5 (STRUCTURAL — QUEUED) — Deep-fetch policy revision

Pattern: GAP — deep-fetch rule was never exercised across 409 emails

The “max 2 deep-fetches per curation issue” rule was reasonable in theory but zero deep-fetches happened during the full backfill. Two possible explanations:

  1. The relevance threshold is too conservative (requirement #2: “clearly crosses a relevance threshold”) — sub-agents may be interpreting “clearly” as “obviously,” which filters out most candidates.
  2. Curation newsletters in our whitelist (AE Roundup, SDG’s curation section, Not Boring Friday dose) link to content that’s already summarized well enough in the newsletter blurb that deep-fetching adds no value.

Proposal: Lower the deep-fetch threshold from “clearly crosses” to “plausibly crosses” the relevance threshold, and add an explicit instruction: “If processing a curation newsletter and zero links triggered a deep-fetch, note this in the batch summary as a signal to review whether the threshold is too high.”

Status: QUEUED for founder review — changes the economics of the skill (more token spend per curation issue).
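
If the zero-fetch reporting is adopted, the batch-summary entry could look like the sketch below. The function and field names are hypothetical:

```python
def summarize_deep_fetches(issue_id: str, links_considered: int, fetched: int) -> dict:
    """Build a batch-summary entry; flag the zero-fetch case for threshold review."""
    entry = {"issue": issue_id, "links": links_considered, "deep_fetches": fetched}
    if links_considered > 0 and fetched == 0:
        # The signal Change 5 asks for: zero fetches on a curation issue.
        entry["flag"] = "zero-deep-fetch: review relevance threshold"
    return entry
```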


Change 6 (STRUCTURAL — QUEUED) — Folder structure at scale

The vault now has ~244 newsletter notes in 06-reference/ alongside ~240 other reference files (moonshots, books, misc). Total is ~489 files. The README predicted this would get unwieldy at ~50 files and suggested per-author subfolders.

Proposal: Introduce 06-reference/newsletters/<sender-slug>/ subfolders once the total file count crosses 500. This would require updating the skip-detection logic (which checks 06-reference/ for existing files) and the file path convention in Step 5.

Status: QUEUED for founder review — structural change to vault layout, affects multiple skills and cross-links.
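
The proposed path convention, sketched with a hypothetical helper (`note_path` and the threshold wiring are illustrative, and the skip-detection logic would need the same branch):

```python
import re
from pathlib import Path

def note_path(vault: str, sender: str, filename: str, total_ref_files: int) -> Path:
    """Flat 06-reference/ below the 500-file threshold; per-sender subfolders above it."""
    base = Path(vault) / "06-reference"
    if total_ref_files < 500:
        return base / filename
    # Slugify the sender name: lowercase, non-alphanumerics collapsed to hyphens.
    slug = re.sub(r"[^a-z0-9]+", "-", sender.lower()).strip("-")
    return base / "newsletters" / slug / filename
```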

Changes Applied to Skill File

  1. Curation self-cross-promo check — tightened Step 3 to require domain/author comparison for every curation link
  2. Content triage step (2.5) — added explicit skip criteria for sales-funnel emails
  3. Multi-author extraction — added guidance after frontmatter template for multi-author publications
  4. RDCO mapping specificity test — added concrete artifact requirement to the mapping section instruction
  5. Changelog entry — added at bottom of skill file

Changes Queued (Structural — Needs Founder Review)

  1. Deep-fetch threshold revision — lower from “clearly crosses” to “plausibly crosses,” add zero-fetch reporting
  2. Folder restructuring — per-sender subfolders when 06-reference/ crosses 500 files