

2026-05-04 20:00 EDT · concept-seed · status: idea-only
tags: bet-candidate · l5 · intelligence-rails · agents · books · licensing · micropayments

Bookstore for Agents - concept seed

Filed 2026-05-05 from a founder iMessage during the TDD-canon-filing thread. Not a commitment. Idea-stage only. Future-self should not romanticize this as a tech play; the bottleneck is publisher licensing, not infrastructure.

The spark

Founder, 2026-05-05 14:59 ET, while answering the procurement question on TDD canon (which books are open, which are paywalled):

“It’s sparking an idea. Agents need a bookstore. Like how Amazon started with books. So many of these books have been digitized, we just need to make them accessible to agents.”

The thesis (compact)

Agents currently access knowledge via three terrible paths:

  1. Training data - frozen, fuzzy, no provenance, often legally murky.
  2. Open web retrieval - paywall-blocked, low-fidelity scrapes, contradictory signals.
  3. Hand-curated context (vaults, RAG) - works, but the operator has to do the curation themselves.

Books - the densest, most-edited, highest-trust knowledge artifacts humanity ships - are mostly absent from agent context except via the murkiest of the three paths (training data, often not consented to). Domain expertise that requires book-grade input (law, medicine, finance, engineering, accounting, primary research) is consequently weaker than it should be.

The Amazon parallel is sharp: Amazon started with books because books are a homogeneous SKU with deep latent demand and low operational complexity per unit. “Books for agents” sits at the same shape: API-accessible book content, queryable per-chapter or per-passage, with provenance, citation, and royalty enforcement.

Where it sits in the L5 thesis

Per Karl Mehta’s “Commoditization of LLM Models” (filed ~/rdco-vault/06-reference/2026-05-04-karlmehta-llm-commoditization-intelligence-rails.md), durable value moves up the stack from raw inference into orchestration / evals / RAG / memory / vertical applications. Books-for-agents sits between the model layer and the vertical-app layer, as a content-access rail. Specifically:

If this layer becomes a clean primitive, the vertical-app explosion above it accelerates. That’s the same dynamic AWS unlocked for software services starting in 2006.

The bottleneck (read this before romanticizing)

The hard part isn’t tech. The hard part is publisher licensing.

The companies that win here are publisher-relations companies wearing a tech-company costume. The closest analog is Spotify vs the music labels - Daniel Ek’s job for the first five years was deal-making, not engineering.

Existing players (rough scan)

The market between O’Reilly (subscription-for-humans) and Project Gutenberg (PD-only) is mostly empty. That’s the gap.

Targeting-system filter (per feedback_targeting_system_prioritization_filter)

Before this becomes a real bet, apply the four-layer filter:

| Layer | Status |
| --- | --- |
| Targeting | Identifiable: vertical-AI startups in legal / medical / financial / engineering. Specifically the ones that are stuck because their agents don't have book-grade context. Need 5-10 customer interviews to validate the pain is acute. |
| Instrumentation | Clear metrics: API queries per month, content licensing revenue, gross margin per query, publisher LTV, customer LTV. Standard SaaS shape. |
| Tools | Build cost low (chunking + embedding + serve = commoditized). License acquisition cost is the binding constraint. |
| Feedback loop | Hard to test pre-launch: would publishers sign a contract before seeing customer demand? Would customers commit before seeing licensed content? Classic chicken-egg. |

Verdict: the idea passes the smell test, but the bottleneck (publisher licensing) is so dominant that this is more a publisher-relations bet than a tech bet. The founder is currently optimized for tech-bet shape (lean dev cycles, small team, automation-leveraged). Publisher-relations bet shape is the opposite: BD-heavy, slow, capital-intensive, lawyer-dense.

Open questions for future analysis

  1. Could RDCO start with public-domain only (Project Gutenberg + JSTOR open content + arXiv + open textbooks) and prove the agent-API shape before tackling publisher licensing? That gives a real product and customer validation without the licensing bottleneck.
  2. Is there a wedge with a single publisher (e.g., O’Reilly, who already monetizes book access and is technical-first) where a partnership lets RDCO prove the model on one catalog before going broad?
  3. What’s the per-query royalty rate that publishers would accept? A back-of-envelope: at $0.001/query and 10M queries/month that’s $10K/month royalty per publisher - too small to move the needle for a Big Five publisher but maybe interesting for niche / academic / professional publishers.
  4. Does MPP/Tempo make this easier specifically? Per-query micropayments at fractions-of-a-cent that aggregate per publisher per month is exactly what Tempo enables. The publisher dashboard could be “here’s $X this month from Y queries across Z titles.”
  5. Who owns the agent-side UX? Is this a B2B API for agent builders (Anthropic, OpenAI, agent-platform companies) or B2C for end users running their own agents? Different go-to-market entirely.

The PD-only wedge (added 2026-05-05 PM after founder iteration)

Founder pushed back on the publisher-relations bottleneck framing with two questions:

“What if we started with books that are free of copyright? Public domain? One of the AI bets is that there are latent discoveries to be made. Dots to be connected from prior research and writing that no one was able to see all at once. Surely some of this stuff is hanging out there in the public domain. It may not yet be digitized, which would create a different bottleneck than publisher relations - or does digitizing it somehow reset the publishers copyright?”

The PD-only wedge is real and substantially changes the bet shape. Two answers + a recap of the corpus + a wedge-product candidate.

Settled US law (and EU as of 2019): digitization does not reset copyright. A faithful scan or transcription of a public-domain work is not original enough to earn new protection (Bridgeman Art Library v. Corel, S.D.N.Y. 1999, for exact photographic reproductions; there is no "sweat of the brow" copyright in the US per Feist v. Rural, 1991). The EU codified the same principle for reproductions of public-domain visual art in Article 14 of the 2019 DSM Copyright Directive. Whoever digitizes a PD book acquires no rights over the underlying text.

Caveat: if a modern edition adds editorial work (introductions, footnotes, scholarly annotations, modern translations of foreign-language PD works), the modern additions ARE copyrighted. RDCO must either use the original PD source directly, do its own digitization, or carefully strip modern editorial additions when ingesting from copyrighted-edition scans.

What’s actually in the PD corpus (US, 2026)

The “rolling 95-year wall” puts everything published before 1930 in PD as of 2026. Plus categorical exclusions and open-access content:

| Corpus | Approximate scale | Quality | Status |
| --- | --- | --- | --- |
| Project Gutenberg | ~70,000 books | Clean, proofed text | Already digitized |
| HathiTrust Digital Library | ~17M volumes total, several million PD | Variable OCR, library-grade scan | Partly digitized |
| Internet Archive Open Library | Millions of scanned books | Variable, mixed PD/in-copyright | Partly digitized |
| USPTO Patent Bulk Data | ~12M US patents (1790-present) | Mixed OCR for old, structured XML for new | Fully digitized |
| CourtListener / Caselaw Access Project | Every US federal court opinion ever | Clean, structured | Fully digitized |
| PubMed Central Open Access | ~7M biomedical papers | Clean | Fully digitized |
| arXiv | ~2.4M open preprints (mostly STEM) | Clean | Fully digitized |
| JSTOR Early Journal Content | ~500K pre-1923 journal articles | Clean | Fully digitized |
| Library of Congress digital | Millions of historical documents | Variable | Partly digitized |
| NASA Technical Reports Server | ~1M+ reports | Clean | Fully digitized |
| US gov publications (FDsys / GovInfo) | Congressional Record, Federal Register, GAO reports | Clean | Fully digitized |

What PD does NOT cover: post-1929 fiction, modern textbooks, modern bestsellers, most current trade non-fiction, modern reference works (medical, legal, engineering, etc.). PD-only wedge does NOT compete with O’Reilly Learning Platform on technical books, etc. It competes in different verticals.

The latent-discoveries angle (founder’s thesis, sharpened)

Founder’s framing: “there are latent discoveries to be made. Dots to be connected from prior research and writing that no one was able to see all at once.”

This is genuinely sharp. The PD corpus has never been agent-queryable as a unified surface.

New bottleneck under PD-only: corpus assembly + OCR + metadata, not licensing

The work shifts from BD/legal to engineering and curation:

  1. OCR quality is uneven. HathiTrust and Internet Archive scans range from clean to barely-readable. Project Gutenberg is already proofed. Modern OCR (Tesseract 5, GCV, Azure OCR, OpenAI Vision) on old scans gets 90-99% accuracy depending on print quality. Cleanup is real labor but tractable.
  2. Metadata cleanup. Bibliographic records are inconsistent across sources (HathiTrust uses MARC; Internet Archive uses Dublin Core; Gutenberg has its own schema). Unifying these into one queryable graph is a corpus-engineering project.
  3. Embedding + serving infrastructure. Commoditized (Pinecone, Weaviate, pgvector, OpenAI embeddings). Standard.
  4. Citation + provenance. This is where agent-native bookstore differs from generic RAG: every passage returned must carry full bibliographic provenance (author, title, publication date, edition, page number, paragraph). The agent caller needs to be able to verify and cite. This is solvable but requires careful schema.
  5. Adversarial pollution. PD corpus contains a lot of garbage (advertising tracts, vanity-press 19th-century books, partisan hackwork). Quality filtering at ingestion is real work.

Tractable for a small team. No publisher BD. No multi-million-dollar licensing rounds. The capital and timeline shift from “BD-heavy 3-year ramp” to “engineering-heavy 6-month MVP.”
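The citation + provenance requirement (point 4) is mostly a schema decision. A minimal sketch of what every returned passage could carry; field names and the sample values are illustrative, not a spec:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)
class PassageProvenance:
    """Bibliographic provenance every returned passage carries so the
    calling agent can verify and cite (field names are illustrative)."""
    author: str
    title: str
    publication_date: str       # year or ISO date
    edition: Optional[str]
    page: Optional[int]
    paragraph: Optional[int]
    source_archive: str         # e.g. "gutenberg", "hathitrust"
    source_id: str              # archive-native identifier

@dataclass(frozen=True)
class Passage:
    text: str
    provenance: PassageProvenance

    def citation(self) -> str:
        p = self.provenance
        loc = f", p. {p.page}" if p.page is not None else ""
        return f"{p.author}, {p.title} ({p.publication_date}){loc}"

p = Passage(
    text="...",
    provenance=PassageProvenance(
        author="Charles Darwin",
        title="On the Origin of Species",
        publication_date="1859",
        edition="1st",
        page=490,
        paragraph=2,
        source_archive="gutenberg",
        source_id="1228",
    ),
)
```

The point of freezing this at the schema layer is that generic RAG stacks drop exactly these fields; carrying them end-to-end is what makes the output citable rather than merely retrieved.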

Wedge product candidates (pick ONE to validate)

Rank-ordered by likely tractability + revenue:

  1. USPTO patent prior-art search agent. Corpus is fully digitized + structured. Customer pain is severe (patent attorneys, R&D teams, IP litigators). Existing tools (Google Patents, USPTO PAIR, paid services like PatentSight) are weak on semantic search. Agent-native query against the full corpus with citation is a clear product. B2B, $X00-X,000/month/seat plausible. Direct revenue path.

  2. Legal research agent over PD federal case law. CourtListener + Caselaw Access Project. Customers: solo + small-firm attorneys priced out of Westlaw / Lexis. Lower ARPU than enterprise legal tech, but a TAM nobody serves well. Risk: legal liability for incorrect citations.

  3. Forgotten-medicine research agent for pharma R&D. Pre-1929 clinical literature + PubMed open access. Customers: pharma R&D teams, NIH-funded labs, biotech founders. Niche but high-willingness-to-pay. Risk: hard to validate value without 6-month customer trials.

  4. Cross-disciplinary synthesis agent. Most ambitious; least focused. The “let an agent connect dots across all PD content” framing. Risk: undefined customer; hard to find the wedge.

Recommendation if pursuing: wedge product 1 (USPTO patent prior-art search). Smallest, sharpest, fastest to validate, clear customer pain, fully-digitized corpus, no metadata cleanup nightmare, immediate revenue path, also serves as proof-of-concept for the agent-citation infrastructure that bigger wedges need.

Live datapoint: the owner-purchases-then-shares pattern (added 2026-05-05 11:30 ET)

While we were discussing this idea, the founder bought Beck’s “TDD by Example” PDF on his own Pearson account, then dropped it into Discord for Ray to ingest. Ray:

  1. Downloaded the PDF
  2. Stored it in ~/Documents/library/books/ (out of the git-tracked vault tree to avoid copyright leakage)
  3. Created ~/rdco-vault/04-tooling/personal-library-index.md as a vault-side index pointing to the non-tracked PDF path
  4. Dispatched a sub-agent to upgrade the prior vault note from reconstructed-from-web to source_fidelity: primary-source, citing Beck directly and using exact terminology (Fake It, Triangulate, etc.)

This is the consumer-side bookstore-for-agents pattern in microcosm: the owner purchases on their own account, the agent stores the file outside the tracked tree, indexes it, and upgrades the relevant notes to primary-source fidelity.

Implication for the wedge analysis: the bookstore-for-agents v0 might not be a publisher-relations bet OR a PD-corpus bet at all. It might be a personal-library-as-agent-context product.

This shifts the shape AGAIN: from publisher-relations bet (slow, capital-intensive) to PD-corpus bet (engineering-heavy) to personal-library-RAG bet (consumer SaaS, lean, immediately validatable).

Existing players in this exact space: Readwise (highlights + summaries), Calibre (DIY ebook library management), Notion AI (notes + RAG over your stuff), but nobody has built the agent-API-over-personal-library primitive cleanly.

This is the third candidate wedge alongside USPTO patent prior-art and federal case law. Worth considering.

Recommendation (revised 2026-05-05 PM)

The PD-only framing turns this from a publisher-relations bet (BD-heavy, slow, capital-intensive) into a corpus-engineering bet (engineering-heavy, weeks-not-years, lean-team). That fits RDCO’s operating model.

Status remains idea-only but with a clear pre-validation path that costs near-zero Ray cycles:

  1. 5 customer-discovery calls with patent attorneys / IP litigators. Single question: “what would you pay for an agent that can semantically search the full USPTO corpus for prior art with citation, given existing tools?” Answers the targeting layer.
  2. Corpus reachability test (1-2 hours of Ray work): confirm USPTO bulk data download endpoint, file sizes, format, license terms. Confirm CourtListener API is open and rate-tolerable.
  3. MVP scope definition based on (1) and (2). Then decide.

Founder controls (1). Ray can do (2) and (3) in idle cycles without full bet activation.

Park decision: not a current Ray-priority bet, but no longer parked indefinitely - moved to “pre-validation phase, founder-gated by 5 discovery calls.”

Wedge shape-test: USPTO patent prior-art search agent (added 2026-05-05 PM)

This section pressure-tests wedge candidate #1 (“USPTO patent prior-art search agent”) with concrete evidence from the live USPTO data infrastructure, the prior-art search market, and a back-of-envelope unit-economics check. Goal: give the founder enough signal to either start 5 customer-discovery calls or fall out of love.

1. Corpus reachability (verified via live fetch)

Bulk endpoint. USPTO retired the legacy bulkdata.uspto.gov host (now ECONNREFUSED on direct fetch) and migrated everything to the Open Data Portal at https://data.uspto.gov/. The portal launched February 2025 and the legacy beta API hub at developer.uspto.gov is scheduled for decommissioning on May 29, 2026.

Datasets confirmed available on the ODP (each is a separate dataset path under https://data.uspto.gov/bulkdata/datasets/):

| Dataset slug | Contents | Format |
| --- | --- | --- |
| PTGRXML | Patent Grant Full-Text Data, no images | XML |
| appxml | Patent Application Full-Text Data, no images | XML |
| ptblxml | Patent Grant Bibliographic / front-page only | XML |
| pasdl | Patent Assignment XML, daily ownership transfers | XML |
| pedsxml | Patent Examination Data System (PAIR successor) | XML |
Source: USPTO ODP Bulk Data Directory at https://data.uspto.gov/bulkdata (search-result snippets; the live page is JS-rendered and returns empty content via WebFetch, so direct URL inspection of dataset metadata could not be verified).

Volume. US utility patent #12,000,000 issued June 4, 2024 (Sandberg Phoenix tracker at https://sandbergphoenix.com/the-u-s-patent-office-issues-patent-number-12000000/). At ~370K grants/year (USPTO Patents Dashboard, https://www.uspto.gov/dashboard/patents/), the corpus as of mid-2026 is roughly 12.6M granted utility patents plus design (~1.2M) and plant (~40K). Add published applications post-2001 and you’re at ~17-18M documents total. Average patent body is ~10-30KB of text plus drawings; full-text-only XML for the corpus is in the 150-300GB compressed range. Rough; needs verification by actually pulling a year’s archive.

License. USPTO Terms of Use page (https://www.uspto.gov/terms-use-uspto-websites): “most government-produced materials” on USPTO are public domain, “freely distributed and copied,” with requested acknowledgment. Caveat the page itself flags: a small fraction of patents embed third-party copyrighted material (drawings, photographs, embedded trademarks especially in design patents). For a prior-art search product this is a non-issue; we serve text + bibliographic data + pointers, not images.

Cleaner third-party mirror. The legacy patentsview.org redirects to https://data.uspto.gov/support/transition-guide/patentsview (the project was absorbed into ODP). The PatentsView Search API survives at https://search.patentsview.org/api/v1 with documented endpoints at https://search.patentsview.org/docs/docs/Search%20API/SearchAPIReference/. Auth is X-Api-Key. Rate limit: 45 requests/minute per key with 429 + Retry-After on overage (PatentsView forum: https://patentsview.org/forum/7/topic/781). The API is good for incremental updates and the disambiguated-inventor / assignee tables, but for full-text agent retrieval you want the bulk XML, not the API.
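A minimal client sketch against the documented limits above (45 req/min, 429 + Retry-After, X-Api-Key auth). The query/field encoding follows the Search API reference's general shape, but the specific field names, retry counts, and parameter handling here are assumptions to be checked against the live docs, not verified behavior:

```python
import json
import time
import urllib.error
import urllib.parse
import urllib.request

API = "https://search.patentsview.org/api/v1/patent/"

def retry_delay(headers, default=60):
    """Seconds to wait after a 429, honoring Retry-After when present."""
    try:
        return max(0, int(headers.get("Retry-After", default)))
    except (TypeError, ValueError):
        return default

def search_patents(api_key, query, fields=("patent_id", "patent_title")):
    """One search call with naive 429 backoff. Field names are illustrative."""
    url = (API + "?q=" + urllib.parse.quote(json.dumps(query))
           + "&f=" + urllib.parse.quote(json.dumps(list(fields))))
    req = urllib.request.Request(url, headers={"X-Api-Key": api_key})
    for _ in range(3):  # caller should also pace itself under 45 req/min
        try:
            with urllib.request.urlopen(req) as resp:
                return json.load(resp)
        except urllib.error.HTTPError as e:
            if e.code != 429:
                raise
            time.sleep(retry_delay(e.headers))
    raise RuntimeError("still rate-limited after retries")
```

As the paragraph above notes, this shape is for incremental updates and the disambiguated inventor/assignee tables; full-text agent retrieval should come from the bulk XML.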

Cost to host and serve. Compressed XML ~200GB; uncompressed text ~600GB; chunked + embedded at 1024-dim float16 (~2KB per chunk) with ~8 chunks per document gives ~150M chunks at ~300GB raw vector storage. Production setup on commodity infra (S3 + a managed vector DB or self-hosted pgvector + Postgres for metadata): storage $50-150/month, embedding generation one-time $5-15K at OpenAI text-embedding-3-large rates ($0.13 per 1M tokens; ~150M chunks at a few hundred tokens each is ~50-60B tokens for the full corpus). Annual incremental embedding for new grants: <$500/yr. Compute for serving is the larger line item but scales with traffic, not corpus size.
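A quick check of the embedding line item, assuming ~400 tokens per chunk (the total is very sensitive to this assumption):

```python
# Back-of-envelope for the one-time embedding spend on the full corpus.
CHUNKS = 150_000_000            # chunk estimate from the paragraph above
TOKENS_PER_CHUNK = 400          # ASSUMPTION, not measured
PRICE_PER_M_TOKENS = 0.13       # text-embedding-3-large, USD per 1M tokens

tokens = CHUNKS * TOKENS_PER_CHUNK                         # 60B tokens
embed_cost_usd = tokens / 1_000_000 * PRICE_PER_M_TOKENS   # ~$7,800
storage_gb = CHUNKS * 2 * 1024 / 1e9                       # ~307 GB at 2 KB/chunk
```

That lands near the low end of the $5-15K range; larger chunks or richer chunk overlap push it toward the high end.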

2. Customer profile + ICP

Target customer for v0. Patent attorneys at small-and-mid-sized IP boutiques (3-30 attorneys), in-house IP counsel at growth-stage startups filing 5-50 patents/year, and freedom-to-operate analysts at biotech/medtech companies pre-launch. Specifically NOT BigLaw IP groups (locked into Westlaw/PatSnap multi-year contracts) and NOT solo inventors (no budget, single-search frequency).

Current price-per-search market anchor. A professional prior-art search runs $1,500-$4,000 per search, taking 10-15 hours of attorney/searcher time, per multiple industry guides (https://emanus.com/how-much-does-a-patent-cost/, https://www.upcounsel.com/patent-search-cost). Software / AI / medical tech sit at the high end. This is the human-labor anchor, not the tooling anchor.

Tooling pricing anchor. Questel Orbit Intelligence runs $15K-$500K/year depending on org size (Goodfirms vendor data at https://www.goodfirms.co/software/orbit-intelligence). LexisNexis PatentAdvisor and Innography do not publish per-seat pricing publicly; per industry conversation, BigLaw enterprise contracts are typically $10K-$25K/seat/year. PatSnap, Clarivate Derwent, and Anaqua sit in similar bands. Google Patents is free but covers 120M+ global publications with weak semantic search and no enterprise SLA (https://patents.google.com/).

Pain that an agent-native solution addresses. Existing tools are designed for human searchers running keyword + classification queries iteratively over hours. Three specific pains an agent surface fixes:

  1. Semantic recall. Keyword + CPC class search misses prior art that uses different terminology. Agent embedding-based retrieval surfaces conceptually adjacent patents the searcher would not have queried. This is the strongest single differentiator.
  2. Citation-grade output. Existing tools return lists; the attorney still does the synthesis. An agent can produce a draft prior-art memo (claim-by-claim mapping with citations, anticipation/obviousness flags) in minutes, compressing most of the 10-15 hour search into review time.
  3. Cost compression. A $1,500-$4,000 manual search becomes a $50-$200 agent-assisted search with the attorney spending 1-2 hours reviewing instead of 10-15 hours searching. That’s a 5-10x labor compression at the unit level.

Realistic ICP. ~14,000 registered US patent attorneys (USPTO Roster) + ~5,000 patent agents. Address roughly the 30% in small/mid boutiques and growth-stage in-house = ~5,500 seats addressable in year 1-2. ARPU range $200-$600/seat/month for an agent search tool ($2.4K-$7.2K/year). Gross margin target 75-85% (storage + embedding amortized; per-query LLM cost is the variable line). At $3K ARPU, 550 seats (10% of the addressable base) = ~$1.7M ARR; 2,750 seats (50%) = ~$8.3M ARR. Real numbers, not vapor.

3. Customer journey (concrete scenario walkthrough)

Scenario: freedom-to-operate review for a novel solid-state lithium-metal battery cathode chemistry, claim language drafted but not yet filed.

  1. The IP counsel pastes the draft independent claim into the agent: “We claim a cathode comprising lithium nickel manganese cobalt oxide doped with at least 0.5 mol% zirconium, with a particle-surface coating of lithium aluminum titanate, configured for cycling stability above 4.5V vs Li/Li+.”
  2. The agent decomposes the claim into ~6 conceptual elements (NMC cathode, Zr doping, LATP coating, high-voltage operation, particle-level coating geometry, stability claim) and runs parallel semantic + CPC-class queries against the indexed USPTO corpus, returning top-200 candidates per element with relevance scores.
  3. The agent reranks the union (~600-800 patents) for claim-element coverage and surfaces the top 30 with element-by-element overlap heat maps.
  4. The agent drafts a structured FTO memo: per claim element, the top 5 patents that read on it with citation (patent number, claim, column:line), an anticipation flag where a single patent reads on >70% of elements, and a synthesis paragraph.
  5. Deliverable: a 6-10 page markdown / Word doc with hyperlinked citations to each cited patent on USPTO ODP, ready for attorney review and a filing decision. The attorney spends 60-90 minutes reviewing instead of 10-15 hours searching.
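The rerank in the walkthrough reduces to a union + coverage-scoring problem once per-element retrieval results exist. A toy sketch (patent IDs and element names invented):

```python
from collections import defaultdict

def coverage_map(element_hits):
    """Union the per-element candidate lists and score each patent by the
    fraction of claim elements it hits (the overlap heat map above).
    element_hits: {element_name: [patent_id, ...]} from per-element retrieval."""
    elements_by_patent = defaultdict(set)
    for element, hits in element_hits.items():
        for pid in hits:
            elements_by_patent[pid].add(element)
    n = len(element_hits)
    return sorted(
        ((pid, len(els) / n, sorted(els))
         for pid, els in elements_by_patent.items()),
        key=lambda row: -row[1],
    )

# Toy retrieval results for three of the claim elements.
hits = {
    "nmc_cathode": ["US111", "US222", "US333"],
    "zr_doping": ["US111", "US444"],
    "latp_coating": ["US111", "US222"],
}
ranked = coverage_map(hits)
```

A coverage fraction above ~0.7 is what would raise the anticipation flag in the memo step.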

4. Technical architecture sketch (v0)
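A minimal sketch of the ingest side implied by the MVP scope (parsed PTGRXML record -> section chunks -> embed -> pgvector); the function name, section names, and chunking policy are illustrative, not a design decision:

```python
from dataclasses import dataclass

@dataclass
class Chunk:
    patent_id: str
    section: str   # e.g. "abstract", "claims", "description"
    text: str

def chunk_patent(patent_id, sections, max_chars=2000):
    """Split each section of one parsed grant into fixed-size chunks,
    keeping patent id + section so every hit can be cited to its source."""
    out = []
    for name, text in sections.items():
        for i in range(0, len(text), max_chars):
            out.append(Chunk(patent_id, name, text[i:i + max_chars]))
    return out

# Toy input standing in for one parsed PTGRXML record.
chunks = chunk_patent("US12000000", {"abstract": "A" * 2500, "claims": "C" * 1500})
```

Each chunk then gets an embedding row in pgvector plus a metadata row in Postgres; query-time is the claim-decomposition flow from the scenario above run against this index.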

5. MVP scope cut + 4-6 week build estimate

MVP (4-6 weeks, 1-2 engineers). Ingest USPTO PTGRXML 1976-present (utility only, no design/plant), embed and index in pgvector, ship a CLI + thin web UI that takes a draft claim and returns a 30-patent prior-art shortlist plus a 1-page synthesis with citations. Hosted on a single Hetzner / Fly.io instance + S3 for the XML archive. Charge $500/mo flat for unlimited searches in the closed-beta period.

Explicitly OUT of scope for MVP: pre-1976 OCR’d patents, design/plant patents, foreign patents (EPO/JP/CN/KR/WIPO), non-patent-literature (papers, products, manuals) which is what real FTO requires, file-history retrieval, examiner-data analytics, claim-chart auto-generation, multi-user collaboration, custom CPC-class fine-tuning, on-prem deployment. All real FTO work eventually needs at least non-patent-literature and EPO; v0 is deliberately a US-only-patent-only proof.

6. Revenue model + price floors

Pricing model. Hybrid seat + usage. Anchor seat at $300/seat/month (well below Questel’s enterprise tier, materially above Google Patents = free) with 50 searches/seat/month included, $5 per additional search. This frames the product as “agent that does what your $1,500 manual search did, for $5 marginal cost.”

Per-query unit economics (back of envelope, v0 stack).

Component columns are monthly totals; the cost-per-query column divides their sum by volume.

| Queries/mo | Embedding (incremental) | LLM (Sonnet, ~30K input + 4K output tokens/query) | Reranker (Cohere) | Storage/compute (amortized) | Cost/query | Revenue/query @ $5 | Gross margin |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 100 | <$1 | ~$35 | ~$5 | ~$200 | ~$2.40 | $5 | ~52% |
| 1,000 | <$5 | ~$350 | ~$50 | ~$200 | ~$0.60 | $5 | ~88% |
| 10,000 | <$10 | ~$3,500 | ~$500 | ~$500 | ~$0.45 | $5 | ~91% |

LLM cost dominates and improves with model price compression (Sonnet 4.6 is already ~50% cheaper per token than Sonnet 3.5 was in 2024; trajectory continues). At 1K queries/mo the unit economics work; at 100/mo they’re tight but viable with the seat fee carrying overhead. Storage and embedding are NOT the binding cost; per-query LLM is.
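The 1,000-queries/month row reconstructs directly from its components:

```python
# Re-derive the 1,000-queries/month row from monthly component totals
# (embedding + LLM + reranker + amortized infra).
queries = 1_000
monthly_cost = 5 + 350 + 50 + 200              # USD
cost_per_query = monthly_cost / queries        # $0.605, i.e. "~$0.60"
price_per_query = 5.0
gross_margin = 1 - cost_per_query / price_per_query   # ~0.88, i.e. "~88%"
```

At this volume the amortized infra is about a third of the monthly total; the LLM line dominates and is the one that compresses as model pricing falls.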

7. Pre-validation milestones (30 / 60 / 90 day falsification gates)

8. Kill criteria (what would make founder fall OUT of love)

Be honest. Any of these surfaces during validation = wedge dies:

If any TWO of these surface in the 30-day gate, the wedge is dead and we move on. If only one surfaces and it’s #4 (cultural), we keep going but expect a slow ramp.