Bookstore for Agents - concept seed
Filed 2026-05-05 from a founder iMessage during the TDD-canon-filing thread. Not a commitment. Idea-stage only. Future-self should not romanticize this as a tech play; the bottleneck is publisher licensing, not infrastructure.
The spark
Founder, 2026-05-05 14:59 ET, while answering the procurement question on TDD canon (which books are open, which are paywalled):
“It’s sparking an idea. Agents need a bookstore. Like how Amazon started with books. So many of these books have been digitized, we just need to make them accessible to agents.”
The thesis (compact)
Agents currently access knowledge via three terrible paths:
- Training data - frozen, fuzzy, no provenance, often legally murky.
- Open web retrieval - paywall-blocked, low-fidelity scrapes, contradictory signals.
- Hand-curated context (vaults, RAG) - works, but the operator has to do the curation themselves.
Books - the densest, most-edited, highest-trust knowledge artifacts humanity ships - are mostly absent from agent context except via the murkiest of the three paths (training data, often not consented to). Domain expertise that requires book-grade input (law, medicine, finance, engineering, accounting, primary research) is consequently weaker than it should be.
The Amazon parallel is sharp: Amazon started with books because books are a homogeneous SKU with deep latent demand and low operational complexity per unit. “Books for agents” sits at the same shape: API-accessible book content, queryable per-chapter or per-passage, with provenance, citation, and royalty enforcement.
Where it sits in the L5 thesis
Per Karl Mehta’s “Commoditization of LLM Models” (filed ~/rdco-vault/06-reference/2026-05-04-karlmehta-llm-commoditization-intelligence-rails.md), durable value moves up the stack from raw inference into orchestration / evals / RAG / memory / vertical applications. Books-for-agents sits between the model layer and the vertical-app layer, as a content-access rail. Specifically:
- Below: model providers (Anthropic, OpenAI, etc.) - commoditizing.
- This layer: book content access - currently broken, no clean primitive.
- Above: vertical applications (legal AI, medical AI, financial AI) - desperate for book-grade input.
If this layer becomes a clean primitive, the vertical-app explosion above it accelerates. That’s the same dynamic AWS unlocked for software services starting in 2006.
The bottleneck (read this before romanticizing)
The hard part isn’t tech. The hard part is publisher licensing.
- Chunking + embedding + serving: commoditized. Pinecone, Weaviate, Postgres pgvector, OpenAI embeddings, Cohere Rerank - all solved. A single engineer ships v1 in 2-4 weeks.
- Payments: now commoditized too - RDCO just shipped MPP/Tempo Phase 1 (~/.claude/state/working-context.md), so per-query royalty payments are a real primitive.
- Licensing: publishers are TERRIFIED of AI cannibalization (NYT v. OpenAI, Reddit blocking AI scrapers, the OpenAI / Anthropic publisher disputes). Getting Penguin Random House, HarperCollins, Hachette, Macmillan, Simon & Schuster, Wiley, O’Reilly, Pearson, McGraw-Hill, etc. to sign a uniform per-query royalty deal is a multi-year, multi-million-dollar publisher-relations grind.
The companies that win here are publisher-relations companies wearing a tech-company costume. The closest analog is Spotify vs the music labels - Daniel Ek’s job for the first five years was deal-making, not engineering.
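The “chunking + embedding + serving” pipeline called commoditized above is small enough to sketch end-to-end. A toy version, substituting a deterministic bag-of-words vector for a real embedding model so it runs standalone; `chunk`, `embed`, and `search` are illustrative helper names, not any vendor’s API:

```python
import math

def chunk(text: str, max_words: int = 50) -> list[str]:
    # Naive passage chunker: split on blank lines, cap chunk length in words.
    out = []
    for para in text.split("\n\n"):
        words = para.split()
        for i in range(0, len(words), max_words):
            out.append(" ".join(words[i:i + max_words]))
    return out

def tokens(text: str) -> list[str]:
    return [w.strip(".,;:").lower() for w in text.split()]

def embed(text: str, vocab: list[str]) -> list[float]:
    # Stand-in for a real embedding model: L2-normalized bag-of-words.
    ws = tokens(text)
    vec = [float(ws.count(t)) for t in vocab]
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]

def search(query: str, index: list[tuple[str, list[float]]],
           vocab: list[str], k: int = 3) -> list[str]:
    # Dot product over already-normalized vectors = cosine similarity.
    qv = embed(query, vocab)
    scored = sorted(index, key=lambda cv: -sum(a * b for a, b in zip(qv, cv[1])))
    return [c for c, _ in scored[:k]]

corpus = ("Phage therapy treats bacterial infection with viruses.\n\n"
          "Steam engines convert heat into mechanical work.")
passages = chunk(corpus)
vocab = sorted({t for p in passages for t in tokens(p)})
index = [(p, embed(p, vocab)) for p in passages]
top = search("treatment of bacterial infection", index, vocab, k=1)
print(top[0])
```

Swapping `embed` for a hosted embedding model and the in-memory index for pgvector is essentially the whole v1 delta, which is why the note treats this layer as solved.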
Existing players (rough scan)
- O’Reilly Learning Platform - already monetizes book access, but for human readers via subscription, not agent API. Closest legitimate analog. Could pivot.
- Project Gutenberg - fully PD, no royalty issue, but only old/expired-copyright content. Useful for some domains (classical lit, public-domain technical) but not modern.
- Library Genesis / Z-Library - pirate. Solves the access problem at the cost of being illegal. Agents that scrape these put their builders at legal risk.
- NotebookLM (Google) - book-on-book reading agent, but bring-your-own-book; doesn’t solve the access problem.
- ScholarAI, Cassidy AI, Helicone - RAG infrastructure, not content licensing.
- Anthropic, OpenAI - likely have their own publisher-licensing efforts but as private deals for training, not as public-API book access.
The market between O’Reilly (subscription-for-humans) and Project Gutenberg (PD-only) is mostly empty. That’s the gap.
Targeting-system filter (per feedback_targeting_system_prioritization_filter)
Before this becomes a real bet, apply the four-layer filter:
| Layer | Status |
|---|---|
| Targeting | Identifiable: vertical-AI startups in legal / medical / financial / engineering. Specifically the ones that are stuck because their agents don’t have book-grade context. Need 5-10 customer interviews to validate the pain is acute. |
| Instrumentation | Clear metrics: API queries per month, content licensing revenue, gross margin per query, publisher LTV, customer LTV. Standard SaaS shape. |
| Tools | Build cost low (chunking + embedding + serve = commoditized). License acquisition cost is the binding constraint. |
| Feedback loop | Hard to test pre-launch: would publishers sign a contract before seeing customer demand? Would customers commit before seeing licensed content? Classic chicken-egg. |
Verdict: idea passes the smell test, but bottleneck (publisher licensing) is so dominant that this is more a publisher-relations bet than a tech bet. Founder is currently optimized for tech-bet shape (lean dev cycles, small team, automation-leveraged). Publisher-relations bet shape is opposite: BD-heavy, slow, capital-intensive, lawyer-dense.
Open questions for future analysis
- Could RDCO start with public-domain only (Project Gutenberg + JSTOR open content + arXiv + open textbooks) and prove the agent-API shape before tackling publisher licensing? That gives a real product and customer validation without the licensing bottleneck.
- Is there a wedge with a single publisher (e.g., O’Reilly, who already monetizes book access and is technical-first) where a partnership lets RDCO prove the model on one catalog before going broad?
- What’s the per-query royalty rate that publishers would accept? A back-of-envelope: at $0.001/query and 10M queries/month that’s $10K/month royalty per publisher - too small to move the needle for a Big Five publisher but maybe interesting for niche / academic / professional publishers.
- Does MPP/Tempo make this easier specifically? Per-query micropayments at fractions-of-a-cent that aggregate per publisher per month is exactly what Tempo enables. The publisher dashboard could be “here’s $X this month from Y queries across Z titles.”
- Who owns the agent-side UX? Is this a B2B API for agent builders (Anthropic, OpenAI, agent-platform companies) or B2C for end users running their own agents? Different go-to-market entirely.
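The Tempo-style aggregation described above (fractions-of-a-cent per query, rolled up per publisher per month) can be sketched directly; the event log, publisher names, and royalty rates below are invented for illustration:

```python
from collections import defaultdict

# Hypothetical per-query event log: (publisher, title, royalty_usd).
events = [
    ("acme-press", "Intro to Patents", 0.001),
    ("acme-press", "Intro to Patents", 0.001),
    ("acme-press", "Case Law Digest", 0.001),
    ("gutenberg-modern", "Annotated Mill", 0.002),
]

def monthly_statements(events):
    # Roll micro-royalties up into one dashboard line per publisher.
    totals = defaultdict(lambda: {"usd": 0.0, "queries": 0, "titles": set()})
    for publisher, title, royalty in events:
        s = totals[publisher]
        s["usd"] += royalty
        s["queries"] += 1
        s["titles"].add(title)
    return {
        pub: f"${s['usd']:.3f} this month from {s['queries']} queries "
             f"across {len(s['titles'])} titles"
        for pub, s in totals.items()
    }

print(monthly_statements(events)["acme-press"])
# → $0.003 this month from 3 queries across 2 titles
```

The same arithmetic drives the back-of-envelope above: $0.001/query at 10M queries/month aggregates to $10K/month for a publisher.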
The PD-only wedge (added 2026-05-05 PM after founder iteration)
Founder pushed back on the publisher-relations bottleneck framing with two questions:
“What if we started with books that are free of copyright? Public domain? One of the AI bets is that there are latent discoveries to be made. Dots to be connected from prior research and writing that no one was able to see all at once. Surely some of this stuff is hanging out there in the public domain. It may not yet be digitized, which would create a different bottleneck than publisher relations - or does digitizing it somehow reset the publishers copyright?”
The PD-only wedge is real and substantially changes the bet shape. Two answers + a recap of the corpus + a wedge-product candidate.
Does digitizing reset copyright? NO.
Settled US law (and EU as of 2019):
- Bridgeman Art Library v. Corel Corp. (1999) - “slavish” digital reproductions of public-domain 2D works do NOT create new copyright. The reproduction itself must add original creative expression. A scan of a PD book is NOT copyrightable; the underlying text stays PD.
- EU Directive 2019/790 Article 14 - codified the same rule across the EU. Digital reproductions of PD works remain PD.
- Authors Guild v. HathiTrust (2014) - separate case but relevant: mass digitization of library books for indexing/search is fair use even when the underlying books are in copyright. Useful precedent for any “scan to index” workflow.
Caveat: if a modern edition adds editorial work (introductions, footnotes, scholarly annotations, modern translations of foreign-language PD works), the modern additions ARE copyrighted. RDCO must either use the original PD source directly, do its own digitization, or carefully strip modern editorial additions when ingesting from copyrighted-edition scans.
What’s actually in the PD corpus (US, 2026)
The “rolling 95-year wall” puts everything published in 1930 or earlier into the US public domain as of January 1, 2026. Plus categorical exclusions and open-access content:
| Corpus | Approximate scale | Quality | Status |
|---|---|---|---|
| Project Gutenberg | ~70,000 books | Clean, proofed text | Already digitized |
| HathiTrust Digital Library | ~17M volumes total, several million PD | Variable OCR, library-grade scan | Partly digitized |
| Internet Archive Open Library | Millions of scanned books | Variable, mixed PD/in-copyright | Partly digitized |
| USPTO Patent Bulk Data | ~12M US patents (1790-present) | Mixed OCR for old, structured XML for new | Fully digitized |
| CourtListener / Caselaw Access Project | Every US federal court opinion ever | Clean, structured | Fully digitized |
| PubMed Central Open Access | ~7M biomedical papers | Clean | Fully digitized |
| arXiv | ~2.4M open preprints (mostly STEM) | Clean | Fully digitized |
| JSTOR Early Journal Content | ~500K pre-1923 journal articles | Clean | Fully digitized |
| Library of Congress digital | Millions of historical documents | Variable | Partly digitized |
| NASA Technical Reports Server | ~1M+ reports | Clean | Fully digitized |
| US gov publications (FDsys / GovInfo) | Congressional Record, Federal Register, GAO reports | Clean | Fully digitized |
What PD does NOT cover: post-1929 fiction, modern textbooks, modern bestsellers, most current trade non-fiction, modern reference works (medical, legal, engineering, etc.). PD-only wedge does NOT compete with O’Reilly Learning Platform on technical books, etc. It competes in different verticals.
The latent-discoveries angle (founder’s thesis, sharpened)
Founder’s framing: “there are latent discoveries to be made. Dots to be connected from prior research and writing that no one was able to see all at once.”
This is genuinely sharp. The PD corpus has never been agent-queryable as a unified surface. Specifically:
- Pre-1929 medicine + PubMed open-access = a “forgotten medicine” research agent that surfaces 19th-century clinical observations connected to modern biomedical research. There are documented cases of historical cures rediscovered (e.g., bacteriophage therapy from 1917 USSR research, suddenly relevant for antibiotic resistance). An agent that systematically connects pre-1929 clinical literature to modern PubMed papers could surface real research leads.
- USPTO patent prior-art search = every patent ever filed in the US is searchable, but search-by-keyword is brutally weak for prior art (which is why patent attorneys charge $5K+/search). Agent-native semantic search over the full corpus is genuinely novel and immediately monetizable.
- Federal case law synthesis = Westlaw and Lexis charge thousands per seat for human-readable case law. Agent-native legal research over CourtListener + Caselaw Access Project is competitive (the underlying data is PD, the value-add is the agent layer).
- Pre-modern philosophy + economics synthesis = Smith, Marx, Mill, Marshall, Bergson, Husserl, etc. are all PD. An agent that can answer “what would a Mill / Marshall synthesis say about modern platform economics?” is doing something genuinely new.
- Cross-disciplinary connection mining = Plato + Newton + Pavlov + Pre-1929 anthropology + USPTO patents + arXiv preprints, all in one queryable index. The cross-disciplinary connection an agent can draw across this corpus is THE novel artifact - a human researcher can’t read 130 years of literature across 10 disciplines.
New bottleneck under PD-only: corpus assembly + OCR + metadata, not licensing
The work shifts from BD/legal to engineering and curation:
- OCR quality is uneven. HathiTrust and Internet Archive scans range from clean to barely-readable. Project Gutenberg is already proofed. Modern OCR (Tesseract 5, GCV, Azure OCR, OpenAI Vision) on old scans gets 90-99% accuracy depending on print quality. Cleanup is real labor but tractable.
- Metadata cleanup. Bibliographic records are inconsistent across sources (HathiTrust uses MARC; Internet Archive uses Dublin Core; Gutenberg has its own schema). Unifying these into one queryable graph is a corpus-engineering project.
- Embedding + serving infrastructure. Commoditized (Pinecone, Weaviate, pgvector, OpenAI embeddings). Standard.
- Citation + provenance. This is where agent-native bookstore differs from generic RAG: every passage returned must carry full bibliographic provenance (author, title, publication date, edition, page number, paragraph). The agent caller needs to be able to verify and cite. This is solvable but requires careful schema.
- Adversarial pollution. PD corpus contains a lot of garbage (advertising tracts, vanity-press 19th-century books, partisan hackwork). Quality filtering at ingestion is real work.
Tractable for a small team. No publisher BD. No multi-million-dollar licensing rounds. The capital and timeline shift from “BD-heavy 3-year ramp” to “engineering-heavy 6-month MVP.”
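The citation + provenance requirement above can be pinned down as a schema. A minimal sketch; the field names and `cite()` format are illustrative, not a committed spec:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class PassageProvenance:
    # Full bibliographic trail every returned passage must carry.
    author: str
    title: str
    publication_date: str   # year or ISO date of the edition used
    edition: str
    page: int
    paragraph: int
    source_archive: str     # e.g. "gutenberg", "hathitrust", "loc"

    def cite(self) -> str:
        # Human- and agent-verifiable citation string.
        return (f"{self.author}, {self.title} ({self.edition}, "
                f"{self.publication_date}), p. {self.page} "
                f"para. {self.paragraph} [{self.source_archive}]")

p = PassageProvenance("J. S. Mill", "Principles of Political Economy",
                      "1848", "1st ed.", 312, 2, "gutenberg")
print(p.cite())
```

Attaching this record to every chunk at ingestion time is what separates an agent-native bookstore from generic RAG: the caller can always verify and cite.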
Wedge product candidates (pick ONE to validate)
Rank-ordered by likely tractability + revenue:
1. USPTO patent prior-art search agent. Corpus is fully digitized + structured. Customer pain is severe (patent attorneys, R&D teams, IP litigators). Existing tools (Google Patents, USPTO PAIR, paid services like PatentSight) are weak on semantic search. Agent-native query against the full corpus with citation is a clear product. B2B, $X00-X,000/month/seat plausible. Direct revenue path.
2. Legal research agent over PD federal case law. CourtListener + Caselaw Access Project. Customers: solo + small-firm attorneys priced out of Westlaw / Lexis. Lower ARPU than enterprise legal tech, but a TAM nobody serves well. Risk: legal liability for incorrect citations.
3. Forgotten-medicine research agent for pharma R&D. Pre-1929 clinical literature + PubMed open access. Customers: pharma R&D teams, NIH-funded labs, biotech founders. Niche but high-willingness-to-pay. Risk: hard to validate value without 6-month customer trials.
4. Cross-disciplinary synthesis agent. Most ambitious; least focused. The “let an agent connect dots across all PD content” framing. Risk: undefined customer; hard to find the wedge.
Recommendation if pursuing: wedge product 1 (USPTO patent prior-art search). Smallest, sharpest, fastest to validate, clear customer pain, fully-digitized corpus, no metadata cleanup nightmare, immediate revenue path, also serves as proof-of-concept for the agent-citation infrastructure that bigger wedges need.
Live datapoint: the owner-purchases-then-shares pattern (added 2026-05-05 11:30 ET)
While we were discussing this idea, the founder bought Beck’s “TDD by Example” PDF on his own Pearson account, then dropped it into Discord for Ray to ingest. Ray:
- Downloaded the PDF
- Stored it in ~/Documents/library/books/ (out of the git-tracked vault tree to avoid copyright leakage)
- Created ~/rdco-vault/04-tooling/personal-library-index.md as a vault-side index pointing to the non-tracked PDF path
- Dispatched a sub-agent to upgrade the prior vault note from reconstructed-from-web to source_fidelity: primary-source, citing Beck directly and using exact terminology (Fake It, Triangulate, etc.)
This is the consumer-side bookstore-for-agents pattern in microcosm. Workflow:
- User buys book through normal commercial channel (publisher direct, Amazon Kindle, etc.).
- User authorizes agent to read it.
- Agent reads, indexes, summarizes.
- No licensing friction at the agent-access step because owner-rights cover personal-agent reading.
Implication for the wedge analysis: the bookstore-for-agents v0 might not be a publisher-relations bet OR a PD-corpus bet at all. It might be a personal-library-as-agent-context product:
- User onboards by importing their owned ebooks (Kindle library export, O’Reilly Learning Platform shelf, Pearson’s Pearson+, Apple Books, etc.).
- Agent gets a queryable index over the user’s owned corpus.
- No licensing problem because the user already has a license.
- Revenue model: subscription for the indexing + serving infrastructure (think Readwise / Notion AI for personal libraries).
- Could expand into PD corpus + (eventually, much later) publisher-licensed catalog for content the user doesn’t already own.
This shifts the shape AGAIN: from publisher-relations bet (slow, capital-intensive) to PD-corpus bet (engineering-heavy) to personal-library-RAG bet (consumer SaaS, lean, immediately validatable).
Existing players in this exact space: Readwise (highlights + summaries), Calibre (DIY ebook library management), Notion AI (notes + RAG over your stuff), but nobody has built the agent-API-over-personal-library primitive cleanly.
This is the third candidate wedge alongside USPTO patent prior-art and federal case law. Worth considering.
Recommendation (revised 2026-05-05 PM)
The PD-only framing turns this from a publisher-relations bet (BD-heavy, slow, capital-intensive) into a corpus-engineering bet (engineering-heavy, weeks-not-years, lean-team). That fits RDCO’s operating model.
Status remains idea-only but with a clear pre-validation path that costs near-zero Ray cycles:
1. 5 customer-discovery calls with patent attorneys / IP litigators. Single question: “what would you pay for an agent that can semantically search the full USPTO corpus for prior art with citation, given existing tools?” Answers the targeting layer.
2. Corpus reachability test (1-2 hours of Ray work): confirm USPTO bulk data download endpoint, file sizes, format, license terms. Confirm CourtListener API is open and rate-tolerable.
3. MVP scope definition based on (1) and (2). Then decide.
Founder controls (1). Ray can do (2) and (3) in idle cycles without full bet activation.
Park decision: not a current Ray-priority bet, but no longer parked indefinitely - moved to “pre-validation phase, founder-gated by 5 discovery calls.”
Wedge shape-test: USPTO patent prior-art search agent (added 2026-05-05 PM)
This section pressure-tests wedge candidate #1 (“USPTO patent prior-art search agent”) with concrete evidence from the live USPTO data infrastructure, the prior-art search market, and a back-of-envelope unit-economics check. Goal: give the founder enough signal to either start 5 customer-discovery calls or fall out of love.
1. Corpus reachability (verified via live fetch)
Bulk endpoint. USPTO retired the legacy bulkdata.uspto.gov host (now ECONNREFUSED on direct fetch) and migrated everything to the Open Data Portal at https://data.uspto.gov/. The portal launched February 2025 and the legacy beta API hub at developer.uspto.gov is scheduled for decommissioning on May 29, 2026.
Datasets confirmed available on the ODP (each is a separate dataset path under https://data.uspto.gov/bulkdata/datasets/):
| Dataset slug | Contents | Format |
|---|---|---|
| PTGRXML | Patent Grant Full-Text Data, no images | XML |
| appxml | Patent Application Full-Text Data, no images | XML |
| ptblxml | Patent Grant Bibliographic / front-page only | XML |
| pasdl | Patent Assignment XML, daily ownership transfers | XML |
| pedsxml | Patent Examination Data System (PAIR successor) | XML |
Source: USPTO ODP Bulk Data Directory at https://data.uspto.gov/bulkdata (search-result snippets; the live page is JS-rendered and returns empty content via WebFetch, so direct URL inspection of dataset metadata could not be verified).
Volume. US utility patent #12,000,000 issued June 4, 2024 (Sandberg Phoenix tracker at https://sandbergphoenix.com/the-u-s-patent-office-issues-patent-number-12000000/). At ~370K grants/year (USPTO Patents Dashboard, https://www.uspto.gov/dashboard/patents/), the corpus as of mid-2026 is roughly 12.6M granted utility patents plus design (~1.2M) and plant (~40K). Add published applications post-2001 and you’re at ~17-18M documents total. Average patent body is ~10-30KB of text plus drawings; full-text-only XML for the corpus is in the 150-300GB compressed range. Rough; needs verification by actually pulling a year’s archive.
License. USPTO Terms of Use page (https://www.uspto.gov/terms-use-uspto-websites): “most government-produced materials” on USPTO are public domain, “freely distributed and copied,” with requested acknowledgment. Caveat the page itself flags: a small fraction of patents embed third-party copyrighted material (drawings, photographs, embedded trademarks especially in design patents). For a prior-art search product this is a non-issue; we serve text + bibliographic data + pointers, not images.
Cleaner third-party mirror. The legacy patentsview.org redirects to https://data.uspto.gov/support/transition-guide/patentsview (the project was absorbed into ODP). The PatentsView Search API survives at https://search.patentsview.org/api/v1 with documented endpoints at https://search.patentsview.org/docs/docs/Search%20API/SearchAPIReference/. Auth is X-Api-Key. Rate limit: 45 requests/minute per key with 429 + Retry-After on overage (PatentsView forum: https://patentsview.org/forum/7/topic/781). The API is good for incremental updates and the disambiguated-inventor / assignee tables, but for full-text agent retrieval you want the bulk XML, not the API.
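Given the documented 45 requests/minute quota, any incremental-update client needs a client-side throttle. A generic sliding-window limiter sketch (not an official PatentsView SDK), with an injectable clock so the logic is testable without waiting on real time:

```python
import time
from collections import deque

class MinuteRateLimiter:
    """Client-side throttle for APIs with a fixed per-minute quota,
    e.g. PatentsView's documented 45 requests/minute per key."""

    def __init__(self, max_per_minute: int = 45, clock=time.monotonic):
        self.max = max_per_minute
        self.clock = clock          # injectable for testing
        self.sent = deque()         # timestamps of requests in last 60s

    def wait_time(self) -> float:
        # Seconds to sleep before the next request is allowed.
        now = self.clock()
        while self.sent and now - self.sent[0] >= 60.0:
            self.sent.popleft()     # evict requests outside the window
        if len(self.sent) < self.max:
            return 0.0
        return 60.0 - (now - self.sent[0])

    def record(self) -> None:
        self.sent.append(self.clock())

# Fake-clock demo: burst 45 requests at t=0; the 46th must wait.
t = [0.0]
rl = MinuteRateLimiter(45, clock=lambda: t[0])
for _ in range(45):
    assert rl.wait_time() == 0.0
    rl.record()
print(rl.wait_time())  # 60.0 - a full window before request 46
```

Server-side 429 + Retry-After handling should still back this up; the limiter only models the local view of the quota.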
Cost to host and serve. Compressed XML ~200GB; uncompressed text ~600GB; chunked + embedded at 1024-dim float16 (~2KB per chunk) with roughly 8 chunks per document across ~18M documents gives ~150M chunks at ~300GB raw vector storage. Production setup on commodity infra (S3 + a managed vector DB or self-hosted pgvector + Postgres for metadata): storage $50-150/month, embedding generation one-time $5-15K at OpenAI text-embedding-3-large rates ($0.13 per 1M tokens, ~40-100B tokens estimated for the full corpus). Annual incremental embedding for new grants: <$500/yr. Compute for serving is the larger line item but scales with traffic, not corpus size.
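The sizing above is reproducible as arithmetic; the document count, chunks-per-document, and token total below are the note’s own rough assumptions, not measured values:

```python
# Back-of-envelope corpus sizing and one-time embedding cost.
docs = 18_000_000                 # grants + published applications, upper band
chunks_per_doc = 8                # assumed average
dim, bytes_per_fp16 = 1024, 2     # 1024-dim float16 vectors

chunks = docs * chunks_per_doc
vector_bytes = chunks * dim * bytes_per_fp16
print(f"{chunks / 1e6:.0f}M chunks, {vector_bytes / 1e9:.0f} GB raw vectors")

corpus_tokens = 60e9              # assumed mid-range full-text token count
embed_cost = corpus_tokens / 1e6 * 0.13   # $0.13 per 1M tokens
print(f"one-time embedding ~ ${embed_cost:,.0f}")
```

The $5-15K band in the text corresponds to the token estimate ranging roughly 40-100B; storage stays a rounding error either way.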
2. Customer profile + ICP
Target customer for v0. Patent attorneys at small-and-mid-sized IP boutiques (3-30 attorneys), in-house IP counsel at growth-stage startups filing 5-50 patents/year, and freedom-to-operate analysts at biotech/medtech companies pre-launch. Specifically NOT BigLaw IP groups (locked into Westlaw/PatSnap multi-year contracts) and NOT solo inventors (no budget, single-search frequency).
Current price-per-search market anchor. A professional prior-art search runs $1,500-$4,000 per search, taking 10-15 hours of attorney/searcher time, per multiple industry guides (https://emanus.com/how-much-does-a-patent-cost/, https://www.upcounsel.com/patent-search-cost). Software / AI / medical tech sit at the high end. This is the human-labor anchor, not the tooling anchor.
Tooling pricing anchor. Questel Orbit Intelligence runs $15K-$500K/year depending on org size (Goodfirms vendor data at https://www.goodfirms.co/software/orbit-intelligence). LexisNexis PatentAdvisor and Innography do not publish per-seat pricing publicly; per industry conversation, BigLaw enterprise contracts are typically $10K-$25K/seat/year. PatSnap, Clarivate Derwent, and Anaqua sit in similar bands. Google Patents is free but covers 120M+ global publications with weak semantic search and no enterprise SLA (https://patents.google.com/).
Pain that an agent-native solution addresses. Existing tools are designed for human searchers running keyword + classification queries iteratively over hours. Three specific pains an agent surface fixes:
- Semantic recall. Keyword + CPC class search misses prior art that uses different terminology. Agent embedding-based retrieval surfaces conceptually adjacent patents the searcher would not have queried. This is the strongest single differentiator.
- Citation-grade output. Existing tools return lists; the attorney still does the synthesis. An agent can produce a draft prior-art memo (claim-by-claim mapping with citations, anticipation/obviousness flags) in minutes, halving the 10-15 hour search.
- Cost compression. A $1,500-$4,000 manual search becomes a $50-$200 agent-assisted search with the attorney spending 1-2 hours reviewing instead of 10-15 hours searching. That’s a 5-10x labor compression at the unit level.
Realistic ICP. ~14,000 registered US patent attorneys (USPTO Roster) + ~5,000 patent agents. Address roughly the 30% in small/mid boutiques and growth-stage in-house = ~5,500 seats addressable in year 1-2. ARPU range $200-$600/seat/month for an agent search tool ($2.4K-$7.2K/year). Gross margin target 75-85% (storage + embedding amortized; per-query LLM cost is the variable line). At $3K ARPU, 10% of the ~5,500 addressable seats = $1.7M ARR; 50% = $8.3M ARR. Real numbers, not vapor.
3. Customer journey (concrete scenario walkthrough)
Scenario: freedom-to-operate review for a novel solid-state lithium-metal battery cathode chemistry, claim language drafted but not yet filed.
1. The IP counsel pastes the draft independent claim into the agent: “We claim a cathode comprising lithium nickel manganese cobalt oxide doped with at least 0.5 mol% zirconium, with a particle-surface coating of lithium aluminum titanate, configured for cycling stability above 4.5V vs Li/Li+.”
2. The agent decomposes the claim into ~6 conceptual elements (NMC cathode, Zr doping, LATP coating, high-voltage operation, particle-level coating geometry, stability claim) and runs parallel semantic + CPC-class queries against the indexed USPTO corpus, returning top-200 candidates per element with relevance scores.
3. The agent reranks the union (~600-800 patents) for claim-element coverage and surfaces the top 30 with element-by-element overlap heat maps.
4. The agent drafts a structured FTO memo: per claim element, top 5 patents that read on it with citation (patent number, claim, column:line), an anticipation flag where a single patent reads on >70% of elements, and a synthesis paragraph.
5. Deliverable: a 6-10 page markdown / Word doc with hyperlinked citations to each cited patent on USPTO ODP, ready for attorney review and filing decision. The attorney spends 60-90 minutes reviewing instead of 10-15 hours searching.
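Steps 3 and 4 of the walkthrough reduce to a claim-element coverage computation. A sketch with invented patent numbers and element sets; the 70% anticipation threshold is taken from the walkthrough itself:

```python
def coverage_report(candidates: dict[str, set[str]], elements: list[str],
                    anticipation_threshold: float = 0.7) -> list[dict]:
    """Rank candidate patents by how many claim elements they read on;
    flag anticipation when one patent covers > threshold of elements."""
    rows = []
    for patent, covered in candidates.items():
        frac = len(covered & set(elements)) / len(elements)
        rows.append({
            "patent": patent,
            "coverage": frac,
            "anticipation_flag": frac > anticipation_threshold,
        })
    return sorted(rows, key=lambda r: -r["coverage"])

elements = ["NMC cathode", "Zr doping", "LATP coating",
            "high-voltage operation", "particle coating geometry",
            "stability claim"]
# Hypothetical retrieval results: patent -> claim elements it reads on.
candidates = {
    "US-11,111,111": {"NMC cathode", "Zr doping", "LATP coating",
                      "high-voltage operation", "stability claim"},
    "US-10,222,222": {"NMC cathode", "high-voltage operation"},
}
report = coverage_report(candidates, elements)
print(report[0]["patent"], report[0]["anticipation_flag"])
```

In the real pipeline the per-element sets would come from the reranked semantic + CPC hits, not hand-labeled dictionaries.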
4. Technical architecture sketch (v0)
- Ingestion. Stream PTGRXML weekly from ODP, parse XML (lxml), extract claims + abstract + description + citation graph + CPC labels. Backfill is one-time historical pull from 1976 forward (when full-text electronic data starts); pre-1976 is OCR’d images of lower priority for v0.
- Chunking. Semantic-paragraph chunks within description sections, plus per-claim atomic chunks (claims are the legally-load-bearing text). Patent-aware splitter; not vanilla 512-token sliding window.
- Embedding. OpenAI text-embedding-3-large (3072-dim) for v0, evaluate Cohere embed-english-v3 and Voyage voyage-3 against an expert-graded eval set in week 3-4. Embedding cost is one-time-ish.
- Vector store. pgvector on Postgres for v0 (200M chunks fits a single 1TB instance with HNSW indexing), evaluate migration to Turbopuffer or Pinecone if QPS exceeds 100. pgvector keeps the relational metadata join cheap.
- Reranker. Cohere rerank-v3 or Voyage rerank-2 over top-200 ANN hits. Rerank is the key accuracy lever; ANN alone is recall-good, precision-mediocre.
- Citation/provenance schema. Every retrieved chunk carries {patent_number, claim_number_or_section, column, line_start, line_end, cpc_class, grant_date}. The agent’s output template forces every assertion to cite at least one chunk-id; an output post-processor verifies citation existence before returning.
- Agent orchestration. Anthropic Claude Sonnet 4.6 (or successor) for the claim-decomposition + memo-drafting steps, with a structured tool surface: decompose_claim, search_semantic, search_cpc, rerank, draft_memo. Stateless per-search, no need for durable memory in v0.
- Citation accuracy validator. Separate small model (Haiku-class) checks every citation in the output against the chunk store. If a citation can’t be verified, the agent retries or flags. This is the single feature that earns attorney trust.
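The citation-existence check can be sketched as a post-processor. The `[[cite:...]]` inline marker and the chunk-id format below are assumptions for illustration, not a fixed spec, and the LLM drafting/retry step is omitted:

```python
import re

# Hypothetical chunk store: chunk-id -> provenance record.
chunk_store = {
    "US-9876543:claim-1": {"patent_number": "US-9876543", "claim": 1,
                           "column": 4, "line_start": 12, "line_end": 30},
}

CITE = re.compile(r"\[\[cite:([^\]]+)\]\]")   # assumed inline citation marker

def validate_citations(memo: str, store: dict) -> tuple[bool, list[str]]:
    """Return (ok, unverifiable chunk-ids). In production the agent would
    retry or flag instead of shipping a memo with missing cites."""
    cited = CITE.findall(memo)
    missing = [c for c in cited if c not in store]
    return (len(cited) > 0 and not missing, missing)

good = "Element 2 is anticipated [[cite:US-9876543:claim-1]]."
bad = "Element 3 is anticipated [[cite:US-0000000:claim-9]]."
print(validate_citations(good, chunk_store))  # (True, [])
print(validate_citations(bad, chunk_store))   # (False, ['US-0000000:claim-9'])
```

This existence check is the cheap deterministic half; the Haiku-class validator described above would additionally verify that the cited text actually supports the assertion.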
5. MVP scope cut + 4-6 week build estimate
MVP (4-6 weeks, 1-2 engineers). Ingest USPTO PTGRXML 1976-present (utility only, no design/plant), embed and index in pgvector, ship a CLI + thin web UI that takes a draft claim and returns a 30-patent prior-art shortlist plus a 1-page synthesis with citations. Hosted on a single Hetzner / Fly.io instance + S3 for the XML archive. Charge $500/mo flat for unlimited searches in the closed-beta period.
Explicitly OUT of scope for MVP: pre-1976 OCR’d patents, design/plant patents, foreign patents (EPO/JP/CN/KR/WIPO), non-patent-literature (papers, products, manuals) which is what real FTO requires, file-history retrieval, examiner-data analytics, claim-chart auto-generation, multi-user collaboration, custom CPC-class fine-tuning, on-prem deployment. All real FTO work eventually needs at least non-patent-literature and EPO; v0 is deliberately a US-only-patent-only proof.
6. Revenue model + price floors
Pricing model. Hybrid seat + usage. Anchor seat at $300/seat/month (well below Questel’s enterprise tier, materially above Google Patents = free) with 50 searches/seat/month included, $5 per additional search. This frames the product as “agent that does what your $1,500 manual search did, for $5 marginal cost.”
Per-query unit economics (back of envelope, v0 stack).
| Volume (queries/mo) | Embedding /mo (incremental) | LLM /mo (Sonnet, ~30K input + 4K output tokens/query) | Reranker /mo (Cohere) | Storage/compute /mo (amortized) | Total cost/query | Revenue/query @ $5 | Gross margin |
|---|---|---|---|---|---|---|---|
| 100 | <$1 | ~$35 | ~$5 | ~$200 | ~$2.40 | $5 | ~52% |
| 1,000 | <$5 | ~$350 | ~$50 | ~$200 | ~$0.60 | $5 | ~88% |
| 10,000 | <$10 | ~$3,500 | ~$500 | ~$500 | ~$0.45 | $5 | ~91% |
LLM cost dominates and improves with model price compression (Sonnet 4.6 is already ~50% cheaper per token than Sonnet 3.5 was in 2024; trajectory continues). At 1K queries/mo the unit economics work; at 100/mo they’re tight but viable with the seat fee carrying overhead. Storage and embedding are NOT the binding cost; per-query LLM is.
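The table’s per-query math, as a function. The default cost inputs are the note’s own estimates; the fixed-infra line is a parameter because it grows with volume:

```python
def per_query_economics(queries_per_month: int,
                        llm_per_query: float = 0.35,
                        rerank_per_query: float = 0.05,
                        fixed_infra_per_month: float = 200.0,
                        price: float = 5.0) -> tuple[float, float]:
    # Variable LLM + reranker cost, plus amortized fixed infra per query.
    cost = (llm_per_query + rerank_per_query
            + fixed_infra_per_month / queries_per_month)
    margin = (price - cost) / price
    return round(cost, 2), round(margin, 2)

print(per_query_economics(100))     # tight at low volume
print(per_query_economics(1_000))   # economics work
print(per_query_economics(10_000, fixed_infra_per_month=500.0))
```

The function makes the structural point explicit: per-query LLM spend is the floor that only model price compression moves, while fixed infra amortizes away with volume.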
7. Pre-validation milestones (30 / 60 / 90 day falsification gates)
- 30-day gate. (a) 5 customer-discovery calls with patent attorneys / IP counsel. Pass = at least 3 say “I would pay $300/mo for this if accuracy is good.” Kill = 4+ say existing tools are fine or that they don’t trust agent output for legal work. (b) Corpus access verified: download 1 year of PTGRXML, parse it cleanly, confirm citation schema is reconstructible. (c) Pricing of nearest enterprise comp (PatSnap, PatentAdvisor) confirmed via 2 sales calls.
- 60-day gate. (a) MVP indexes 1976-present utility patents, returns reranked shortlists in <30s. (b) 5 real prior-art queries from beta users run end-to-end, expert-graded against a baseline manual search. Pass = agent’s top-30 list contains 70%+ of the patents the human searcher found, AND surfaces 1+ patent the human missed in 3 of the 5 cases. Kill = recall <50% or zero novel hits across the eval. (c) ICP willingness-to-pay confirmed: 3 LOIs at $300/mo or higher.
- 90-day gate. (a) 1 paying beta customer, $300+/mo. (b) MVP usable end-to-end including draft FTO memo generation. (c) Citation accuracy: 95%+ of agent-output citations verifiably point to real text in the cited patent (validated by the citation-validator + spot check by an outside reviewer). Kill = citation accuracy below 90% (agent outputs hallucinated cites are unsalable to attorneys at any price).
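The 60-day recall gate can be computed mechanically once each eval query has a human-searcher baseline; the patent IDs below are placeholders:

```python
def grade_search(agent_top30: set[str], human_found: set[str]):
    """60-day-gate metrics: recall of the human searcher's hits,
    plus any patents the agent surfaced that the human missed."""
    recall = len(agent_top30 & human_found) / len(human_found)
    novel = agent_top30 - human_found
    return recall, novel

agent = {"US-1", "US-2", "US-3", "US-9"}
human = {"US-1", "US-2", "US-3", "US-4"}
recall, novel = grade_search(agent, human)
print(f"recall={recall:.0%}, novel hits={sorted(novel)}")
```

Here recall is 75% (passes the 70% bar) and US-9 would count toward the “1+ patent the human missed” criterion.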
8. Kill criteria (what would make founder fall OUT of love)
Be honest. Any of these surfaces during validation = wedge dies:
1. “Existing tools are fine.” If the 5 discovery calls reveal that PatSnap + Google Patents + a junior associate already covers 90% of the workflow at acceptable cost, switching is too expensive. Probability we hit this: medium-low. The cost-compression argument ($1,500 manual to $50 agent) is structural, but workflow inertia is real.
2. Citation hallucination floor. If we can’t get citation accuracy above 95% on expert-graded queries, attorneys will not trust the output for any filing-relevant decision. Patent prosecution is a legal liability surface; one hallucinated cite in a filing = malpractice. Probability we hit this: medium. Achievable with the validator + reranker combo, but not free.
3. Unit economics break. If LLM costs don’t continue compressing AND average query complexity is higher than estimated (60K input tokens not 30K), per-query cost climbs to $1.50+ and the $5/query floor erodes margin to <40%. Probability: low-medium. LLM cost trajectory remains favorable, but a 2x miss on token count would hurt.
4. Cultural / risk-aversion barrier. Patent attorneys are unusually risk-averse and slow adopters. Even if the tech works, getting to 100 paying seats might take 18 months of selling, longer than a lean RDCO bet should take. Probability: medium-high. This is the most likely soft-kill; the wedge works technically but commercially crawls.
5. Hidden license restrictions. If USPTO’s “most” public-domain caveat (drawings, embedded trademarks, third-party-licensed elements) turns out to materially restrict resale of derived analysis, legal review eats months. Probability: low. The text of patents is unambiguously PD as government works, and we serve text + pointers, not images.
6. Foreign-patent dependency. Real FTO requires EPO + JP + CN + KR. If discovery calls reveal that US-only is unsalable for the v0 ICP (even at $300/mo), we have to either expand corpus (cost + complexity) or pivot ICP to US-first markets (US-startup pre-PCT-filing FTO, which is narrower). Probability: medium. The 4-6 week MVP scope deliberately punts this; first 5 calls must validate that US-only has standalone value.
If any TWO of these surface in the 30-day gate, the wedge is dead and we move on. If only one surfaces and it’s #4 (cultural), we keep going but expect a slow ramp.
Related
- ../../06-reference/2026-05-04-karlmehta-llm-commoditization-intelligence-rails - the orchestration-layer thesis this slots into
- ../../06-reference/2026-05-04-amazon-supply-chain-services-launch - adjacent: Amazon productizing internal infrastructure as a service
- ../../06-reference/2026-05-05-stratechery-amazon-durability - Amazon’s durable-rails framing
- ../../04-tooling/2026-05-01-mpp-tempo-integration-proposal - the micropayment rail that would underwrite per-query royalties
- ../../06-reference/2026-05-04-indy-dev-dan-pi-coding-agent-reviews-like-you - adjacent: agents need verifier infrastructure too