“The LLMs Get the Publicity. The Data Layer Does the Work” — Jonathan Natkins
Why this is in the vault
Filed as a direct reinforcement of the cross-check finding from 2026-04-12: “data is the real moat, not the harness.” Natkins provides the technical architecture for WHY the data layer differentiates agent performance. Also directly relevant to the DuckDB graph database evaluation on the board — his “dual memory” architecture is the case for a structured query layer on top of the vault.
The core argument
Every AI application is becoming a data application, not a model application. The industry inverts priorities — obsessing over models while treating the data layer (context) as an afterthought. Models are commoditized (competing on cost/speed, not capability). The data layer is where sustainable competitive advantage lives.
The ReAct agent framework — where value actually lives
Agents have three components:
- Model — reasoning capability (commoditized, interchangeable)
- Tools — action mechanisms (the harness layer Garry Tan describes)
- Context — informational foundation (where Natkins says the real value is)
The industry treats context as the least important. Natkins argues it’s the most important.
Key technical insights
Data as moat
Proprietary datasets enabling better agent performance are the sustainable advantage. Cursor’s coding dataset is the example — the model is Claude/GPT (commodity), the edge is the training data specific to coding patterns.
Dual memory architecture
Agents need two memory structures:
- Short-term — fast transactional storage for the current task (working memory)
- Long-term — large-scale analytical retrieval across the full knowledge base
This mirrors human cognition: working memory vs long-term memory. Neither alone is sufficient.
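The split can be sketched as a minimal interface. This is our own illustration, not Natkins's implementation; the class and method names are hypothetical:

```python
from dataclasses import dataclass, field

@dataclass
class DualMemory:
    """Sketch of the dual memory architecture: short-term is fast
    transactional state for the current task; long-term is analytical
    retrieval across the full knowledge base."""
    short_term: dict = field(default_factory=dict)  # working memory
    long_term: list = field(default_factory=list)   # (text, tags) corpus

    def remember(self, key, value):
        # transactional write, scoped to the current task
        self.short_term[key] = value

    def archive(self, text, tags):
        # append-only store standing in for the vault's long-term layer
        self.long_term.append((text, set(tags)))

    def retrieve(self, tag):
        # analytical scan across the whole knowledge base
        return [text for text, tags in self.long_term if tag in tags]

mem = DualMemory()
mem.remember("current_task", "evaluate DuckDB")
mem.archive("Natkins on data moats", ["data-moat", "agents"])
mem.archive("Tan on thin harnesses", ["harness"])
print(mem.retrieve("data-moat"))  # → ['Natkins on data moats']
```

In our system the two fields map to working-context.md and QMD respectively; the point of the sketch is that the two stores have different write and read disciplines, not just different sizes.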
Agentic analytics (how agents query differently)
Humans write one comprehensive query. Agents issue rapid successive queries — iterative exploration rather than a single answer. This changes how you design the data layer: it needs to be fast for many small queries, not optimized for one big one.
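The access-pattern difference can be made concrete. A toy example with made-up vault data (entry names and tags are illustrative), contrasting one comprehensive query with an agent's iterative narrowing:

```python
# Toy vault: entry -> set of topic tags
vault = {
    "natkins-data-layer": {"data-moat", "agents", "memory"},
    "tan-thin-harness": {"harness", "agents"},
    "duckdb-eval": {"duckdb", "data-moat"},
}

def query(predicate):
    """One 'query' = one scan over the store; agents issue many in a row."""
    return {k for k, tags in vault.items() if predicate(tags)}

# Human pattern: one comprehensive query, fully specified up front.
human_answer = query(lambda t: "agents" in t and "data-moat" in t)

# Agent pattern: rapid successive queries, each refining the last.
step1 = query(lambda t: "agents" in t)                 # broad probe
step2 = {k for k in step1 if "data-moat" in vault[k]}  # narrow on result
assert human_answer == step2  # same answer, different access pattern
```

Same result either way, but the agent pattern pays per-query overhead many times over, which is exactly the property the data layer has to be designed around.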
Observability at agent scale
Agent tracing generates massive data volumes (50KB+ per interaction). Traditional observability tools aren’t designed for this. Evaluation and debugging at agent scale is an unsolved infrastructure problem.
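A minimal trace record makes the volume problem tangible. The field names here are our assumption, not a schema from the article; the 50KB figure covers the prompt, tool calls, and outputs of one interaction:

```python
import json
import time
import uuid

def trace_event(agent_id, step, payload):
    """One trace record per agent step. At 50KB+ per interaction,
    volume, not schema, is the hard part of agent observability."""
    return {
        "trace_id": str(uuid.uuid4()),
        "agent_id": agent_id,
        "step": step,
        "ts": time.time(),
        "payload": payload,  # prompt, tool call, result, etc.
    }

event = trace_event("sub-agent-07", "write_vault_entry",
                    {"tool": "qmd.index", "bytes_out": 51_200})
line = json.dumps(event)  # append one line to a JSONL trace log
```

Even this stub, multiplied by dozens of sub-agents and many steps each, produces megabytes per session, which is why traditional observability tooling strains at agent scale.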
Mapping against Ray Data Co
This is the strongest data-moat argument we’ve filed. Direct connections:
- Our vault IS the data layer Natkins describes. 580+ cross-linked, bias-flagged, assessed documents with typed frontmatter. The skills (harness) activate this data. The data is the asset; the skills are the means.
- The dual memory architecture maps to our system: QMD (semantic search = long-term retrieval) + working-context.md (session state = short-term). The DuckDB graph database evaluation on the board would add a third dimension: structured relationship traversal.
- Agentic analytics applies to how we query the vault. When /cross-check runs, it issues many small reads across vault entries, not one big query. QMD’s current architecture handles this, but a graph layer would make multi-hop traversals (the “find all authors who wrote about X and were cited by Y” queries) native.
- The dissent finding from today’s cross-check: “The harness matters, but it’s necessary-not-sufficient. Data is the durable moat.” Natkins provides the technical reasoning for this exact conclusion.
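The multi-hop query above (“authors who wrote about X and were cited by Y”) reduces to edge traversal. A plain-Python sketch over hypothetical edge lists a graph layer would maintain; no graph database assumed:

```python
# Hypothetical edges a graph layer over the vault would maintain.
wrote_about = {  # author -> topics they wrote about
    "natkins": {"data-moat", "memory"},
    "tan": {"harness"},
}
cites = {  # citing author -> authors they cite
    "tan": {"natkins"},
}

def authors_about_cited_by(topic, citer):
    """Two-hop query: filter authors on topic, join against citations."""
    cited = cites.get(citer, set())
    return {author for author, topics in wrote_about.items()
            if topic in topics and author in cited}

print(authors_about_cited_by("data-moat", "tan"))  # → {'natkins'}
```

With a graph layer this is one native traversal; without it, each hop becomes another round of small reads over frontmatter, which is workable but slower and easier to get wrong.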
Where Natkins extends beyond what we’ve filed
- Observability at agent scale: we haven’t thought about this for our own system. When we spawned 50+ sub-agents today processing newsletters and Moonshots episodes, we had no structured observability. If a sub-agent produced a bad vault entry, we’d only catch it through the /cross-check skill or manual review. Natkins would say we need an agent tracing layer.
- “Agentic analytics” as a design principle: this reframes the DuckDB evaluation. The question isn’t “can DuckDB do graph queries?”; it’s “can DuckDB handle the query pattern agents actually use?” (many fast small queries, not few large ones). DuckDB is designed for analytical workloads (few large queries), which may be the wrong fit for agentic access patterns. Worth pressure-testing.
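One way to structure that pressure test, sketched with stdlib sqlite3 as a stand-in so the harness is runnable anywhere (DuckDB itself is not assumed installed; swap the connection to run the real evaluation):

```python
import sqlite3
import time

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE entries (id INTEGER PRIMARY KEY, topic TEXT)")
con.executemany("INSERT INTO entries (topic) VALUES (?)",
                [(f"topic-{i % 50}",) for i in range(10_000)])

# Analytical pattern: one large scan-and-aggregate query.
t0 = time.perf_counter()
groups = con.execute(
    "SELECT topic, COUNT(*) FROM entries GROUP BY topic").fetchall()
big_query_s = time.perf_counter() - t0

# Agentic pattern: many small point lookups in rapid succession.
t0 = time.perf_counter()
for i in range(1, 501):
    con.execute("SELECT topic FROM entries WHERE id = ?", (i,)).fetchone()
small_queries_s = time.perf_counter() - t0

# The ratio, not the absolute numbers, is what the evaluation should
# track: per-query overhead dominates the agentic access pattern.
print(f"1 big: {big_query_s:.4f}s, 500 small: {small_queries_s:.4f}s")
```

The decision criterion for the board task would then be empirical: if per-query overhead on the real engine swamps the graph-traversal gains under this pattern, the graph layer needs a different substrate.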
Tracked author
Jonathan Natkins (“Natty”) — Semi-Structured Substack. Data infrastructure focused. Worth adding to the newsletter whitelist for ongoing monitoring.
Related
- synthesis-harness-thesis-dissent-2026-04-12 — the “data is the real moat” counter-argument this article reinforces
- 2026-04-11-garry-tan-thin-harness-fat-skills — Tan’s framework (Natkins would say the “fat skills” matter less than the “fat data”)
- cross-checks/2026-04-12-cross-check-agent-architecture — the cross-check that identified data-moat as a missing voice
- 2026-03-30-founder-data-quality-framework — the testing framework that ensures the data layer is trustworthy
- ../04-tooling/notion-task-board-reference — DuckDB evaluation task is on the board