06-reference

semistructured data layer does the work

2026-03-30 · reference · source: Semi-Structured (Substack) · by Jonathan Natkins (Natty)

“The LLMs Get the Publicity. The Data Layer Does the Work” — Jonathan Natkins

Why this is in the vault

Filed as a direct reinforcement of the cross-check finding from 2026-04-12: “data is the real moat, not the harness.” Natkins provides the technical architecture for WHY the data layer differentiates agent performance. Also directly relevant to the DuckDB graph database evaluation on the board — his “dual memory” architecture makes the case for a structured query layer on top of the vault.

The core argument

Every AI application is becoming a data application, not a model application. The industry inverts priorities — obsessing over models while treating the data layer (context) as an afterthought. Models are commoditized (competing on cost/speed, not capability). The data layer is where sustainable competitive advantage lives.

The ReAct agent framework — where value actually lives

Agents have three components:

  1. Model — reasoning capability (commoditized, interchangeable)
  2. Tools — action mechanisms (the harness layer Garry Tan describes)
  3. Context — informational foundation (where Natkins says the real value is)

The industry treats context as the least important. Natkins argues it’s the most important.
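
A minimal sketch of that loop, assuming a plain Python harness (class, tool, and function names are illustrative, not from the article): the model is a swappable callable, tools are named actions, and context is the data the agent accumulates and re-reads on every step.

```python
# Minimal ReAct-style loop (illustrative sketch, not Natkins's code).
# The model is a commodity reasoning step, tools are the harness,
# and context is the data layer fed into every prompt.
from dataclasses import dataclass, field
from typing import Callable

@dataclass
class Agent:
    model: Callable[[str], str]                        # swappable reasoning step
    tools: dict[str, Callable[[str], str]]             # named actions (the harness)
    context: list[str] = field(default_factory=list)   # the data layer the agent knows

    def step(self, task: str) -> str:
        prompt = "\n".join(self.context + [f"Task: {task}"])
        decision = self.model(prompt)                   # e.g. "search: dual memory"
        if ":" in decision:
            name, arg = decision.split(":", 1)
            if name.strip() in self.tools:
                observation = self.tools[name.strip()](arg.strip())
                self.context.append(f"Observation: {observation}")  # context grows
                return observation
        return decision

# Toy usage with a stub model and one tool.
agent = Agent(
    model=lambda prompt: "search: dual memory",
    tools={"search": lambda q: f"3 notes matched '{q}'"},
    context=["You are an analytics agent over a notes vault."],
)
print(agent.step("What does the vault say about agent memory?"))
```

Swapping the stub model for another provider changes little here; swapping what sits in context changes everything the agent can do, which is Natkins’s point.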

Key technical insights

Data as moat

Proprietary datasets enabling better agent performance are the sustainable advantage. Cursor’s coding dataset is the example — the model is Claude/GPT (commodity), the edge is the training data specific to coding patterns.

Dual memory architecture

Agents need two memory structures:

  1. Working memory: the short-lived context the agent carries through the current task or session
  2. Long-term memory: a persistent, structured store it can query across sessions

This mirrors human cognition: working memory vs long-term memory. Neither alone is sufficient.
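
A minimal sketch of that split, using DuckDB (the store already under evaluation on the board) as the long-term side; the schema and method names are assumptions for illustration, not Natkins’s design.

```python
# Dual memory sketch: a transient working buffer plus a persistent,
# queryable long-term store. Schema and names are illustrative; a real
# store would add timestamps, links, and embeddings.
import duckdb

class DualMemory:
    def __init__(self, db_path: str = "vault_memory.duckdb"):
        self.working: list[str] = []           # working memory: current task only
        self.store = duckdb.connect(db_path)   # long-term memory: survives sessions
        self.store.execute("CREATE TABLE IF NOT EXISTS memories (note TEXT)")

    def remember(self, note: str) -> None:
        self.working.append(note)                                            # ephemeral
        self.store.execute("INSERT INTO memories VALUES (?)", [note])        # durable

    def recall(self, keyword: str) -> list[str]:
        rows = self.store.execute(
            "SELECT note FROM memories WHERE note ILIKE ?", [f"%{keyword}%"]
        ).fetchall()
        return [r[0] for r in rows]
```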

Agentic analytics (how agents query differently)

Humans write one comprehensive query. Agents issue rapid successive queries — iterative exploration rather than a single answer. This changes how you design the data layer: it needs to be fast for many small queries, not optimized for one big one.
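
A toy illustration of the two access patterns against an in-memory DuckDB table (the data and queries are made up): one wide aggregate versus a loop of small, scoped queries where each answer informs the next.

```python
# Access pattern contrast (illustrative). A human analyst writes one wide
# query; an agent narrows iteratively with many cheap, scoped queries.
import duckdb

con = duckdb.connect()  # in-memory database
con.execute(
    "CREATE TABLE events AS SELECT * FROM (VALUES "
    "('search', 120), ('search', 80), ('edit', 45), ('edit', 300)) t(kind, ms)"
)

# Human-style: one comprehensive query up front.
summary = con.execute(
    "SELECT kind, count(*), avg(ms) FROM events GROUP BY kind ORDER BY kind"
).fetchall()

# Agent-style: rapid successive queries, each answering one narrow question.
# The data layer has to make each of these cheap.
kinds = [r[0] for r in con.execute("SELECT DISTINCT kind FROM events").fetchall()]
slow_counts = {}
for kind in kinds:
    slow_counts[kind] = con.execute(
        "SELECT count(*) FROM events WHERE kind = ? AND ms > 100", [kind]
    ).fetchone()[0]

print(summary, slow_counts)
```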

Observability at agent scale

Agent tracing generates massive data volumes (50KB+ per interaction). Traditional observability tools aren’t designed for this. Evaluation and debugging at agent scale is an unsolved infrastructure problem.
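
A back-of-envelope sketch of why the volume adds up, with hypothetical trace fields: a single step that carries the full prompt, tool output, and model output already lands in the tens of kilobytes, and a multi-step interaction multiplies that.

```python
# Back-of-envelope on trace volume (field names and sizes are assumptions).
import json

trace_event = {
    "trace_id": "t-001",
    "step": 3,
    "prompt": "x" * 20_000,        # stand-in for the full context sent to the model
    "tool_call": {"name": "search", "arg": "dual memory"},
    "tool_output": "y" * 15_000,   # stand-in for a retrieved document
    "model_output": "z" * 15_000,  # stand-in for the model's reasoning + answer
}

size_kb = len(json.dumps(trace_event).encode()) / 1024
print(f"one step ≈ {size_kb:.0f} KB; 10 steps per interaction ≈ {size_kb * 10:.0f} KB")
```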

Mapping against Ray Data Co

This is the strongest data-moat argument we’ve filed. Direct connections:

  1. The 2026-04-12 cross-check finding (“data is the real moat, not the harness”): Natkins supplies the architectural reasoning behind it
  2. The DuckDB graph database evaluation: his dual memory architecture is the argument for a structured query layer on top of the vault
  3. Garry Tan’s harness framing: the Tools layer, which Natkins ranks below Context

Where Natkins extends beyond what we’ve filed

Tracked author

Jonathan Natkins (“Natty”) — Semi-Structured Substack. Data infrastructure focused. Worth adding to the newsletter whitelist for ongoing monitoring.