06-reference

semistructured data layer does the work

2026-03-30 · reference · source: Semi-Structured (Substack) · by Jonathan Natkins (Natty)

“The LLMs Get the Publicity. The Data Layer Does the Work” — Jonathan Natkins

Why this is in the vault

Filed as a direct reinforcement of the cross-check finding from 2026-04-12: “data is the real moat, not the harness.” Natkins provides the technical architecture for WHY the data layer differentiates agent performance. Also directly relevant to the DuckDB graph database evaluation on the board — his “dual memory” architecture makes the case for a structured query layer on top of the vault.

The core argument

Every AI application is becoming a data application, not a model application. The industry inverts priorities — obsessing over models while treating the data layer (context) as an afterthought. Models are commoditized (competing on cost/speed, not capability). The data layer is where sustainable competitive advantage lives.

The ReAct agent framework — where value actually lives

Agents have three components:

  1. Model — reasoning capability (commoditized, interchangeable)
  2. Tools — action mechanisms (the harness layer Garry Tan describes)
  3. Context — informational foundation (where Natkins says the real value is)

The industry treats context as the least important. Natkins argues it’s the most important.
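
A minimal sketch of that loop, assuming a plain Python harness (class, tool, and function names are illustrative, not from the article): the model is a swappable callable, tools are named actions, and context is the data the agent accumulates and re-reads on every step.

```python
# Minimal ReAct-style loop (illustrative sketch, not Natkins's code).
# The model is a commodity reasoning step, tools are the harness,
# and context is the data layer fed into every prompt.
from dataclasses import dataclass, field
from typing import Callable

@dataclass
class Agent:
    model: Callable[[str], str]                        # swappable reasoning step
    tools: dict[str, Callable[[str], str]]             # named actions (the harness)
    context: list[str] = field(default_factory=list)   # the data layer the agent knows

    def step(self, task: str) -> str:
        prompt = "\n".join(self.context + [f"Task: {task}"])
        decision = self.model(prompt)                   # e.g. "search: dual memory"
        if ":" in decision:
            name, arg = decision.split(":", 1)
            if name.strip() in self.tools:
                observation = self.tools[name.strip()](arg.strip())
                self.context.append(f"Observation: {observation}")  # context grows
                return observation
        return decision

# Toy usage with a stub model and one tool.
agent = Agent(
    model=lambda prompt: "search: dual memory",
    tools={"search": lambda q: f"3 notes matched '{q}'"},
    context=["You are an analytics agent over a notes vault."],
)
print(agent.step("What does the vault say about agent memory?"))
```

Swapping the stub model for another provider changes little here; swapping what sits in context changes everything the agent can do, which is Natkins’s point.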

Key technical insights

Data as moat

Proprietary datasets enabling better agent performance are the sustainable advantage. Cursor’s coding dataset is the example — the model is Claude/GPT (commodity), the edge is the training data specific to coding patterns.

Dual memory architecture

Agents need two memory structures:

  1. Working memory: the short-lived context the agent carries through the current task or session
  2. Long-term memory: a persistent, structured store it can query across sessions

This mirrors human cognition: working memory vs long-term memory. Neither alone is sufficient.
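
A minimal sketch of that split, using DuckDB (the store already under evaluation on the board) as the long-term side; the schema and method names are assumptions for illustration, not Natkins’s design.

```python
# Dual memory sketch: a transient working buffer plus a persistent,
# queryable long-term store. Schema and names are illustrative; a real
# store would add timestamps, links, and embeddings.
import duckdb

class DualMemory:
    def __init__(self, db_path: str = "vault_memory.duckdb"):
        self.working: list[str] = []           # working memory: current task only
        self.store = duckdb.connect(db_path)   # long-term memory: survives sessions
        self.store.execute("CREATE TABLE IF NOT EXISTS memories (note TEXT)")

    def remember(self, note: str) -> None:
        self.working.append(note)                                            # ephemeral
        self.store.execute("INSERT INTO memories VALUES (?)", [note])        # durable

    def recall(self, keyword: str) -> list[str]:
        rows = self.store.execute(
            "SELECT note FROM memories WHERE note ILIKE ?", [f"%{keyword}%"]
        ).fetchall()
        return [r[0] for r in rows]
```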

Agentic analytics (how agents query differently)

Humans write one comprehensive query. Agents issue rapid successive queries — iterative exploration rather than a single answer. This changes how you design the data layer: it needs to be fast for many small queries, not optimized for one big one.
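
A toy illustration of the two access patterns against an in-memory DuckDB table (the data and queries are made up): one wide aggregate versus a loop of small, scoped queries where each answer informs the next.

```python
# Access pattern contrast (illustrative). A human analyst writes one wide
# query; an agent narrows iteratively with many cheap, scoped queries.
import duckdb

con = duckdb.connect()  # in-memory database
con.execute(
    "CREATE TABLE events AS SELECT * FROM (VALUES "
    "('search', 120), ('search', 80), ('edit', 45), ('edit', 300)) t(kind, ms)"
)

# Human-style: one comprehensive query up front.
summary = con.execute(
    "SELECT kind, count(*), avg(ms) FROM events GROUP BY kind ORDER BY kind"
).fetchall()

# Agent-style: rapid successive queries, each answering one narrow question.
# The data layer has to make each of these cheap.
kinds = [r[0] for r in con.execute("SELECT DISTINCT kind FROM events").fetchall()]
slow_counts = {}
for kind in kinds:
    slow_counts[kind] = con.execute(
        "SELECT count(*) FROM events WHERE kind = ? AND ms > 100", [kind]
    ).fetchone()[0]

print(summary, slow_counts)
```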

Observability at agent scale

Agent tracing generates massive data volumes (50KB+ per interaction). Traditional observability tools aren’t designed for this. Evaluation and debugging at agent scale is an unsolved infrastructure problem.
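
A back-of-envelope sketch of why the volume adds up, with hypothetical trace fields: a single step that carries the full prompt, tool output, and model output already lands in the tens of kilobytes, and a multi-step interaction multiplies that.

```python
# Back-of-envelope on trace volume (field names and sizes are assumptions).
import json

trace_event = {
    "trace_id": "t-001",
    "step": 3,
    "prompt": "x" * 20_000,        # stand-in for the full context sent to the model
    "tool_call": {"name": "search", "arg": "dual memory"},
    "tool_output": "y" * 15_000,   # stand-in for a retrieved document
    "model_output": "z" * 15_000,  # stand-in for the model's reasoning + answer
}

size_kb = len(json.dumps(trace_event).encode()) / 1024
print(f"one step ≈ {size_kb:.0f} KB; 10 steps per interaction ≈ {size_kb * 10:.0f} KB")
```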

Mapping against Ray Data Co

This is the strongest data-moat argument we’ve filed. Direct connections:

  1. The 2026-04-12 cross-check finding (“data is the real moat, not the harness”): Natkins supplies the architectural reasoning behind it
  2. The DuckDB graph database evaluation: his dual memory architecture is the argument for a structured query layer on top of the vault
  3. Garry Tan’s harness framing: the Tools layer, which Natkins ranks below Context

Where Natkins extends beyond what we’ve filed

Tracked author

Jonathan Natkins (“Natty”) — Semi-Structured Substack. Data infrastructure focused. Worth adding to the newsletter whitelist for ongoing monitoring.