How GPT-5, Claude, and Gemini are actually trained and served — Reiner Pope
Why this is in the vault
Filed as the canonical “why does API pricing look the way it does” reference for any future Sanity Check piece on inference economics, agent-deployer cost modeling, or frontier-model substrate. Pope derives the roofline model from first principles — the math predicts Gemini’s 200K context cliff, the 5x decode-vs-prefill spread, and the 10x cache-hit discount, which means RDCO no longer has to treat vendor pricing as opaque. Specifically load-bearing for: (1) the agent-deployer thesis cluster (compute economics determines whether always-on agents stay viable as workloads scale); (2) the harness-thesis cluster (why prompt caching matters so much — it’s the cheapest lever on the whole roofline); (3) any future deep-research brief that needs to call BS on frontier-model marketing claims. The Feistel-network/RevNet side tangent is interesting but skip — the load-bearing payload is the first 90 minutes.
Episode summary
A blackboard-format technical lecture (the inaugural use of Dwarkesh’s new whiteboard studio) in which Reiner Pope — CEO of chip startup MatX (rendered “Maddox”/“Maddx” in the transcript), formerly on TPU architecture at Google — derives roofline models for transformer inference and training from first principles. The whole conversation is an algebra walkthrough that explains why API prices, latency tiers, context-length cutoffs, and cache-hit pricing look the way they do. Two organizing principles: (1) a roofline analysis comparing memory-fetch time vs compute time on a Blackwell NVL72 rack, and (2) batch-size and context-length sensitivity analysis. The payoff is that the math predicts virtually everything visible in vendor pricing — Gemini’s 200K context cliff, the 5x decode-vs-prefill spread, the 10x cache-hit discount, even the existence of “fast mode” tiers.
Closing third pivots to a fun side-tangent on neural-network/cryptography duality (Feistel networks, RevNets) — interesting but not load-bearing for RDCO.
Key arguments / segments
- 00:00–04:00 — Roofline setup. Inference time is dominated by max(weight-fetch + KV-fetch, compute). Compute time scales linearly with batch size; weight-fetch is constant; KV-fetch is linear in batch × context. This three-term model has surprisingly strong predictive power (coded up right after this list).
- 04:00–20:00 — Why batch size is everything. Without batching, per-token cost is ~1000x worse. The cost-per-token curve is a hyperbola (weight amortization) sitting on a constant compute floor. Optimal batch ≈ 300 × sparsity ratio (sparsity = active/total params); for DeepSeek that’s ~8 sequences but in practice ~2,000 tokens-in-flight, since “batch” means concurrent sequences each generating one new token.
- 20:00–32:00 — The “train every 20ms” scheduling model. A new batch departs every ~15-20ms (= HBM capacity / HBM bandwidth, stable across HBM generations; one-line arithmetic after this list). Worst-case queueing latency ≈ 40ms. Mixture-of-experts mapped onto a Blackwell rack uses expert parallelism: 256 experts across 64 GPUs ≈ 4 experts/GPU. The communication pattern is all-to-all, which is the exact fit for NVL72’s switch topology.
- 32:00–46:00 — Why scale-up size, not memory capacity, matters. A rack is bounded by power/weight/cooling and physical cable density (literally, bend radius and back-plane connector density limit growth). Crossing rack boundaries pays an 8x bandwidth penalty. Pipeline parallelism solves the memory-capacity problem (split layers across racks) but does NOT solve KV-cache memory, because microbatching cancels the per-rack savings. Blackwell finally gave a single scale-up domain enough HBM (~10-20TB) for a 5T-param model + KV cache — which is why models suddenly got bigger in the last 6 months (capacity check after this list).
- 46:00–1:01 — Pipeline parallelism deep dive. Forward/backward bubble structure (bubble math after this list); “even in inference, pipeline doesn’t save memory if KV dominates.” Gemini’s apparent pre-training advantage may come from larger TPU scale-up domains plus DeepSeek-style fine-grained MoE.
- 1:03–1:18 — Memory wall macro. Hyperscalers are spending ~50% of $1T capex on memory (a Dylan Patel claim Reiner finds plausible). But scale-up matters not for capacity (pipelining solves that) but for memory bandwidth — bigger scale-up = parallel weight loads = lower latency.
- 1:18–1:32 — Compute equilibrium across pre-training, RL, and inference. Heuristic: the minimum-cost point is where the three costs equalize. Working through 6ND for pre-train, ~2-6ND for RL (depending on whether you backward-pass every rollout), 2ND for inference. Result: pre-train tokens ≈ RL tokens ≈ inference tokens served over the model’s lifetime. Implied: if a frontier model serves ~50M tokens/sec and lives 2 months → ~200T inference tokens, which matches the reported 150T pre-training tokens — so models are ~100x over Chinchilla-optimal (arithmetic after this list).
- 1:32–1:48 — Reading API pricing as ground truth. Gemini’s 200K context price-doubling cliff is the inflection where KV-fetch overtakes compute. Solving for bytes/token of KV at that crossover gives ~2KB, consistent with character.ai-style cross-layer KV sharing or 8 KV heads × 128 d_head (back-out after this list). The 5x decode-vs-prefill spread proves frontier inference is heavily memory-bandwidth-bound.
- 1:48–2:03 — Cache pricing reveals the memory-tier hierarchy. The 10x cache-hit discount is the ratio of rematerialization cost to storage cost. The 5-min vs 1-hour cache-write TTLs let you back out the memory tier — 5-min ≈ HBM-or-DDR drain time, 1-hour ≈ flash, possibly even spinning disk (implied-rent sketch after this list). Reiner is “shocked” labs are still using spinning disk, but the math fits.
- 2:03–2:13 — Bonus: cryptography ↔ neural net duality. Feistel ciphers and RevNets share the same invertibility construction, useful for memory-saving training (recompute activations from the invertible forward instead of storing them). Tangent, not load-bearing (toy sketch at the end of the code below).
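Worked sketches for the segments above follow. First, the 00:00–20:00 roofline and batch-size story as code. Every hardware constant here is my own placeholder (a Blackwell-ish rack, DeepSeek-scale parameter counts, the ~2KB/token KV figure from later in the episode), not a number Pope states; what matters is the shape of the output: cost/token falls as a hyperbola with batch until it hits the compute floor.

```python
# Pope's three-term decode roofline. All constants are illustrative placeholders.
FLOPS = 9e15        # assumed rack-level fp8 throughput, FLOP/s
HBM_BW = 5e14       # assumed aggregate HBM bandwidth, bytes/s
N_TOTAL = 671e9     # total params (DeepSeek-scale), 1 byte each at fp8
N_ACTIVE = 37e9     # active params per token
KV_BYTES = 2e3      # ~2KB of KV per token (the figure backed out of Gemini pricing)

def step_time(batch, context):
    """Seconds for one decode step over the whole batch: max(memory, compute)."""
    weight_fetch = N_TOTAL / HBM_BW                  # constant in batch
    kv_fetch = batch * context * KV_BYTES / HBM_BW   # linear in batch x context
    compute = 2 * N_ACTIVE * batch / FLOPS           # linear in batch
    return max(weight_fetch + kv_fetch, compute)

def cost_per_token(batch, context):
    return step_time(batch, context) / batch         # amortize step over the batch

for b in (1, 8, 64, 512, 4096):
    print(f"batch {b:>4}: {cost_per_token(b, 8_000):.2e} s/token")
# Falls ~1/batch while weight-fetch dominates, then flattens at the compute
# floor; grow `context` and the KV term erodes the amortization instead.
```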
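The 20:00–32:00 cadence claim is one line of division; the inputs are the Rubin figures cited under Notable claims.

```python
# "A train departs every ~20ms": cadence = HBM capacity / HBM bandwidth.
hbm_capacity = 288e9     # bytes per GPU (Rubin figure from the episode)
hbm_bandwidth = 20e12    # bytes/s per GPU (ditto)
print(f"one full weight sweep every {hbm_capacity / hbm_bandwidth * 1e3:.1f} ms")
# ~14.4 ms. Worst case you arrive just after a departure and wait ~2 sweeps,
# which is the ~40ms worst-case queueing latency in the bullet above.
```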
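Capacity check for the 32:00–46:00 claim that 5T params + KV only fit after Blackwell. The scale-up GPU counts are from the episode; the per-GPU HBM sizes are my assumptions.

```python
# Does a 5T-param model fit in one scale-up domain?
hopper_domain = 8 * 80e9       # ~0.64 TB (HGX-class, 80 GB/GPU assumed)
blackwell_domain = 72 * 192e9  # ~13.8 TB (NVL72, 192 GB/GPU assumed)
weights = 5e12 * 1             # 5T params at 1 byte each (fp8)
for name, cap in (("Hopper", hopper_domain), ("Blackwell", blackwell_domain)):
    print(f"{name}: weights fit = {cap > weights}, "
          f"KV headroom = {(cap - weights) / 1e12:.1f} TB")
# Hopper is ~8x too small before any KV; Blackwell leaves ~8.8 TB for KV cache.
```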
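For the 46:00–1:01 pipeline segment, the standard GPipe-style bubble formula (my framing; Pope doesn’t write this out on the board): with p stages and m microbatches, the idle fraction is (p-1)/(m+p-1). The closing comment is the point Pope does make about KV.

```python
# Pipeline bubble: fraction of device-time idle in a GPipe-style schedule.
def bubble_fraction(stages, microbatches):
    return (stages - 1) / (microbatches + stages - 1)

for m in (1, 4, 16, 64):
    print(f"8 stages, {m:>2} microbatches: {bubble_fraction(8, m):.0%} idle")
# More microbatches shrink the bubble, but every in-flight microbatch keeps its
# own activations/KV resident -- the "microbatching cancels the per-rack KV
# savings" point from the bullet above.
```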
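The 1:18–1:32 equalization arithmetic, using the serve rate and lifetime quoted in the bullet.

```python
# Lifetime inference tokens for one frontier model, straight multiplication.
serve_rate = 50e6                 # tokens/s (episode figure)
lifetime_s = 60 * 24 * 3600       # ~2 months in seconds
inference_tokens = serve_rate * lifetime_s
print(f"~{inference_tokens / 1e12:.0f}T lifetime inference tokens")
# ~260T by this multiplication; the bullet's ~200T is the same ballpark, and
# both sit at the same order as the reported ~150T pre-training tokens.

chinchilla_tokens = 20 * 100e9    # ~20 tokens/param at ~100B active params
print(f"overtrained by ~{150e12 / chinchilla_tokens:.0f}x vs Chinchilla")  # ~75x
```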
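The 1:32–1:48 back-out of KV size from Gemini’s cliff. N_ACTIVE and the H100-class bandwidth/FLOPs ratio are my assumptions; only the 200K crossover comes from the pricing page.

```python
# At the cliff context C*, marginal KV-fetch time equals marginal compute time:
#     C* * kv_bytes / BW = 2 * N_active / FLOPS
#  => kv_bytes = 2 * N_active * (BW / FLOPS) / C*
N_ACTIVE = 100e9    # assumed active params
BW = 3.35e12        # bytes/s, H100-class HBM (assumed)
FLOPS = 2e15        # FLOP/s, H100-class fp8 (assumed)
C_STAR = 200_000    # Gemini's price-doubling context
kv_bytes = 2 * N_ACTIVE * (BW / FLOPS) / C_STAR
print(f"~{kv_bytes:.0f} bytes of KV per token")
# ~1.7KB, consistent with the ~2KB in the bullet (e.g. 8 KV heads x 128 d_head
# at 2 bytes, with cross-layer sharing keeping the layer count out of it).
```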
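The 1:48–2:03 tier back-out, written as a “decode the pricing page” calculation. Every price here is a made-up placeholder, not any vendor’s number; the structure is the point: a cache-hit price and a TTL jointly imply a maximum storage rent, which points at a hardware tier.

```python
# What storage rent can a cache-hit price fund over a given TTL?
prefill_price = 1.25e-6          # $/token, hypothetical
hit_price = prefill_price / 10   # the 10x cache-hit discount
kv_bytes = 2e3                   # ~2KB/token from the cliff analysis
for ttl_hours in (5 / 60, 1.0):  # the 5-min and 1-hour cache tiers
    rent = hit_price / (kv_bytes / 1e9 * ttl_hours)
    print(f"TTL {ttl_hours:.2f}h -> implied budget ~${rent:.2f}/GB-hour")
# A short TTL leaves budget for expensive media (DDR, even HBM); an hour-long
# TTL only pencils out on cheap media (flash, maybe spinning disk) -- which is
# the back-out in the bullet above.
```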
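And the 2:03–2:13 tangent, since it fits in a few lines: a toy coupling block showing the shared Feistel/RevNet construction. F and G are arbitrary stand-ins; invertibility never requires inverting them, which is why the same trick serves ciphers and memory-cheap training.

```python
# Coupling construction shared by Feistel ciphers and RevNets.
def F(v): return 3.0 * v * v + 1.0   # arbitrary "round function" / sub-network
def G(v): return 0.5 * v - 2.0

def forward(x1, x2):
    y1 = x1 + F(x2)    # RevNet uses +, Feistel uses XOR; same construction
    y2 = x2 + G(y1)
    return y1, y2

def inverse(y1, y2):
    x2 = y2 - G(y1)    # recompute activations instead of storing them:
    x1 = y1 - F(x2)    # the memory-saving-training trick Pope mentions
    return x1, x2

assert inverse(*forward(1.5, -2.0)) == (1.5, -2.0)
```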
Notable claims
- Optimal batch size ≈ 300 × sparsity — dimensionless constant ~300 holds across A100/H100/B100; only varies if FLOPs/bandwidth ratio changes.
- HBM drain time ≈ 15-20ms — stable across HBM generations (Rubin: 288GB / 20TB/s ≈ 15ms). This sets the “train departs every 20ms” cadence.
- 8x bandwidth gap between scale-up (NVLink) and scale-out (data-center fabric). Drives expert parallelism inside a rack.
- Frontier inference: ~hundreds of millions of tokens/sec globally (Reiner cites Gemini brag numbers from last year). To compete: ~1/1000 of Gemini scale = ~128K tokens/sec per deployment.
- Models ~100x overtrained vs Chinchilla — derived from cost-equalization across pre-train/RL/inference. ~150T pre-training tokens vs ~2T Chinchilla-optimal for ~100B active params.
- Frontier API pricing leaks architecture — 200K Gemini cliff implies ~2KB KV/token; 5x decode/prefill ratio implies severe memory-bandwidth bound; cache 5min vs 1hr durations imply HBM/DDR vs flash/disk tiers.
- Long context is gated by memory bandwidth, not compute. Sparse attention (DeepSeek’s √n trick) helps but doesn’t remove the wall. “I don’t see a very good path to solving” multi-million-token context — the memory wall has no fix in sight. Implication: 100M-token context (Dario’s “in-context learning is enough for AGI”) is not happening on the current trajectory.
- Pipeline parallelism saves model-weight memory but cannot save KV-cache memory because microbatching cancels the per-rack savings.
- Why now for big models: Hopper scale-up = 640GB (8 GPUs); Blackwell scale-up = 10-20TB (72 GPUs). 5T-param + KV cache fit only became possible with Blackwell — that’s why models appear to have stalled at ~1T params from 2022 to late 2025.
Guests
Reiner Pope — CEO of MatX (rendered “Maddox”/“Maddx” in the transcript; new chip startup, Dwarkesh is an angel investor, disclosed up front). Previously worked on TPU architecture at Google. Note: my prior assumption was Modular — that was wrong. He left Google to found MatX. Tracked-author candidate: STRONG. Reiner is a first-principles inference-economics thinker with hands-on TPU + new-silicon credibility. His framework is the cleanest derivation of “why API prices look like this” I’ve seen filed. Worth adding to RDCO Contact Candidates DB as a tracked author for inference-infrastructure thought leadership.
Dwarkesh Patel — host. Already in vault context. The new blackboard studio format is a notable production-side bet — first-principles technical lectures may become a Dwarkesh sub-format.
Mapping against Ray Data Co
Strength: STRONG. This pairs unusually well with several live RDCO threads from the last week:
- 2026-04-29-every-compute-is-new-cash — Reiner’s math is the engineering substrate for the “compute is the new cash” thesis. He literally derives why hyperscalers are spending 50% of capex on memory: scale-up size determines model size which determines competitive position. The capex isn’t speculation; it’s load-bearing.
- 2026-04-29-stratechery-intel-earnings-terafab — Reiner notes that crossing rack boundaries pays an 8x bandwidth penalty, and that Blackwell’s 9x scale-up jump (Hopper 8 → Blackwell 72 GPUs) was a product/form-factor decision, not a fundamental tech leap. Rubin going to ~500 GPUs per domain is a hard rack-engineering problem (cabling, weight, power). This contextualizes Intel’s foundry positioning — the constraint is increasingly physical/mechanical, not lithographic.
- 2026-04-14-levie-agent-deployer-role-jd — The agent-deployer thesis assumes inference is cheap and getting cheaper. Reiner’s framework gives the floor: for a frontier model, optimal cost-per-token bottoms out at the compute curve and is gated by sparsity ratio. There’s still meaningful compression headroom (DeepSeek-style fine-grained MoE), but the long-context wall caps how “long-running” an autonomous agent can be on one model invocation. Agent-deployer playbooks should assume tool-call boundaries, not infinite context.
- 2026-04-20-data-engineering-central-ram-gpu-cpu-llm-inference — directly adjacent; Reiner formalizes what that piece sketched.
- Anthropic Max use-pattern observation: Our own usage shows Claude Code burns through context windows fast in long agent loops. Reiner’s analysis says this is structural — the KV cache is the most expensive thing in the system per second-of-storage, and there’s no architectural fix coming. Pattern: short-lived focused agents > long-lived “thinking” agents on cost-per-useful-token. This sharpens the case for the agent-deployer role being about pipelines of small invocations, not “set the AI loose for a week.”
One correction to my prior: the brief said “Reiner is at Modular last I knew.” He’s actually at MatX (the transcript’s Maddox/Maddx), his own chip startup. Worth correcting in any vault notes that reference his employer.
Decision-relevant takeaways for RDCO:
- The agent-deployer thesis still holds — but expect cost-per-token to compress slowly, not 10x/year. The big inference improvements (Blackwell scale-up, fine-grained MoE) have already happened.
- Long-context-as-substitute-for-continual-learning (Dario’s framing) looks unlikely on current hardware trajectory. Plan agents around tool boundaries.
- Pricing is signal. Vendor pricing tiers (cache, context cliffs, fast mode) are readable architectural disclosures. Worth a “decode the pricing page” pattern in our market-intel routine.
Related
- 2026-04-29-every-compute-is-new-cash
- 2026-04-29-stratechery-intel-earnings-terafab
- 2026-04-20-data-engineering-central-ram-gpu-cpu-llm-inference
- 2026-04-14-levie-agent-deployer-role-jd
- 2026-04-29-alphasignal-warp-open-source-zed-gemma
- 2026-01-06-stratechery-nvidia-groq-deal
- 2026-01-07-stratechery-nvidia-ces-vera-rubin
- transcripts/2026-04-29-dwarkesh-reiner-pope-gpt5-claude-gemini-training-transcript