Moonshots EP 182: AI Experts React — Elon's Grok 4 Is Now #1 in AI

Summary

A special-edition WTF episode in which Peter Diamandis, Dave Blundin, Salim Ismail, and Emad Mostaque react to xAI's Grok 4 release. The panel's consensus is that Grok 4 is genuinely impressive, performing at PhD level or above in every academic subject, but that it can reason without yet being able to plan. Dave Blundin calls this the "Jarvis moment": AI as a brilliant assistant that still waits for human direction. Grok 4 scored 100% on AIME (advanced math) and 44.4% on Humanity's Last Exam (vs ~5% for the smartest humans), and it leaped to #1 on the Artificial Analysis Intelligence Index.

Emad notes that this was achieved just 28 months after xAI's founding, and that xAI spent as much compute on fine-tuning/post-training as on pre-training, a structural shift from the old 99:1 ratio of pre-training to post-training. The xAI cluster now runs 340,000 GPUs (~$10B in hardware). Salim raises the point that, despite these superhuman narrow capabilities, modeling even a single human cell would require orders of magnitude more compute. Super Grok Heavy's $300/month price is discussed as likely a loss leader for enterprise upsell, with Emad predicting that the cost of equivalent intelligence ("equi-intelligence") will drop 5-10x per year, helped by next-gen Vera Rubin chips.

The panel highlights ARC Institute's use of Grok 4 for CRISPR research (sifting millions of experiment logs to pick optimal hypotheses in seconds) and notes that Grok 4 scored as the best model for chest X-ray evaluation. On medical AI, the panel references the Google study showing AI alone at 90%+ diagnostic accuracy vs physicians at 70-80%, with the centaur (physician + AI) at ~87%, which suggests that human bias actually drags down AI performance.

Key Segments

[00:00-08:00] Grok 4 release reactions, PhD-level in all subjects, reasoning vs planning distinction, 28-month cold start achievement
[08:00-17:00] Benchmark deep-dive — AIME 100%, Humanity's Last Exam 44.4%, benchmark saturation, discovering new physics timeline
[17:00-24:00] Post-training compute parity, 340K GPU cluster, token pricing ($3/$15 per million), 5-10x annual cost deflation
[24:00-33:00] Super Grok Heavy pricing ($300/mo), ARC Institute CRISPR use case, medical AI diagnostic accuracy, game development in 4 hours
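
The token pricing quoted in the 17:00-24:00 segment can be turned into a per-request cost with simple arithmetic. The sketch below assumes that "$3/$15 per million" means $3 per million input tokens and $15 per million output tokens (the notes do not say which figure is which). The function name and the sample token counts are illustrative and do not come from the episode.

```python
def request_cost_usd(input_tokens, output_tokens,
                     input_price_per_million=3.0,
                     output_price_per_million=15.0):
    """Estimate the cost of one request in US dollars.

    Assumption: $3 per million input tokens and $15 per million
    output tokens, read from the "$3/$15 per million" figure.
    """
    total = (input_tokens * input_price_per_million
             + output_tokens * output_price_per_million)
    return total / 1_000_000


# Hypothetical request: 2,000 input tokens and 500 output tokens.
print(f"${request_cost_usd(2_000, 500):.4f}")
```

Under these assumptions, output tokens dominate the bill whenever a response is longer than one fifth of its prompt.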

Notable Claims

Grok 4 scored 100% on AIME (advanced math benchmark) — effectively saturating it
Humanity's Last Exam: 44.4% vs ~5% for the smartest humans — "way outside the range of human ability"
xAI cluster: 340,000 GPUs, ~$10B in hardware, built in 28 months from founding
Post-training compute now equals pre-training compute (post-training was ~1% of pre-training compute two years ago, ~10% with DeepSeek)
Equi-intelligence cost dropping 5-10x per year; Vera Rubin chips will cut costs 3-4x by hardware alone
Emad predicts Humanity's Last Exam will hit 100% within 2 years, probably next year
ARC Institute using Grok 4 to sift millions of CRISPR experiment logs and identify optimal hypotheses in seconds
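
Two of the figures above lend themselves to a quick back-of-the-envelope check: the compounding effect of a 5-10x annual cost drop, and the implied hardware spend per GPU. The sketch below is illustrative arithmetic only. The function names are hypothetical, the $300 starting cost is borrowed from the subscription price purely as a round number, and the per-GPU figure is a blended average that presumably includes more than the GPUs themselves.

```python
def cost_after_years(initial_cost, annual_factor, years):
    """Cost of the same capability after `years` of compounding deflation.

    annual_factor is how many times cheaper it gets each year (e.g. 5 or 10).
    """
    return initial_cost / (annual_factor ** years)


def hardware_cost_per_gpu(total_hardware_usd=10e9, gpu_count=340_000):
    """Blended hardware spend per GPU, using the ~$10B and 340,000 figures."""
    return total_hardware_usd / gpu_count


# A hypothetical $300 workload after two years at the low and high ends.
for factor in (5, 10):
    print(factor, cost_after_years(300.0, factor, 2))

print(round(hardware_cost_per_gpu()))
```

At the claimed rates, a cost of $300 today would fall to between $3 and $12 after two years, and the cluster figures work out to roughly $29,000 of hardware per GPU.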