06-reference

indydevdan m5 max mlx local stack

Sun Apr 19 2026 · reference · source: IndyDevDan (YouTube) · by IndyDevDan
local-llm · mlx · apple-silicon · agentic-coding · gemma · qwen · on-device-inference

“My M5 Max, Gemma 4, MLX LOCAL Stack. (This KILLS MODEL PROVIDERS)” — IndyDevDan

Why this is in the vault

Direct operational input for the Mac-Mini-as-COO architecture. Dan benchmarks two-tier agent design (cloud orchestrator + local micro-agents on MLX) with concrete numbers — 2x decode speedup from MLX-NVFP4 over GGUF, 16K context cliff as the architectural boundary, 30 tok/s usability floor. The COO harness already runs on Apple Silicon; this video tells us where the leverage is in the stack we’re already on. Outage independence (filmed during an Anthropic API outage) is a now-credible engineering target rather than speculation.

Episode summary

Dan benchmarks the brand-new M5 Max MacBook Pro against the previous-gen M4 Max across three local-inference workloads — cold/warm prompt throughput, context-scaling under load, and full agentic coding via the Pi coding agent — using Gemma 4 and Qwen 3.5 in both GGUF and MLX/NVFP4 variants. Headline finding: MLX-formatted models on Apple Silicon decode at roughly 2x GGUF speed (118 tok/s vs 60 tok/s for Qwen 3.5), and the M5 Max is 15-50% faster than the M4 Max on real wall-clock workloads. The framing argument is openly polemical: cloud APIs are a “rental racket” and the local-inference cliff is closer than most engineers believe — Dan predicts a “Sonnet/Opus 4.0-class” model will run usably on consumer hardware by end of 2026.

Key arguments / segments

Notable claims

Guests

Solo episode — Dan only. No sponsor segments. Featured tool: Pi coding agent (pi.dev) — Dan has been promoting Pi consistently across recent episodes.

Mapping against Ray Data Co

This is directly load-bearing for the Mac-Mini-as-COO architecture. Ben’s always-on agent already runs on a Mac Mini; the upgrade path Dan is mapping (M5 Max now → M5/M6 Ultra later) is the same path Ben is on. Three concrete action items fall out:

  1. Switch any local-model code paths to MLX-NVFP4 if they aren't already. A 2x decode speedup is free wall-clock time on the same hardware. If we’re running Ollama with default GGUF for any micro-agent task (skill triage, fast-path classification, content tagging), this is leaving 2x on the table.
  2. The micro-agent thesis is the right tier split for the COO harness. Pattern: cloud (Claude Opus 4.7) for wide-context orchestration; local Gemma 4 / Qwen 3.5 MLX for narrow, well-scoped subtasks (newsletter classification, vault tag assignment, idempotency checks, transcript summarization sub-routing). The 16K context cliff is the right architectural boundary: if a sub-agent task fits in 16K, run it locally; if not, send it to the cloud. A minimal routing sketch follows this list.
  3. Outage independence is a real operational benefit, not just hype. The Anthropic API outage Dan filmed during is the same outage class that has bitten the COO harness multiple times. Local fallback for the “keep working on this while the API is down” tier is now a credible engineering target; six months ago it was speculative.
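
A minimal routing sketch for the tier split above, assuming a Python harness on Apple Silicon with mlx-lm installed. The model repo id, the chars-per-token heuristic, and the `call_cloud_orchestrator` callback are illustrative placeholders, not anything specified in the video.

```python
# Hypothetical routing sketch: local MLX micro-agent under the 16K cliff,
# cloud orchestrator above it (or when the local path fails).
from mlx_lm import load, generate   # pip install mlx-lm (Apple Silicon only)

CONTEXT_CLIFF_TOKENS = 16_000       # the 16K boundary from Dan's benchmarks
LOCAL_MLX_MODEL = "mlx-community/Qwen2.5-7B-Instruct-4bit"  # placeholder repo id

_model = None
_tokenizer = None

def _local_generate(prompt: str, max_tokens: int = 512) -> str:
    """Run a narrow, well-scoped subtask on the local MLX model (lazy-loaded)."""
    global _model, _tokenizer
    if _model is None:
        _model, _tokenizer = load(LOCAL_MLX_MODEL)
    return generate(_model, _tokenizer, prompt=prompt, max_tokens=max_tokens)

def route_subtask(prompt: str, call_cloud_orchestrator) -> str:
    """Tier split: under the 16K cliff, stay local; over it (or on local
    failure, e.g. model not downloaded yet), hand off to the cloud model."""
    approx_tokens = len(prompt) // 4        # rough chars-per-token estimate
    if approx_tokens < CONTEXT_CLIFF_TOKENS:
        try:
            return _local_generate(prompt)
        except Exception:
            pass                            # fall through to cloud
    return call_cloud_orchestrator(prompt)
```

The same shape doubles as the outage fallback in item 3: invert the ordering (try the cloud orchestrator first, fall back to the local model on API failure) for tasks that have to keep moving while the API is down.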

The polemic framing (“model providers don’t want you to see this”) is overdone, but the underlying engineering claim is verifiable and the cost math is real. Worth queuing a vault concept article on the two-tier agent architecture (cloud orchestrator + local micro-agents) as the synthesis of this video and the broader IndyDevDan agent-thread cluster from the past month.