06-reference

data engineering central ram gpu cpu llm inference

Sun Apr 19 2026 · reference · source: Data Engineering Central · by Daniel Beach

“How RAM and GPU/CPU affects LLM Inference Performance” — @DataEngineeringCentral (Daniel Beach)

Why this is in the vault

A data engineer runs a small, honest experiment: spin up Ollama on progressively bigger CPU+RAM boxes, then add a GPU, and measure how long a single inference takes. The whole thing is scrappy (no batching, one prompt, one model) but the shape of the answer is correct and useful — throwing CPU and RAM at LLM inference does basically nothing, while adding a GPU immediately cuts latency by roughly a third to a half. Files as a reality-check on the “we’ll just self-host to avoid API costs” argument that keeps surfacing in agent-deployer conversations.

⚠️ Sponsorship

Explicit sponsor block for Estuary (Right-Time Data Platform, CDC focus) placed between the intro and the technical section, with author clearly labeling “Today’s post is sponsored by Estuary.” This is the same Estuary pattern we’ve tagged on SeattleDataGuy notes — they sponsor across the data-engineering newsletter circuit. Estuary doesn’t appear in the technical body of the article and has no bearing on the LLM-inference findings; disclosure is clean.

Secondary: author plugs Carolina Cloud as his GPU-instance provider of choice and explicitly labels it “not a sponsored post.” Treat as a real recommendation rather than paid — but worth noting he’s promoting two infra vendors in the same issue.

The core argument

Beach asks: if AI inference costs rise or providers get greedy, can I just run Ollama myself? To find out, he installs Ollama on a range of Carolina Cloud boxes and runs the same prompt (“Explain why data engineers should care about LLM inference performance”) against llama3, measuring wall-clock latency.
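For reference, a minimal sketch of what a measurement like this looks like — an illustration, not Beach's actual code. It assumes a local Ollama server on its default port with llama3 already pulled; the timing fields are the ones Ollama documents in its non-streaming /api/generate response:

```python
# Minimal latency measurement against a local Ollama server (assumed to be
# running on the default port 11434 with llama3 pulled).
import time
import requests

PROMPT = "Explain why data engineers should care about LLM inference performance"

start = time.perf_counter()
resp = requests.post(
    "http://localhost:11434/api/generate",
    json={"model": "llama3", "prompt": PROMPT, "stream": False},
    timeout=600,  # CPU-only boxes can take minutes, per the results below
)
elapsed = time.perf_counter() - start
resp.raise_for_status()
body = resp.json()

print(f"wall clock:       {elapsed:.2f} s")
# Ollama reports its own timings (in nanoseconds) alongside the completion.
print(f"server total:     {body.get('total_duration', 0) / 1e9:.2f} s")
print(f"model load:       {body.get('load_duration', 0) / 1e9:.2f} s")
print(f"generated tokens: {body.get('eval_count', 'n/a')}")
```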

Results:

| Instance | Latency |
| --- | --- |
| 8 vCPU / 16 GiB RAM | 100.83 s |
| 8 vCPU / 48 GiB RAM | 68.36 s |
| 8 vCPU / 148 GiB RAM | 94.52 s |
| 78 vCPU / 48 GiB RAM | 116.89 s (!) |
| 8 vCPU / 364 GiB RAM | 140.59 s (!!) |
| 8 vCPU / 16 GiB + RTX 5090 (32 GB GDDR7) | 61.99 s |
| Bigger AWS NVIDIA instance | 48.93 s |

Two honest takeaways:

  1. CPU and system RAM don’t help, and past a certain point they actively hurt. Beach admits he assumed Ollama would “magically” use whatever RAM you threw at it. It doesn’t. More system RAM with no GPU didn’t accelerate inference — the 364 GiB run was the slowest of the pure-CPU set. (A quick way to check what Ollama actually loads the model onto is sketched right after this list.)
  2. A single consumer GPU (the RTX 5090) immediately beats the best CPU-only config, and a bigger cloud GPU instance beats it further. This is the entire headline: if you want inference latency around 60 s or better for a 7B-class model, the answer is a GPU. Nothing else in the CPU/RAM dimension substitutes.
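The check referenced in takeaway 1 — this is an addition, not from Beach's article. After a run, ask Ollama where the model is actually resident; the snippet assumes the `ollama` CLI is installed and on PATH:

```python
# Sanity check (not from the article): confirm where Ollama placed the model.
# `ollama ps` lists loaded models with a processor column showing the CPU/GPU
# split, so "all my RAM is being used" stops being an assumption.
import subprocess

result = subprocess.run(["ollama", "ps"], capture_output=True, text=True, check=True)
print(result.stdout)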

Beach closes self-deprecating (“I’m just a dude in the corner of the internet looking under rocks”) and flags future articles on smaller models, different packages, different configs.

Mapping against Ray Data Co

Strength: medium. Three live mappings:

  1. Self-host-to-avoid-API-fees math. The RDCO agent stack currently runs on Anthropic API (Opus 4.7 with 1M context for the main harness, plus subagents). The recurring temptation — when API costs spike or during any “let’s be more frugal” moment — is “we could just run a local model.” This article is concrete evidence that to match even a middling hosted inference latency on a 7B model, you need real GPU hardware (RTX 5090 minimum, and that still took 62s for one prompt that Claude returns in ~2s). For our use case (agents that fire every 15min and make many short decisions), self-hosted Ollama on anything less than a serious GPU cluster would be unusably slow. Cross-link: 2026-04-15-thariq-claude-code-session-management-1m-context — the productivity of the harness depends on fast turnaround; slow local inference kills the loop.
  2. “Verification layer” thesis reinforcement. One of RDCO’s working beliefs is that the deterministic verification layer around the LLM (audit scripts, invariant checks, typed graph queries) is the defensible asset, not the model itself. Beach’s experiment is a micro-example: he builds the verification layer (wall-clock timer + token counts) around Ollama to learn what the model actually does. The pattern generalizes — you validate the black box by instrumenting its edges, not by trying to understand its internals. (A minimal sketch of this pattern follows the list.)
  3. Permission to publish scrappy, honest experiments. Beach opens by saying “nothing scientific here” and closes by admitting he was wrong about how Ollama uses RAM. The piece works because of the admission, not despite it. Sanity Check newsletter voice calibration: the RDCO newsletter can do the same move — run a small experiment, publish the actual numbers including the surprising ones, admit what you got wrong. Cross-link: ../01-projects/sanity-check-newsletter or the draft-review / voice-match skills.
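A minimal sketch of the edge-instrumentation pattern from mapping 2, in the spirit of Beach's timer rather than a copy of it; `call_model` and the logged field names are hypothetical, not anything from the RDCO harness:

```python
# Sketch of "instrument the edges": wrap any opaque model call in deterministic
# measurements instead of reasoning about its internals. `call_model` is a
# hypothetical callable that takes a prompt and returns the model's text.
import json
import time
from typing import Callable


def instrumented(call_model: Callable[[str], str], prompt: str) -> str:
    start = time.perf_counter()
    output = call_model(prompt)
    elapsed = time.perf_counter() - start
    # Log only what is observable at the boundary: latency and output size.
    print(json.dumps({
        "latency_s": round(elapsed, 2),
        "prompt_chars": len(prompt),
        "output_chars": len(output),
    }))
    return output
```

The design choice is the point: everything logged is observable at the boundary, so the same wrapper works whether the callable hits a local Ollama server, a hosted API, or a stub in tests.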

Gap surfaced: no vault note yet on “the economics of self-hosted LLM inference vs API” — this piece is the first concrete data point. Candidate for a future concept article once we have 2-3 more data points (e.g. Commoncog or Semi-Structured on the same topic).

Paraphrased and quoted sparingly. Full article: https://dataengineeringcentral.substack.com/p/how-ram-and-gpucpu-affects-llm-inference