“I Let Claude Code Autonomously Run My Meta Ads” — Giorgio Liapakis
A 31-day live experiment: full autonomous agent control of a Meta Ads account, $1,500 budget, one human objective set at the start, two minutes of input per day. The results were instructive in ways that had little to do with whether it hit the target.
The Experiment
Liapakis handed a Meta Ads account to Claude Code with a single objective: acquire newsletter subscribers at under $2.50 CPL. He set a 30-day frame, gave it budget control, and stepped back.
Results: 243 leads at $6.14 CPL — 2.5x over target, $1,500 spent.
The most honest line in the piece: the miss wasn’t random noise. The framing of “30-day experiment” induced conservative behavior from the start. The agent hedged because it was told it was an experiment.
The Daily Loop
The system ran without persistent memory across sessions. Each day:
- Fresh session (no memory carryover between days)
- Subprocess reviewed all prior daily logs to reconstruct context
- Pulled current performance data from Meta
- Structured decision-making pass
- Executed changes — or deliberately did nothing
- Documented reasoning
- Git commit for tracking
The only human input was `/let-it-rip` to kick off each day’s session. About two minutes.
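To make the loop concrete, here is a minimal Python sketch of the same shape. The article doesn’t publish code and the real system runs inside Claude Code, so every name here (Decision, load_prior_logs, the fetch_performance / decide / apply_changes callables) is a hypothetical stand-in rather than the actual implementation.

```python
# Illustrative only: hypothetical stand-ins for the real Claude Code harness.
import subprocess
from dataclasses import dataclass, field
from datetime import date
from pathlib import Path

LOG_DIR = Path("daily-logs")

@dataclass
class Decision:
    reasoning: str                                     # why the agent acted, or chose not to
    actions: list[str] = field(default_factory=list)   # changes to execute; empty is valid

def load_prior_logs() -> str:
    """Fresh session each day: rebuild context by reading every prior daily log."""
    return "\n\n".join(p.read_text() for p in sorted(LOG_DIR.glob("day-*.md")))

def run_day(fetch_performance, decide, apply_changes) -> None:
    """One /let-it-rip invocation: review, decide, act (or deliberately don't), document, commit."""
    context = load_prior_logs()
    performance = fetch_performance()                  # current spend, CPL, per-ad metrics from Meta
    decision: Decision = decide(context, performance)  # structured decision-making pass
    if decision.actions:
        apply_changes(decision.actions)                # doing nothing is an allowed outcome
    LOG_DIR.mkdir(exist_ok=True)
    log_path = LOG_DIR / f"day-{date.today().isoformat()}.md"
    log_path.write_text(decision.reasoning)            # the trace the next session will read
    subprocess.run(["git", "add", str(log_path)], check=True)
    subprocess.run(["git", "commit", "-m", f"Daily log {date.today().isoformat()}"], check=True)
```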
By the end, the agent had generated 50+ ad variants across 8 format categories and produced roughly 5,500 lines of reasoning documentation — a trace of every decision and why.
What the Agent Got Right
Creative direction: The agent developed its own quality heuristics — notably the “Local Pizza Shop Test,” a self-defined bar for whether an ad felt authentically local and unpolished vs. corporate. “Ugly” whiteboard and sketch ads consistently outperformed polished creative. The agent noticed this pattern and leaned into it.
Targeting baked into creative: It embedded audience language directly into ad visuals (“For Growth Marketers”) rather than relying purely on Meta’s targeting layer. A precision move that required understanding how ad copy and targeting interact.
Volume and iteration: 50+ variants at this pace and quality level is genuinely impressive autonomous creative work. This is Level 3 territory by the Four Levels of AI Use framework — work that simply wasn’t economical before agents made iteration cheap.
What Broke
Day 16 — The Lead Quality Crisis: CPL looked fine on the dashboard. But when Liapakis examined the actual leads, quality had degraded significantly. The agent had no way to know this — lead quality wasn’t in the signal it was optimizing against. It was doing exactly what it was told to do. This is the core tension: agents optimize for what’s measurable, not what matters.
The Manual Override: Liapakis intervened once, adding an email validation gate to filter bad leads. CPL spiked to $50+. One human override nearly destroyed all progress. This outcome is counterintuitive but important — the system was tuned to a specific optimization landscape, and a blunt structural change reconfigured that landscape entirely. Undoing this required weeks of recovery.
The Framing Problem
The most transferable insight in the piece: how you frame the objective shapes agent behavior completely.
Telling the agent it was running a “30-day experiment” made it act like an experimenter — measuring and learning rather than aggressively acquiring. If the stated objective had been “build a sustainable acquisition engine,” the behavior would have been materially different.
This is not a quirk of Claude Code. It’s a feature of how agents interpret scope. The agent did what it was asked to do — it ran a cautious, measured experiment. If you want an engine, say engine.
Where Human Value Actually Lives
Three roles where humans added irreplaceable value:
- Setting the right objective — not the proxy metric, but the actual goal (quality leads vs. lead count)
- Defining quality beyond metrics — the agent had no access to what a good lead looks like downstream; humans do
- Knowing when not to override — the email gate intervention demonstrates that human intuition about what to fix can be wrong in ways agents can’t warn you about
The agent’s failure to flag lead quality degradation isn’t a bug in the agent. It’s a design gap in what was instrumented. Agents can only surface what’s in their signal. Human value is knowing what should be in the signal.
The Daily Log as Trace-Based Learning
The subprocess design — where each session reviewed prior daily logs before acting — is a practical implementation of what Better Harness: Evals Hill-Climbing calls production trace learning. The agent used its own historical outputs as the primary eval signal. Not a formal harness, but the same epistemology: don’t guess what’s getting better, read the trace.
The 5,500 lines of reasoning documentation are a byproduct of this design. That corpus is also the raw material for a formal harness eval set — if someone wanted to build one.
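A rough sketch of what that conversion could look like, assuming a hypothetical log layout (one markdown file per day, with a Decision section that gets paired against the next day’s Performance section). The real logs’ structure isn’t described in the article.

```python
# Hypothetical: pair each day's documented decision with the next day's
# measured outcome to form eval cases. The "## Decision" / "## Performance"
# section names are assumptions, not the article's actual log format.
import json
from pathlib import Path

def section(text: str, header: str) -> str:
    """Return the body of one markdown section, or '' if the header is absent."""
    if header not in text:
        return ""
    body = text.split(header, 1)[1]
    return body.split("\n## ", 1)[0].strip()

def extract_eval_cases(log_dir: Path) -> list[dict]:
    """Build (decision, next-day outcome) pairs from consecutive daily logs."""
    logs = sorted(log_dir.glob("day-*.md"))
    cases = []
    for today, tomorrow in zip(logs, logs[1:]):
        decision = section(today.read_text(), "## Decision")
        outcome = section(tomorrow.read_text(), "## Performance")
        if decision and outcome:
            cases.append({"source_log": today.name, "decision": decision, "outcome": outcome})
    return cases

if __name__ == "__main__":
    Path("eval-set.json").write_text(json.dumps(extract_eval_cases(Path("daily-logs")), indent=2))
```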
Vault Connections
- Four Levels of AI Use — This experiment is Level 4. It’s not a generic Meta Ads tool; it’s a custom autonomous loop shaped to one account’s creative strategy, quality heuristics, and decision cadence. No off-the-shelf product does this. The “Local Pizza Shop Test” is an agent-developed heuristic that emerged from this specific account’s data — exactly the kind of idiosyncratic artifact that makes Level 4 defensible.
- Better Harness: Evals Hill-Climbing — The daily log subprocess is informal trace-based learning. The gap that caused the lead quality crisis (no signal on downstream quality) is precisely the gap a proper harness eval would have caught. The experiment demonstrates both what trace learning can do and what it misses when the trace is incomplete.
- Ramp AI Adoption Playbook — The “remove every constraint” principle from Geoff Charles maps directly to the `/let-it-rip` daily protocol. Liapakis didn’t just give the agent access — he removed friction from his own oversight process. Two minutes a day is what it looks like to genuinely get out of the way.
- Products for Agents — Meta Ads API becomes a product-for-agents here. The agent consumed it directly — pulling performance data, creating campaigns, iterating on creative — without a human intermediary translating between interface and intent. Any ad platform with a clean API and structured performance data is a natural substrate for this pattern.
- Squarely Growth Strategy — see Actionable for Squarely section below.
Actionable for Squarely
The Direct Parallel
Squarely’s growth strategy already identifies Apple Search Ads and Amazon Ads as viable paid acquisition channels. The Liapakis experiment maps directly — the difference is channel (Meta vs. Apple Search Ads) and conversion event (newsletter sub vs. app install).
App installs are a harder objective than newsletter subs:
- Deeper funnel: Email opt-in is one step. App install requires store visit → install → onboarding → activation. The cost-per-install equivalent of CPL is higher, and cost-per-activated-user is higher still.
- Retention matters: A newsletter sub is a lead that decays slowly. An app install that doesn’t convert to a daily active user is waste. The lead quality crisis at Day 16 would arrive faster with app installs — poor quality is visible in D1/D7 retention, not six weeks later.
- Creative surface is different: App install ads need to communicate the core mechanic (what do you actually do in this app?) in 6-15 seconds. Whiteboard/sketch aesthetic might still work — Wordle-era puzzle content proved raw visual communication beats polish — but the brief is tighter.
Applying the Framing Lesson
The key takeaway from Liapakis: don’t frame it as a test. If Squarely runs an autonomous acquisition loop, the objective should be “build a sustainable daily active user base” — not “run a 30-day ads experiment.” The framing determines the agent’s risk posture. An engine-building frame produces compounding behavior; an experiment frame produces cautious behavior.
The growth strategy already articulates the right frame: a paid loop that feeds into the iOS viral engine, not a one-off campaign. That’s the objective to give an autonomous agent.
What a Squarely Autonomous Loop Could Look Like
Following the Liapakis architecture:
- Signal: Apple Search Ads performance data (impressions, installs, cost) + Firebase D1/D7 retention by cohort (the quality signal Liapakis was missing)
- Daily subprocess: Review prior logs + pull current cohort retention data — not just install cost
- Objective: Minimize cost-per-D7-retained-user, not cost-per-install
- Creative iteration: Ad variations keyed to the puzzle mechanic — the share image format, puzzle grid visuals, “today’s challenge” framing
- Human role: Define what a retained user looks like upstream; don’t touch active campaigns mid-run
The critical improvement over the Liapakis design: instrument quality from day one. Build the D7 retention signal into the daily loop before the agent starts iterating on cost-per-install. Retrofitting it after the fact was the mistake Liapakis made.
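As a sketch of that signal, here is what the objective could look like as code. The data shapes are hypothetical; in practice the join between Apple Search Ads spend and Firebase cohort retention would be messier than a single dataclass.

```python
# Hypothetical cohort shape; real data would come from joining Apple Search Ads
# spend with Firebase retention cohorts.
from dataclasses import dataclass

@dataclass
class CohortStats:
    cohort_date: str        # install date that defines the cohort
    spend: float            # paid spend attributed to this cohort (USD)
    installs: int           # paid installs in the cohort
    d7_retained: int        # cohort users still active on day 7

def cost_per_install(c: CohortStats) -> float | None:
    """Leading indicator only; optimizing this alone recreates the Day 16 failure."""
    return c.spend / c.installs if c.installs else None

def cost_per_d7_retained(c: CohortStats) -> float | None:
    """The objective to minimize; undefined until the cohort has retained users."""
    return c.spend / c.d7_retained if c.d7_retained else None

# Two cohorts with identical cost-per-install but very different quality:
junk = CohortStats("2025-03-01", spend=120.0, installs=60, d7_retained=3)
solid = CohortStats("2025-03-02", spend=120.0, installs=60, d7_retained=15)
assert cost_per_install(junk) == cost_per_install(solid) == 2.0
print(cost_per_d7_retained(junk), cost_per_d7_retained(solid))   # 40.0 vs 8.0
```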
Timing
This is Phase 3+ work per the growth strategy — after the iOS app has enough daily active users to generate meaningful retention data for cohort analysis. Running an autonomous acquisition loop without a retention signal is exactly the lead quality crisis scenario. Wait until the funnel is instrumented, then let it run.
Summary
The experiment missed its cost target but produced something more valuable: a working blueprint for autonomous paid acquisition and a clear map of where agents break down (unmeasured quality, framing-induced conservatism, brittle response to human override). The CPL miss is recoverable. The design learnings are durable.
The most important thing Liapakis built wasn’t the campaign — it was the daily trace and reasoning archive. That corpus is the raw material for every future iteration. The agent that ran this experiment could run a better one next month, because the logs exist.