raydata.co AI crawler status — empirical brief
The question
What do GPTBot, ClaudeBot, and PerplexityBot logs actually show for raydata.co — is the site being crawled, at what frequency, which pages are being indexed, and are there any robots.txt or Cloudflare WAF rules blocking AI crawlers?
Headline finding
raydata.co is actively blocking AI crawlers at two layers — Cloudflare’s managed robots.txt, which serves Disallow: / to nine AI bot user-agents, AND zone-level ai_bots_protection: "block", which 403s ClaudeBot at the WAF edge. GPTBot, PerplexityBot, and Google-Extended show zero hits in the last 7 days; CCBot shows just two. ClaudeBot and OAI-SearchBot probe /robots.txt daily and walk away without touching content (16-26 hits on most days for ClaudeBot, ~10/day for OAI-SearchBot). The only AI crawler reaching content pages is ChatGPT-User — the user-initiated live-retrieval bot (13 hits over 7 days) — which is a useful visibility signal, but five of those thirteen requests 404’d because old /p/... URLs have been replaced with /articles/... paths and ChatGPT still has the dead URLs cached.
GEO is not unblocked. Step 0 fails. None of the Princeton-validated techniques can deliver lift while the site is blocking AI training crawlers and search-time indexing crawlers at the WAF.
What we already know (from the vault)
- Agent SEO — State of the Discipline explicitly listed “What does raydata.co’s actual current crawl status look like in GPTBot / ClaudeBot / PerplexityBot logs? (Need server-log audit.)” as the first open follow-up. This brief closes that loop with empirical data.
- Publishing for Agents — Spec for raydata.co noted raydata.co is closer to a docs-site shape than a Substack newsletter and recommended a `llms.txt` + structured-content posture. As of 2026-05-09 there is no `/llms.txt` deployed (404).
- Brand Architecture documents raydata.co as the umbrella site, hosted on Vercel (the older “Mr. Ben / Data, DAGs, and Doodles” personal site, awaiting rebuild), with Cloudflare in front for DNS/proxy. Cloudflare zone is on the Free plan, account `5d4a0efe...`, zone ID `125cd3e6...`.
- `~/rdco-vault/04-tooling/2026-04-24-resend-setup.md` confirms raydata.co was already verified in Cloudflare DNS prior to the Resend + Sanity Check work.
What the web says
- Cloudflare’s managed robots.txt is on by default for many configurations. `is_robots_txt_managed: true` causes Cloudflare to inject (or prepend onto an existing robots.txt) a `Disallow: /` block for known AI training crawlers — available on all plans including Free (Cloudflare blog, control content use for AI training).
- `ai_bots_protection: "block"` is a separate WAF-layer enforcement. It blocks verified AI crawlers AND unverified bots that behave similarly, returning HTTP 403 at the edge regardless of robots.txt compliance (Cloudflare docs — Block AI bots).
- OpenAI runs three distinct crawlers, each independently controllable: `GPTBot` (training data), `OAI-SearchBot` (ChatGPT search index), and `ChatGPT-User` (real-time fetches when a user prompts ChatGPT to look at a URL). Because `ChatGPT-User` requests are user-initiated, robots.txt may not apply to them, and they are the strongest visibility signal (OpenAI bots overview; Am I Cited — GPTBot vs OAI-SearchBot).
- Anthropic similarly distinguishes `ClaudeBot` (training/general crawl) from `Claude-User` and `Claude-SearchBot` in 2025-2026 docs. Cloudflare’s managed list disallows `ClaudeBot` but not `Claude-User`.
- PerplexityBot is reportedly under-represented in robots.txt disallow directives industry-wide (<5% of top domains per Cloudflare’s audit). Cloudflare’s managed list does NOT include it by default — yet we see zero PerplexityBot hits, suggesting Perplexity isn’t independently choosing to crawl raydata.co (low domain authority + small surface area).
- Industry data: 2025 audits show GPTBot, ClaudeBot, and PerplexityBot do NOT fetch llms.txt at meaningful scale; llms.txt remains useful for IDE-side agents (Cursor, Aider) but not for production LLM training/retrieval crawlers (Longato — llms.txt 2025 audit).
Empirical findings — raydata.co specifically
All data pulled 2026-05-09 ~01:35 EDT via Cloudflare GraphQL Analytics API, zone 125cd3e614d6fd3c369887fb3081bfd5 (raydata.co), windowed to 2026-05-02 00:00 UTC through 2026-05-09 05:00 UTC.
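For reproducibility, here is a minimal sketch of the kind of query used, POSTed to https://api.cloudflare.com/client/v4/graphql with an API token that has Analytics read access. The dataset (`httpRequestsAdaptiveGroups`) and the `userAgent_like` filter are the ones named in Sources; the dimension names shown (`date`, `clientRequestPath`, `edgeResponseStatus`) are assumptions to verify against the current schema.

```graphql
# Sketch: daily hit counts + paths + statuses for one AI crawler UA (here ClaudeBot).
# Swap the userAgent_like pattern for GPTBot, PerplexityBot, OAI-SearchBot, etc.
{
  viewer {
    zones(filter: { zoneTag: "125cd3e614d6fd3c369887fb3081bfd5" }) {
      httpRequestsAdaptiveGroups(
        limit: 1000
        filter: {
          datetime_geq: "2026-05-02T00:00:00Z"
          datetime_leq: "2026-05-09T05:00:00Z"
          userAgent_like: "%ClaudeBot%"
        }
      ) {
        count
        dimensions {
          date
          clientRequestPath
          edgeResponseStatus
        }
      }
    }
  }
}
```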
Configuration state (zone settings)
ai_bots_protection = block # ENFORCING - returns 403 at edge for AI bots
content_bots_protection = disabled
crawler_protection = disabled
is_robots_txt_managed = true # Cloudflare auto-injects Disallow:/ block
cf_robots_variant = off
fight_mode = false # generic Bot Fight Mode is off
plan = Free Website
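For reference, a hedged sketch of how this state can be read (and later flipped) over the Bot Management settings endpoint named in Sources. The field names match the listing above; the PUT payload is hypothetical — confirm the current schema, token scopes, and partial-update behavior against Cloudflare’s docs before writing anything back.

```sh
# Read the zone's bot-management settings (ai_bots_protection, is_robots_txt_managed, etc.).
# $CF_API_TOKEN is assumed to be a token with Bot Management read/edit on this zone.
curl -s "https://api.cloudflare.com/client/v4/zones/125cd3e614d6fd3c369887fb3081bfd5/bot_management" \
  -H "Authorization: Bearer $CF_API_TOKEN"

# If the policy decision later lands on opting in, the same endpoint accepts a PUT.
# Hypothetical payload — verify field names and whether a partial body is accepted.
curl -s -X PUT "https://api.cloudflare.com/client/v4/zones/125cd3e614d6fd3c369887fb3081bfd5/bot_management" \
  -H "Authorization: Bearer $CF_API_TOKEN" \
  -H "Content-Type: application/json" \
  --data '{"ai_bots_protection": "disabled"}'
```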
Live robots.txt body served at raydata.co/robots.txt
Cloudflare-managed content (200 OK on GET, 1738 bytes). Disallows nine AI crawlers explicitly:
User-agent: *
Content-Signal: search=yes,ai-train=no
Allow: /
User-agent: Amazonbot
Disallow: /

User-agent: Applebot-Extended
Disallow: /

User-agent: Bytespider
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: CloudflareBrowserRenderingCrawler
Disallow: /

User-agent: Google-Extended
Disallow: /

User-agent: GPTBot
Disallow: /

User-agent: meta-externalagent
Disallow: /
Note: PerplexityBot, OAI-SearchBot, ChatGPT-User, Claude-User, and Claude-SearchBot are NOT in the managed disallow list. They get the wildcard User-agent: * rule with Content-Signal: search=yes, ai-train=no, Allow: / — i.e., search indexing is permitted but training is not.
Side note: at the apex https://raydata.co, /robots.txt 308-redirects to https://www.raydata.co/robots.txt. A HEAD request to /robots.txt returns 404 with a Vercel 404 page (the underlying Vercel project has no robots.txt route), but a GET is intercepted by Cloudflare and returns the managed body. The 200 GET is what crawlers actually see.
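A quick reproduction sketch of that asymmetry (statuses as observed on 2026-05-09; they may change if the Vercel project later ships its own robots.txt):

```sh
curl -sI https://raydata.co/robots.txt        # apex: 308 redirect to https://www.raydata.co/robots.txt
curl -sI https://www.raydata.co/robots.txt    # HEAD: 404 (Vercel project has no robots.txt route)
curl -s  https://www.raydata.co/robots.txt    # GET: 200, Cloudflare-managed body with the disallow list
```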
AI crawler hits per day (last 7 days)
| Date (UTC) | GPTBot | ClaudeBot | PerplexityBot | ChatGPT-User | OAI-SearchBot | Google-Ext | CCBot |
|---|---|---|---|---|---|---|---|
| 2026-05-02 | 0 | 20 | 0 | 0 | 11 | 0 | 0 |
| 2026-05-03 | 0 | 16 | 0 | 0 | 11 | 0 | 0 |
| 2026-05-04 | 0 | 23 | 0 | 0 | 12 | 0 | 0 |
| 2026-05-05 | 0 | 18 | 0 | 8 | 10 | 0 | 0 |
| 2026-05-06 | 0 | 26 | 0 | 4 | 11 | 0 | 1 |
| 2026-05-07 | 0 | 16 | 0 | 0 | 10 | 0 | 1 |
| 2026-05-08 | 0 | 2 | 0 | 1 | 9 | 0 | 0 |
| 7-day total | 0 | 121 | 0 | 13 | 74 | 0 | 2 |
Path-level breakdown
ClaudeBot: aside from the sitemap probes that the WAF 403s (see below), 100% of its hits are /robots.txt (200 OK with the disallow body). Zero content-page fetches. Anthropic’s crawler is doing its job — it reads robots.txt every few hours and respects the Disallow: /.
OAI-SearchBot: 100% of hits are /robots.txt (200 OK). Despite NOT being in the disallow list, OpenAI’s search crawler is also only reading robots.txt and not crawling content. Possibly because its general policy is to back off from sites where the broader OpenAI suite (GPTBot) is disallowed. Possibly because raydata.co has no inbound links pointing it at content URLs to fetch. Either way: zero content indexing.
ChatGPT-User (live retrievals — the actually interesting signal):
| Date | Path | Status | Count |
|---|---|---|---|
| 2026-05-08 | /articles/analytics-crafting-a-way-forward | 200 | 1 |
| 2026-05-06 | /p/getting-weird-with-squarely | 404 | 4 |
| 2026-05-05 | /p/analytics-antenna-and-good-guy-amazon | 404 | 1 |
| 2026-05-05 | / | 200 | 4 |
| 2026-05-05 | / | 308 | 3 |
ChatGPT users prompted the model 13 times in the last week to fetch raydata.co content. 5 of those 13 requests went to dead URLs (/p/... was the old Substack-style path; live URLs are /articles/...). The ChatGPT memory layer has stale URLs from before the path migration.
WAF 403 enforcement — confirmed at edge:
ClaudeBot is being 403’d when it tries to fetch sitemap files. Daily counts of ClaudeBot 403s on /sitemap.xml + /sitemap-index.xml:
| Date | sitemap.xml 403s | sitemap-index.xml 403s |
|---|---|---|
| 2026-05-02 | 23 | 16 |
| 2026-05-03 | 16 | 15 |
| 2026-05-04 | 27 | 14 |
| 2026-05-05 | 12 | 12 |
This proves ai_bots_protection: "block" is actually returning 403s — not just modifying robots.txt. Even if we removed ClaudeBot from the disallow list, the WAF rule would still block it.
GPTBot, PerplexityBot, Google-Extended: zero requests of any kind in 7 days — not even a robots.txt fetch in this window, and no attempts at content URLs. PerplexityBot in particular appears not to be crawling raydata.co at all.
Other relevant traffic patterns
The last 24h top-30 user-agents are dominated by:
- `/healthcheck` from Vercel infrastructure (145 hits)
- WordPress vulnerability scanners crawling `/wp-*` paths (hundreds of hits, mostly 301/404)
- Real human iPhone Safari traffic on `/` and `/articles/...`
- `SERankingBacklinksBot`, `vercel-fetch`, generic browsers
- A small amount of `/api/decisions` polling (likely the agent-channel infrastructure)
The site is alive and being hit, just not by AI training/search crawlers.
Convergences and contradictions
Convergences
- Vault, web sources, and empirical data all agree: GPTBot/ClaudeBot/PerplexityBot respect robots.txt at the high-90%+ level. The previous research brief’s caveat “AI crawlers do read robots.txt” is empirically validated for raydata.co.
- ChatGPT-User reaching content pages despite robots.txt also matches the docs — user-initiated requests bypass robots.txt rules.
- Cloudflare’s managed robots.txt is exactly what its docs describe — same UA list, same `Disallow: /` syntax, same `Content-Signal` directive.
Contradictions / surprises
- The HEAD vs GET behavior of `/robots.txt` on Vercel + Cloudflare is unusual and worth documenting. A naive monitoring tool that uses HEAD would conclude the site has no robots.txt; a real crawler doing GET would see the Cloudflare-managed disallow. The crawler view is what matters.
- OAI-SearchBot is NOT in Cloudflare’s disallow list, yet shows zero content fetches. Possible explanation: OpenAI’s crawler treats the GPTBot disallow as a coupled signal across the product family. Or raydata.co simply isn’t surfaced by enough inbound links for it to bother.
- ChatGPT-User 404s reveal a content-routing bug, not a crawler bug: the site’s path migration from `/p/...` (old Substack-style) to `/articles/...` left no 301 redirects, so AI memory of old URLs hits dead ends.
Synthesis for RDCO
The GEO bet, as currently framed, is blocked at Step 0. The Apr 22 brief recommended writing 15-25 citation-dense reference pieces to get LLM-cited. None of that work matters while:
- Cloudflare’s `ai_bots_protection: "block"` is firing 403s at ClaudeBot at the WAF.
- The managed robots.txt is telling GPTBot, ClaudeBot, CCBot, Google-Extended, Applebot-Extended, Bytespider, Amazonbot, and meta-externalagent to stay off the entire site.
- Even the cooperative bots (OAI-SearchBot, ChatGPT-User) are bouncing off either robots.txt or 404s.
The asymmetric upside the Apr 22 brief identified — citation-dense writing helps low-authority domains 3-5x more than high-authority ones — applies to content that LLMs can actually read. We are not in that regime.
Decision needed: opt in or stay out? This is a real tradeoff, not a no-brainer:
- Argument for staying blocked: training-data extraction without compensation is the same dynamic the founder has flagged in `~/rdco-vault/06-reference/2026-04-19-michael-dean-elon-writing-prize-scheme.md`. The Cloudflare default exists because most operators don’t want their content training competitor models for free. The published Sanity Check pieces have value; giving them to Anthropic/OpenAI to train future models is a real cost.
- Argument for opting in: the GEO thesis from the Apr 22 brief is explicitly about getting cited, not getting trained. The right configuration is `Content-Signal: search=yes, ai-train=no` (already set) PLUS allowing search-time crawlers (OAI-SearchBot, PerplexityBot, ClaudeBot if Anthropic respects ai-train=no, Claude-SearchBot) while still blocking training crawlers (GPTBot, CCBot, Bytespider). The current managed list is too aggressive — it disallows ClaudeBot wholesale, but ClaudeBot’s 2026 documented behavior is that it respects `Content-Signal` directives, so a `Disallow: /` is a sledgehammer.
Concrete next moves (recommend, do not auto-execute):
- Decide the policy. Founder picks between: (a) keep the current Cloudflare defaults, accept the GEO bet is dead; (b) opt in to AI search/citation but not training — disable `ai_bots_protection`, replace the managed robots.txt with a hand-written file that allows search-time crawlers and carries the `Content-Signal: search=yes, ai-train=no` directive; (c) opt in fully — allow training too, treat raydata.co content as deliberately given to the commons. Recommend option (b).
- Fix the content-routing bug regardless of policy. The 404s on `/p/getting-weird-with-squarely` and `/p/analytics-antenna-and-good-guy-amazon` are dropping live ChatGPT users into a hole right now. Add `/p/*` -> `/articles/*` permanent (301/308) redirects in the Vercel Next.js config, as sketched after this list. This is reversible, has zero downside, and rescues real user traffic.
- If opting in (option b): turn off `is_robots_txt_managed`, write our own `/public/robots.txt` with explicit Allow for OAI-SearchBot, ChatGPT-User, PerplexityBot, Claude-User, Claude-SearchBot, Googlebot, Bingbot — and explicit Disallow for GPTBot, CCBot, Bytespider, Amazonbot (sketched after this list). Set `ai_bots_protection` to `disabled`. Then deploy `/llms.txt` per the Apr 22 publishing-for-agents spec. Then re-run this empirical audit in 30 days to verify content URLs are being fetched.
- Decide whether to actually rebuild raydata.co before opting in. Per the Brand Architecture doc, the current site is the older “Mr. Ben / Data, DAGs, and Doodles” personal site awaiting rebuild as the umbrella. Opening the gates BEFORE the rebuild means LLMs train/index on content the founder explicitly considers stale. Sequence-wise: rebuild umbrella -> opt in -> measure.
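A minimal sketch of the rescue redirect from the second item above, assuming the article slugs are identical across the old /p/ and new /articles/ schemes (spot-check a few before shipping):

```js
// next.config.js — hypothetical sketch; merge into the site's existing config rather than replacing it.
module.exports = {
  async redirects() {
    return [
      {
        source: '/p/:slug',             // old Substack-style path
        destination: '/articles/:slug', // current path scheme
        permanent: true,                // Next.js emits a 308 permanent redirect; crawlers treat it like a 301
      },
    ];
  },
};
```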
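And a hand-written robots.txt sketch for option (b) from the third item above. The allow/disallow split mirrors the bot names already listed in that item; treat the exact user-agent spellings and the Content-Signal line as assumptions to verify against each vendor’s current docs before deploying.

```txt
# /public/robots.txt — option (b) sketch: searchable/citable, not trainable
User-agent: *
Content-Signal: search=yes, ai-train=no
Allow: /

# Search-time and user-initiated crawlers: allowed
User-agent: OAI-SearchBot
Allow: /

User-agent: ChatGPT-User
Allow: /

User-agent: PerplexityBot
Allow: /

User-agent: Claude-User
Allow: /

User-agent: Claude-SearchBot
Allow: /

User-agent: Googlebot
Allow: /

User-agent: Bingbot
Allow: /

# Training crawlers: blocked
User-agent: GPTBot
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: Bytespider
Disallow: /

User-agent: Amazonbot
Disallow: /

Sitemap: https://www.raydata.co/sitemap.xml
```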
Bottom line: GEO can’t be the next thing the founder ships. The unblocking decision (and ideally the umbrella rebuild) has to come first. Estimated effort to unblock: ~2 hours of config work IF the policy is decided. Estimated effort to rebuild umbrella first: separate scoping exercise — last estimated at hq.raydata.co-class effort (multi-day).
Open follow-ups
- Should the founder draft a positioning piece on the “ai-train=no, search=yes” Content-Signal stance? It’s an opinionated configuration that could itself become a citation-bait artifact. Worth a brief.
- What does Sanity Check (`sc.raydata.co`) look like under the same audit? Different Cloudflare Pages project, possibly different defaults, and Sanity Check is the surface that should most aggressively want LLM citation. Run the same query against the sc subdomain.
- Does Cloudflare’s Free plan expose enough analytics granularity for a recurring monthly crawler-visibility audit, or does this require the Pro plan ($20/mo) for proper Bot Analytics? Need to test once `ai_bots_protection` is disabled and there’s actual AI bot traffic to measure.
- Once the robots.txt is hand-written, what’s the right `Content-Signal` combination for `User-agent: *` vs per-bot rules? The default `search=yes, ai-train=no` may be too broad; e.g., we may want to allow Anthropic/OpenAI ai-input (RAG/grounding) but not ai-train. An illustrative fragment follows this list.
- The `/p/*` -> `/articles/*` 404s suggest a broader URL-stability problem. Worth a vault audit of inbound links from external sources (X posts, newsletter shares) to see how many other dead URLs are circulating.
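An illustrative fragment for that Content-Signal follow-up, assuming the three-signal vocabulary (search / ai-input / ai-train) used by the Content Signals convention that Cloudflare’s managed file already follows; syntax and per-bot override behavior should be confirmed before relying on it.

```txt
# Wildcard: indexable, usable for grounding/RAG, not for training
User-agent: *
Content-Signal: search=yes, ai-input=yes, ai-train=no
Allow: /

# Hypothetical stricter per-bot override: no AI use at all for Bytespider
User-agent: Bytespider
Content-Signal: search=no, ai-input=no, ai-train=no
Disallow: /
```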
Sources
- Empirical: Cloudflare GraphQL Analytics API, zone `125cd3e614d6fd3c369887fb3081bfd5` (raydata.co), `httpRequestsAdaptiveGroups` filtered by `userAgent_like` per AI bot UA, queried 2026-05-09 ~01:35 EDT.
- Empirical: Cloudflare Bot Management settings endpoint `/zones/<id>/bot_management`, 2026-05-09.
- Empirical: `curl https://www.raydata.co/robots.txt`, 2026-05-09.
- Cloudflare — Block AI bots docs
- Cloudflare — managed robots.txt setting
- Cloudflare blog — Control content use for AI training
- Cloudflare blog — Declare your AIndependence (one-click block)
- Cloudflare AI Crawl Control overview
- OpenAI — Overview of OpenAI Crawlers
- Am I Cited — GPTBot vs OAI-SearchBot
- Momentic — Top AI Search Crawlers + User Agents
- Longato — llms.txt 2025 audit
- Apr 22 GEO state-of-discipline brief
- Apr 22 Publishing for Agents spec
- Brand Architecture