raydata.co AI crawler status — empirical brief
The question
What do GPTBot, ClaudeBot, and PerplexityBot logs actually show for raydata.co — is the site being crawled, at what frequency, which pages are being indexed, and are there any robots.txt or Cloudflare WAF rules blocking AI crawlers?
Headline finding
raydata.co is actively blocking AI crawlers at two layers — Cloudflare’s managed robots.txt, which serves Disallow: / to nine AI bot user-agents, AND zone-level ai_bots_protection: "block", which 403s ClaudeBot at the WAF edge. GPTBot, PerplexityBot, and Google-Extended show zero hits in the last 7 days; CCBot shows just two. ClaudeBot and OAI-SearchBot probe /robots.txt daily and walk away without touching content (16-26 hits on most days for ClaudeBot, ~10/day for OAI-SearchBot). The only AI crawler reaching content pages is ChatGPT-User — the user-initiated live-retrieval bot (13 hits over 7 days) — which is a useful visibility signal, but five of those thirteen requests 404’d because old /p/... URLs have been replaced with /articles/... paths and ChatGPT still has the dead URLs cached.
GEO is not unblocked. Step 0 fails. None of the Princeton-validated techniques can deliver lift while the site is blocking AI training crawlers and search-time indexing crawlers at the WAF.
What we already know (from the vault)
- Agent SEO — State of the Discipline explicitly listed “What does raydata.co’s actual current crawl status look like in GPTBot / ClaudeBot / PerplexityBot logs? (Need server-log audit.)” as the first open follow-up. This brief closes that loop with empirical data.
- Publishing for Agents — Spec for raydata.co noted raydata.co is closer to a docs-site shape than a Substack newsletter and recommended a `llms.txt` + structured-content posture. As of 2026-05-09 there is no `/llms.txt` deployed (404).
- Brand Architecture documents raydata.co as the umbrella site, hosted on Vercel (the older “Mr. Ben / Data, DAGs, and Doodles” personal site, awaiting rebuild), with Cloudflare in front for DNS/proxy. Cloudflare zone is on the Free plan, account `5d4a0efe...`, zone ID `125cd3e6...`.
- `~/rdco-vault/04-tooling/2026-04-24-resend-setup.md` confirms raydata.co was already verified in Cloudflare DNS prior to the Resend + Sanity Check work.
What the web says
- Cloudflare’s managed robots.txt is on by default for many configurations. `is_robots_txt_managed: true` causes Cloudflare to inject (or prepend onto an existing robots.txt) a `Disallow: /` block for known AI training crawlers — available on all plans including Free (Cloudflare blog, control content use for AI training).
- `ai_bots_protection: "block"` is a separate WAF-layer enforcement. It blocks verified AI crawlers AND unverified bots that behave similarly, returning HTTP 403 at the edge regardless of robots.txt compliance (Cloudflare docs — Block AI bots).
- OpenAI runs three distinct crawlers, each independently controllable: `GPTBot` (training data), `OAI-SearchBot` (ChatGPT search index), and `ChatGPT-User` (real-time fetches when a user prompts ChatGPT to look at a URL). Because `ChatGPT-User` requests are user-initiated, robots.txt may not apply to them, and they are the strongest visibility signal (OpenAI bots overview; Am I Cited — GPTBot vs OAI-SearchBot).
- Anthropic similarly distinguishes `ClaudeBot` (training/general crawl) from `Claude-User` and `Claude-SearchBot` in 2025-2026 docs. Cloudflare’s managed list disallows `ClaudeBot` but not `Claude-User`.
- PerplexityBot is reportedly under-represented in robots.txt disallow directives industry-wide (<5% of top domains per Cloudflare’s audit). Cloudflare’s managed list does NOT include it by default — yet we see zero PerplexityBot hits, suggesting Perplexity isn’t independently choosing to crawl raydata.co (low domain authority + small surface area).
- Industry data: 2025 audits show GPTBot, ClaudeBot, and PerplexityBot do NOT fetch llms.txt at meaningful scale; llms.txt remains useful for IDE-side agents (Cursor, Aider) but not for production LLM training/retrieval crawlers (Longato — llms.txt 2025 audit).
Empirical findings — raydata.co specifically
All data pulled 2026-05-09 ~01:35 EDT via Cloudflare GraphQL Analytics API, zone 125cd3e614d6fd3c369887fb3081bfd5 (raydata.co), windowed to 2026-05-02 00:00 UTC through 2026-05-09 05:00 UTC.
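For reproducibility, here is a minimal sketch of the kind of query used, POSTed to https://api.cloudflare.com/client/v4/graphql with an API token that has Analytics read access. The dataset (`httpRequestsAdaptiveGroups`) and the `userAgent_like` filter are the ones named in Sources; the dimension names shown (`date`, `clientRequestPath`, `edgeResponseStatus`) are assumptions to verify against the current schema.

```graphql
# Sketch: daily hit counts + paths + statuses for one AI crawler UA (here ClaudeBot).
# Swap the userAgent_like pattern for GPTBot, PerplexityBot, OAI-SearchBot, etc.
{
  viewer {
    zones(filter: { zoneTag: "125cd3e614d6fd3c369887fb3081bfd5" }) {
      httpRequestsAdaptiveGroups(
        limit: 1000
        filter: {
          datetime_geq: "2026-05-02T00:00:00Z"
          datetime_leq: "2026-05-09T05:00:00Z"
          userAgent_like: "%ClaudeBot%"
        }
      ) {
        count
        dimensions {
          date
          clientRequestPath
          edgeResponseStatus
        }
      }
    }
  }
}
```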
Configuration state (zone settings)
ai_bots_protection = block # ENFORCING - returns 403 at edge for AI bots
content_bots_protection = disabled
crawler_protection = disabled
is_robots_txt_managed = true # Cloudflare auto-injects Disallow:/ block
cf_robots_variant = off
fight_mode = false # generic Bot Fight Mode is off
plan = Free Website
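For reference, a hedged sketch of how this state can be read (and later flipped) over the Bot Management settings endpoint named in Sources. The field names match the listing above; the PUT payload is hypothetical — confirm the current schema, token scopes, and partial-update behavior against Cloudflare’s docs before writing anything back.

```sh
# Read the zone's bot-management settings (ai_bots_protection, is_robots_txt_managed, etc.).
# $CF_API_TOKEN is assumed to be a token with Bot Management read/edit on this zone.
curl -s "https://api.cloudflare.com/client/v4/zones/125cd3e614d6fd3c369887fb3081bfd5/bot_management" \
  -H "Authorization: Bearer $CF_API_TOKEN"

# If the policy decision later lands on opting in, the same endpoint accepts a PUT.
# Hypothetical payload — verify field names and whether a partial body is accepted.
curl -s -X PUT "https://api.cloudflare.com/client/v4/zones/125cd3e614d6fd3c369887fb3081bfd5/bot_management" \
  -H "Authorization: Bearer $CF_API_TOKEN" \
  -H "Content-Type: application/json" \
  --data '{"ai_bots_protection": "disabled"}'
```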
Live robots.txt body served at raydata.co/robots.txt
Cloudflare-managed content (200 OK on GET, 1738 bytes). Disallows nine AI crawlers explicitly:
User-agent: *
Content-Signal: search=yes,ai-train=no
Allow: /
User-agent: Amazonbot
Disallow: /

User-agent: Applebot-Extended
Disallow: /

User-agent: Bytespider
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: CloudflareBrowserRenderingCrawler
Disallow: /

User-agent: Google-Extended
Disallow: /

User-agent: GPTBot
Disallow: /

User-agent: meta-externalagent
Disallow: /
Note: PerplexityBot, OAI-SearchBot, ChatGPT-User, Claude-User, and Claude-SearchBot are NOT in the managed disallow list. They get the wildcard User-agent: * rule with Content-Signal: search=yes, ai-train=no, Allow: / — i.e., search indexing is permitted but training is not.
Side note: at the apex https://raydata.co, /robots.txt 308-redirects to https://www.raydata.co/robots.txt. A HEAD request to /robots.txt returns 404 with a Vercel 404 page (the underlying Vercel project has no robots.txt route), but a GET is intercepted by Cloudflare and returns the managed body. The 200 GET is what crawlers actually see.
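A quick reproduction sketch of that asymmetry (statuses as observed on 2026-05-09; they may change if the Vercel project later ships its own robots.txt):

```sh
curl -sI https://raydata.co/robots.txt        # apex: 308 redirect to https://www.raydata.co/robots.txt
curl -sI https://www.raydata.co/robots.txt    # HEAD: 404 (Vercel project has no robots.txt route)
curl -s  https://www.raydata.co/robots.txt    # GET: 200, Cloudflare-managed body with the disallow list
```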
AI crawler hits per day (last 7 days)
| Date (UTC) | GPTBot | ClaudeBot | PerplexityBot | ChatGPT-User | OAI-SearchBot | Google-Ext | CCBot |
|---|---|---|---|---|---|---|---|
| 2026-05-02 | 0 | 20 | 0 | 0 | 11 | 0 | 0 |
| 2026-05-03 | 0 | 16 | 0 | 0 | 11 | 0 | 0 |
| 2026-05-04 | 0 | 23 | 0 | 0 | 12 | 0 | 0 |
| 2026-05-05 | 0 | 18 | 0 | 8 | 10 | 0 | 0 |
| 2026-05-06 | 0 | 26 | 0 | 4 | 11 | 0 | 1 |
| 2026-05-07 | 0 | 16 | 0 | 0 | 10 | 0 | 1 |
| 2026-05-08 | 0 | 2 | 0 | 1 | 9 | 0 | 0 |
| 7-day total | 0 | 121 | 0 | 13 | 74 | 0 | 2 |
Path-level breakdown
ClaudeBot: aside from the sitemap probes that the WAF 403s (see below), 100% of its hits are /robots.txt (200 OK with the disallow body). Zero content-page fetches. Anthropic’s crawler is doing its job — it reads robots.txt every few hours and respects the Disallow: /.
OAI-SearchBot: 100% of hits are /robots.txt (200 OK). Despite NOT being in the disallow list, OpenAI’s search crawler is also only reading robots.txt and not crawling content. Possibly because its general policy is to back off from sites where the broader OpenAI suite (GPTBot) is disallowed. Possibly because raydata.co has no inbound links pointing it at content URLs to fetch. Either way: zero content indexing.
ChatGPT-User (live retrievals — the actually interesting signal):
| Date | Path | Status | Count |
|---|---|---|---|
| 2026-05-08 | /articles/analytics-crafting-a-way-forward | 200 | 1 |
| 2026-05-06 | /p/getting-weird-with-squarely | 404 | 4 |
| 2026-05-05 | /p/analytics-antenna-and-good-guy-amazon | 404 | 1 |
| 2026-05-05 | / | 200 | 4 |
| 2026-05-05 | / | 308 | 3 |
ChatGPT users prompted the model 13 times in the last week to fetch raydata.co content. 5 of those 13 requests went to dead URLs (/p/... was the old Substack-style path; live URLs are /articles/...). The ChatGPT memory layer has stale URLs from before the path migration.
WAF 403 enforcement — confirmed at edge:
ClaudeBot is being 403’d when it tries to fetch sitemap files. Daily counts of ClaudeBot 403s on /sitemap.xml + /sitemap-index.xml:
| Date | sitemap.xml 403s | sitemap-index.xml 403s |
|---|---|---|
| 2026-05-02 | 23 | 16 |
| 2026-05-03 | 16 | 15 |
| 2026-05-04 | 27 | 14 |
| 2026-05-05 | 12 | 12 |
This proves ai_bots_protection: "block" is actually returning 403s — not just modifying robots.txt. Even if we removed ClaudeBot from the disallow list, the WAF rule would still block it.
GPTBot, PerplexityBot, Google-Extended: zero requests of any kind in 7 days — not even a robots.txt fetch in this window, and no attempts at content URLs. PerplexityBot in particular appears not to be crawling raydata.co at all.
Other relevant traffic patterns
The last 24h top-30 user-agents are dominated by:
- `/healthcheck` from Vercel infrastructure (145 hits)
- WordPress vulnerability scanners crawling `/wp-*` paths (hundreds of hits, mostly 301/404)
- Real human iPhone Safari traffic on `/` and `/articles/...`
- `SERankingBacklinksBot`, `vercel-fetch`, generic browsers
- A small amount of `/api/decisions` polling (likely the agent-channel infrastructure)
The site is alive and being hit, just not by AI training/search crawlers.
Convergences and contradictions
Convergences
- Vault, web sources, and empirical data all agree: GPTBot/ClaudeBot/PerplexityBot respect robots.txt at the high-90%+ level. The previous research brief’s caveat “AI crawlers do read robots.txt” is empirically validated for raydata.co.
- ChatGPT-User reaching content pages despite robots.txt also matches the docs — user-initiated requests bypass robots.txt rules.
- Cloudflare’s managed robots.txt is exactly what its docs describe — same UA list, same `Disallow: /` syntax, same `Content-Signal` directive.
Contradictions / surprises
- The HEAD vs GET behavior of `/robots.txt` on Vercel + Cloudflare is unusual and worth documenting. A naive monitoring tool that uses HEAD would conclude the site has no robots.txt; a real crawler doing GET would see the Cloudflare-managed disallow. The crawler view is what matters.
- OAI-SearchBot is NOT in Cloudflare’s disallow list, yet shows zero content fetches. Possible explanation: OpenAI’s crawler treats the GPTBot disallow as a coupled signal across the product family. Or raydata.co simply isn’t surfaced by enough inbound links for it to bother.
- ChatGPT-User 404s reveal a content-routing bug, not a crawler bug: the site’s path migration from `/p/...` (old Substack-style) to `/articles/...` left no 301 redirects, so AI memory of old URLs hits dead ends.
Synthesis for RDCO
The GEO bet, as currently framed, is blocked at Step 0. The Apr 22 brief recommended writing 15-25 citation-dense reference pieces to get LLM-cited. None of that work matters while:
- Cloudflare’s `ai_bots_protection: "block"` is firing 403s at ClaudeBot at the WAF.
- The managed robots.txt is telling GPTBot, ClaudeBot, CCBot, Google-Extended, Applebot-Extended, Bytespider, Amazonbot, and meta-externalagent to stay off the entire site.
- Even the cooperative bots (OAI-SearchBot, ChatGPT-User) are bouncing off either robots.txt or 404s.
The asymmetric upside the Apr 22 brief identified — citation-dense writing helps low-authority domains 3-5x more than high-authority ones — applies to content that LLMs can actually read. We are not in that regime.
Decision needed: opt in or stay out? This is a real tradeoff, not a no-brainer:
- Argument for staying blocked: training-data extraction without compensation is the same dynamic the founder has flagged in `~/rdco-vault/06-reference/2026-04-19-michael-dean-elon-writing-prize-scheme.md`. The Cloudflare default exists because most operators don’t want their content training competitor models for free. The published Sanity Check pieces have value; giving them to Anthropic/OpenAI to train future models is a real cost.
- Argument for opting in: the GEO thesis from the Apr 22 brief is explicitly about getting cited, not getting trained. The right configuration is `Content-Signal: search=yes, ai-train=no` (already set) PLUS allowing search-time crawlers (OAI-SearchBot, PerplexityBot, ClaudeBot if Anthropic respects ai-train=no, Claude-SearchBot) while still blocking training crawlers (GPTBot, CCBot, Bytespider). The current managed list is too aggressive — it disallows ClaudeBot wholesale, but ClaudeBot’s 2026 documented behavior is that it respects `Content-Signal` directives, so a `Disallow: /` is a sledgehammer.
Concrete next moves (recommend, do not auto-execute):
- Decide the policy. Founder picks between: (a) keep the current Cloudflare defaults, accept the GEO bet is dead; (b) opt in to AI search/citation but not training — disable `ai_bots_protection`, replace the managed robots.txt with a hand-written file that allows search-time crawlers and carries the `Content-Signal: search=yes, ai-train=no` directive; (c) opt in fully — allow training too, treat raydata.co content as deliberately given to the commons. Recommend option (b).
- Fix the content-routing bug regardless of policy. The 404s on `/p/getting-weird-with-squarely` and `/p/analytics-antenna-and-good-guy-amazon` are dropping live ChatGPT users into a hole right now. Add `/p/*` -> `/articles/*` permanent (301/308) redirects in the Vercel Next.js config, as sketched after this list. This is reversible, has zero downside, and rescues real user traffic.
- If opting in (option b): turn off `is_robots_txt_managed`, write our own `/public/robots.txt` with explicit Allow for OAI-SearchBot, ChatGPT-User, PerplexityBot, Claude-User, Claude-SearchBot, Googlebot, Bingbot — and explicit Disallow for GPTBot, CCBot, Bytespider, Amazonbot (sketched after this list). Set `ai_bots_protection` to `disabled`. Then deploy `/llms.txt` per the Apr 22 publishing-for-agents spec. Then re-run this empirical audit in 30 days to verify content URLs are being fetched.
- Decide whether to actually rebuild raydata.co before opting in. Per the Brand Architecture doc, the current site is the older “Mr. Ben / Data, DAGs, and Doodles” personal site awaiting rebuild as the umbrella. Opening the gates BEFORE the rebuild means LLMs train/index on content the founder explicitly considers stale. Sequence-wise: rebuild umbrella -> opt in -> measure.
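A minimal sketch of the rescue redirect from the second item above, assuming the article slugs are identical across the old /p/ and new /articles/ schemes (spot-check a few before shipping):

```js
// next.config.js — hypothetical sketch; merge into the site's existing config rather than replacing it.
module.exports = {
  async redirects() {
    return [
      {
        source: '/p/:slug',             // old Substack-style path
        destination: '/articles/:slug', // current path scheme
        permanent: true,                // Next.js emits a 308 permanent redirect; crawlers treat it like a 301
      },
    ];
  },
};
```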
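And a hand-written robots.txt sketch for option (b) from the third item above. The allow/disallow split mirrors the bot names already listed in that item; treat the exact user-agent spellings and the Content-Signal line as assumptions to verify against each vendor’s current docs before deploying.

```txt
# /public/robots.txt — option (b) sketch: searchable/citable, not trainable
User-agent: *
Content-Signal: search=yes, ai-train=no
Allow: /

# Search-time and user-initiated crawlers: allowed
User-agent: OAI-SearchBot
Allow: /

User-agent: ChatGPT-User
Allow: /

User-agent: PerplexityBot
Allow: /

User-agent: Claude-User
Allow: /

User-agent: Claude-SearchBot
Allow: /

User-agent: Googlebot
Allow: /

User-agent: Bingbot
Allow: /

# Training crawlers: blocked
User-agent: GPTBot
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: Bytespider
Disallow: /

User-agent: Amazonbot
Disallow: /

Sitemap: https://www.raydata.co/sitemap.xml
```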
Bottom line: GEO can’t be the next thing the founder ships. The unblocking decision (and ideally the umbrella rebuild) has to come first. Estimated effort to unblock: ~2 hours of config work IF the policy is decided. Estimated effort to rebuild umbrella first: separate scoping exercise — last estimated at hq.raydata.co-class effort (multi-day).
Open follow-ups
- Should the founder draft a positioning piece on the “ai-train=no, search=yes” Content-Signal stance? It’s an opinionated configuration that could itself become a citation-bait artifact. Worth a brief.
- What does Sanity Check (`sc.raydata.co`) look like under the same audit? Different Cloudflare Pages project, possibly different defaults, and Sanity Check is the surface that should most aggressively want LLM citation. Run the same query against the sc subdomain.
- Does Cloudflare’s Free plan expose enough analytics granularity for a recurring monthly crawler-visibility audit, or does this require the Pro plan ($20/mo) for proper Bot Analytics? Need to test once `ai_bots_protection` is disabled and there’s actual AI bot traffic to measure.
- Once the robots.txt is hand-written, what’s the right `Content-Signal` combination for `User-agent: *` vs per-bot rules? The default `search=yes, ai-train=no` may be too broad; e.g., we may want to allow Anthropic/OpenAI ai-input (RAG/grounding) but not ai-train. An illustrative fragment follows this list.
- The `/p/*` -> `/articles/*` 404s suggest a broader URL-stability problem. Worth a vault audit of inbound links from external sources (X posts, newsletter shares) to see how many other dead URLs are circulating.
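An illustrative fragment for that Content-Signal follow-up, assuming the three-signal vocabulary (search / ai-input / ai-train) used by the Content Signals convention that Cloudflare’s managed file already follows; syntax and per-bot override behavior should be confirmed before relying on it.

```txt
# Wildcard: indexable, usable for grounding/RAG, not for training
User-agent: *
Content-Signal: search=yes, ai-input=yes, ai-train=no
Allow: /

# Hypothetical stricter per-bot override: no AI use at all for Bytespider
User-agent: Bytespider
Content-Signal: search=no, ai-input=no, ai-train=no
Disallow: /
```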
Sources
- Empirical: Cloudflare GraphQL Analytics API, zone `125cd3e614d6fd3c369887fb3081bfd5` (raydata.co), `httpRequestsAdaptiveGroups` filtered by `userAgent_like` per AI bot UA, queried 2026-05-09 ~01:35 EDT.
- Empirical: Cloudflare Bot Management settings endpoint `/zones/<id>/bot_management`, 2026-05-09.
- Empirical: `curl https://www.raydata.co/robots.txt`, 2026-05-09.
- Cloudflare — Block AI bots docs
- Cloudflare — managed robots.txt setting
- Cloudflare blog — Control content use for AI training
- Cloudflare blog — Declare your AIndependence (one-click block)
- Cloudflare AI Crawl Control overview
- OpenAI — Overview of OpenAI Crawlers
- Am I Cited — GPTBot vs OAI-SearchBot
- Momentic — Top AI Search Crawlers + User Agents
- Longato — llms.txt 2025 audit
- Apr 22 GEO state-of-discipline brief
- Apr 22 Publishing for Agents spec
- Brand Architecture