
raydata co bot crawler logs

Fri May 08 2026 20:00:00 GMT-0400 (Eastern Daylight Time) · research-brief · source: deep-research
geo · ai-crawlers · raydata-co · cloudflare-analytics · robots-txt · gptbot · claudebot · perplexitybot · oai-searchbot · chatgpt-user

raydata.co AI crawler status — empirical brief

The question

What do GPTBot, ClaudeBot, and PerplexityBot logs actually show for raydata.co — is the site being crawled, at what frequency, which pages are being indexed, and are there any robots.txt or Cloudflare WAF rules blocking AI crawlers?

Headline finding

raydata.co is actively blocking AI crawlers at two layers — Cloudflare’s managed robots.txt, which serves Disallow: / to nine AI bot user-agents, AND a zone-level ai_bots_protection: "block" that 403s ClaudeBot at the WAF edge. GPTBot, PerplexityBot, and Google-Extended show zero hits in the last 7 days; CCBot managed two. ClaudeBot and OAI-SearchBot probe /robots.txt daily, see Disallow: /, and walk away (typically 16-26 hits/day for ClaudeBot, ~10/day for OAI-SearchBot). The only AI crawler reaching content pages is ChatGPT-User — the user-initiated live-retrieval bot (13 hits over 7 days) — which is a useful visibility signal, but five of its thirteen requests were 404s because old /p/... URLs have been replaced with /articles/... paths and ChatGPT still has the dead URLs cached.

GEO is not unblocked. Step 0 fails. None of the Princeton-validated techniques can deliver lift while the site is telling AI training crawlers to stay out via robots.txt and returning 403s to AI crawlers at the WAF.

What we already know (from the vault)

What the web says

Empirical findings — raydata.co specifically

All data pulled 2026-05-09 ~01:35 EDT via Cloudflare GraphQL Analytics API, zone 125cd3e614d6fd3c369887fb3081bfd5 (raydata.co), windowed to 2026-05-02 00:00 UTC through 2026-05-09 05:00 UTC.
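For reproducibility, here is a sketch of the kind of pull this was. Hedged: it assumes the httpRequestsAdaptiveGroups dataset (available on the Free plan, with adaptive sampling) and a token with Analytics:Read; exact dimension names can vary by schema version, so verify against the live GraphQL schema before relying on it.

```ts
// Hedged sketch of the Cloudflare GraphQL Analytics pull, not the exact query used.
// Run as an ESM module on Node 18+ (global fetch, top-level await).
const ZONE_TAG = '125cd3e614d6fd3c369887fb3081bfd5'; // raydata.co, per this brief

const query = `
  query BotHits($zoneTag: string!, $start: Time!, $end: Time!) {
    viewer {
      zones(filter: { zoneTag: $zoneTag }) {
        httpRequestsAdaptiveGroups(
          limit: 1000
          filter: { datetime_geq: $start, datetime_lt: $end }
        ) {
          count
          dimensions { date userAgent clientRequestPath edgeResponseStatus }
        }
      }
    }
  }`;

const res = await fetch('https://api.cloudflare.com/client/v4/graphql', {
  method: 'POST',
  headers: {
    Authorization: `Bearer ${process.env.CF_API_TOKEN}`,
    'Content-Type': 'application/json',
  },
  body: JSON.stringify({
    query,
    variables: {
      zoneTag: ZONE_TAG,
      start: '2026-05-02T00:00:00Z', // window used in this brief
      end: '2026-05-09T05:00:00Z',
    },
  }),
});
console.log(JSON.stringify(await res.json(), null, 2)); // group by userAgent downstream
```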

Configuration state (zone settings)

ai_bots_protection      = block          # ENFORCING - returns 403 at edge for AI bots
content_bots_protection = disabled
crawler_protection      = disabled
is_robots_txt_managed   = true           # Cloudflare auto-injects Disallow:/ block
cf_robots_variant       = off
fight_mode              = false          # generic Bot Fight Mode is off
plan                    = Free Website
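These flags read like the zone’s Bot Management settings. A sketch for pulling them (assumption: these keys are exposed on the /zones/{zone_id}/bot_management endpoint, which is where field names like ai_bots_protection and is_robots_txt_managed appear; verify the response shape against current Cloudflare docs):

```ts
// Read the zone's bot-management flags (assumes a token with Zone > Bot Management:Read).
// Key names mirror the settings dump above; the endpoint shape is an assumption.
const ZONE_ID = '125cd3e614d6fd3c369887fb3081bfd5'; // raydata.co

const res = await fetch(
  `https://api.cloudflare.com/client/v4/zones/${ZONE_ID}/bot_management`,
  { headers: { Authorization: `Bearer ${process.env.CF_API_TOKEN}` } },
);
const { result } = await res.json();
console.log({
  ai_bots_protection: result.ai_bots_protection,       // "block" in this audit
  is_robots_txt_managed: result.is_robots_txt_managed, // true -> Cloudflare injects robots.txt
  fight_mode: result.fight_mode,
});
```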

Live robots.txt body served at raydata.co/robots.txt

Cloudflare-managed content (200 OK on GET, 1738 bytes). Disallows nine AI crawlers explicitly:

User-agent: *
Content-Signal: search=yes,ai-train=no
Allow: /

User-agent: Amazonbot
Disallow: /

User-agent: Applebot-Extended
Disallow: /

User-agent: Bytespider
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: CloudflareBrowserRenderingCrawler
Disallow: /

User-agent: Google-Extended
Disallow: /

User-agent: GPTBot
Disallow: /

User-agent: meta-externalagent
Disallow: /

Note: PerplexityBot, OAI-SearchBot, ChatGPT-User, Claude-User, and Claude-SearchBot are NOT in the managed disallow list. They get the wildcard User-agent: * rule with Content-Signal: search=yes, ai-train=no, Allow: / — i.e., search indexing is permitted but training is not.

Side note: at the apex https://raydata.co, /robots.txt 308-redirects to https://www.raydata.co/robots.txt. A HEAD to /robots.txt returns a Vercel 404 (the underlying Vercel project has no robots.txt route), but a GET is intercepted by Cloudflare and returns the managed body. The 200 on GET is what crawlers actually see.
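This is trivially checkable with a plain fetch (Node 18+, ESM; expectation per the observation above: HEAD surfaces the Vercel 404, GET surfaces the Cloudflare-managed 200):

```ts
// Compare HEAD vs GET on the managed robots.txt.
for (const method of ['HEAD', 'GET'] as const) {
  const res = await fetch('https://www.raydata.co/robots.txt', { method });
  console.log(method, res.status, res.headers.get('server') ?? '');
}
```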

AI crawler hits per day (last 7 days)

| Date (UTC) | GPTBot | ClaudeBot | PerplexityBot | ChatGPT-User | OAI-SearchBot | Google-Ext | CCBot |
|---|---|---|---|---|---|---|---|
| 2026-05-02 | 0 | 20 | 0 | 0 | 11 | 0 | 0 |
| 2026-05-03 | 0 | 16 | 0 | 0 | 11 | 0 | 0 |
| 2026-05-04 | 0 | 23 | 0 | 0 | 12 | 0 | 0 |
| 2026-05-05 | 0 | 18 | 0 | 8 | 10 | 0 | 0 |
| 2026-05-06 | 0 | 26 | 0 | 4 | 11 | 0 | 1 |
| 2026-05-07 | 0 | 16 | 0 | 0 | 10 | 0 | 1 |
| 2026-05-08 | 0 | 2 | 0 | 1 | 9 | 0 | 0 |
| 7-day total | 0 | 121 | 0 | 13 | 74 | 0 | 2 |

Path-level breakdown

ClaudeBot: 100% of hits in this dataset are /robots.txt (200 OK with the disallow body) — zero content-page fetches. (The WAF-blocked sitemap requests covered below appear to be counted separately.) Anthropic’s crawler is doing its job: it re-reads robots.txt every few hours and respects the Disallow: /.

OAI-SearchBot: 100% of hits are /robots.txt (200 OK). Despite NOT being in the disallow list, OpenAI’s search crawler is also only reading robots.txt and not crawling content — possibly because its general policy is to back off from sites where the broader OpenAI suite (GPTBot) is disallowed, possibly because raydata.co has no inbound links pointing it at content URLs to fetch. Either way: zero content indexing.

ChatGPT-User (live retrievals — the actually interesting signal):

| Date | Path | Status | Count |
|---|---|---|---|
| 2026-05-08 | /articles/analytics-crafting-a-way-forward | 200 | 1 |
| 2026-05-06 | /p/getting-weird-with-squarely | 404 | 4 |
| 2026-05-05 | /p/analytics-antenna-and-good-guy-amazon | 404 | 1 |
| 2026-05-05 | / | 200 | 4 |
| 2026-05-05 | / | 308 | 3 |

ChatGPT users prompted the model 13 times in the last week to fetch raydata.co content. 5 of those 13 requests went to dead URLs (/p/... was the old Substack-style path; live URLs are /articles/...). The ChatGPT memory layer has stale URLs from before the path migration.

WAF 403 enforcement — confirmed at edge:

ClaudeBot is being 403’d when it tries to fetch sitemap files. Daily counts of ClaudeBot 403s on /sitemap.xml + /sitemap-index.xml:

| Date | sitemap.xml 403s | sitemap-index.xml 403s |
|---|---|---|
| 2026-05-02 | 23 | 16 |
| 2026-05-03 | 16 | 15 |
| 2026-05-04 | 27 | 14 |
| 2026-05-05 | 12 | 12 |

This proves ai_bots_protection: "block" is actually returning 403s — not just modifying robots.txt. Even if we removed ClaudeBot from the disallow list, the WAF rule would still block it.

GPTBot, PerplexityBot, Google-Extended: zero requests of any kind in 7 days — not even a robots.txt read in this window — and no attempts on content URLs. PerplexityBot in particular appears not to be crawling raydata.co at all.

Other relevant traffic patterns

The last 24h top-30 user-agents are dominated by:

The site is alive and being hit, just not by AI training/search crawlers.

Convergences and contradictions

Convergences

Contradictions / surprises

Synthesis for RDCO

The GEO bet, as currently framed, is blocked at Step 0. The Apr 22 brief recommended writing 15-25 citation-dense reference pieces to get LLM-cited. None of that work matters while:

  1. Cloudflare’s ai_bots_protection: "block" is firing 403s at ClaudeBot at the WAF.
  2. The managed robots.txt is telling GPTBot, ClaudeBot, CCBot, Google-Extended, Applebot-Extended, Bytespider, Amazonbot, and meta-externalagent to stay off the entire site.
  3. Even the cooperative bots (OAI-SearchBot, ChatGPT-User) are bouncing off either robots.txt or 404s.

The asymmetric upside the Apr 22 brief identified — citation-dense writing helps low-authority domains 3-5x more than high-authority ones — applies to content that LLMs can actually read. We are not in that regime.

Decision needed: opt in or stay out? This is a real tradeoff, not a no-brainer; the concrete options are laid out under move 1 below.

Concrete next moves (recommend, do not auto-execute):

  1. Decide the policy. Founder picks between: (a) keep the current Cloudflare defaults and accept that the GEO bet is dead; (b) opt in to AI search/citation but not training — disable ai_bots_protection and replace the managed robots.txt with a hand-written file that allows search-time crawlers and keeps the Content-Signal: search=yes, ai-train=no directive; (c) opt in fully — allow training too, treating raydata.co content as deliberately given to the commons. Recommend option (b).

  2. Fix the content-routing bug regardless of policy. The 404s on /p/getting-weird-with-squarely and /p/analytics-antenna-and-good-guy-amazon are dropping live ChatGPT users into a hole right now. Add /p/* -> /articles/* 301 redirects in the Vercel Next.js config (sketched after this list). This is reversible, has zero downside, and rescues real user traffic.

  3. If opting in (option b): turn off is_robots_txt_managed, write our own /public/robots.txt with explicit Allow for OAI-SearchBot, ChatGPT-User, PerplexityBot, Claude-User, Claude-SearchBot, Googlebot, Bingbot — and explicit Disallow for GPTBot, CCBot, Bytespider, Amazonbot (a sketch follows this list). Set ai_bots_protection to disabled. Then deploy /llms.txt per the Apr 22 publishing-for-agents spec, and re-run this empirical audit in 30 days to verify content URLs are being fetched.

  4. Decide whether to actually rebuild raydata.co before opting in. Per the Brand Architecture doc, the current site is the older “Mr. Ben / Data, DAGs, and Doodles” personal site awaiting rebuild as the umbrella. Opening the gates BEFORE the rebuild means LLMs train/index on content the founder explicitly considers stale. Sequence-wise: rebuild umbrella -> opt in -> measure.
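Sketches for moves 2 and 3, in order. First the redirect fix. One wrinkle: Next.js emits a 308 when `permanent: true` is set, so the sketch uses `statusCode: 301` to match the 301 called for above. It also assumes the /p/ and /articles/ slugs are identical; if any slug changed during the migration, a per-URL redirect map is needed instead.

```ts
// next.config.ts — minimal sketch of the /p/* -> /articles/* redirect (move 2).
import type { NextConfig } from 'next';

const nextConfig: NextConfig = {
  async redirects() {
    return [
      {
        source: '/p/:slug',             // old Substack-style path
        destination: '/articles/:slug', // current path
        statusCode: 301,                // explicit 301; `permanent: true` would emit a 308
      },
    ];
  },
};

export default nextConfig;
```

And a hand-written /public/robots.txt for option (b). This is a sketch of the policy described in move 3, not a final file: the Content-Signal line mirrors what Cloudflare currently injects, and the Sitemap URL assumes the /sitemap.xml path ClaudeBot was probing is the canonical one.

```
# /public/robots.txt — option (b): allow search-time and user-initiated crawlers,
# disallow training crawlers. Review both lists before shipping.
User-agent: OAI-SearchBot
User-agent: ChatGPT-User
User-agent: PerplexityBot
User-agent: Claude-User
User-agent: Claude-SearchBot
User-agent: Googlebot
User-agent: Bingbot
Allow: /

User-agent: GPTBot
User-agent: CCBot
User-agent: Bytespider
User-agent: Amazonbot
Disallow: /

User-agent: *
Content-Signal: search=yes,ai-train=no
Allow: /

Sitemap: https://www.raydata.co/sitemap.xml
```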

Bottom line: GEO can’t be the next thing the founder ships. The unblocking decision (and ideally the umbrella rebuild) has to come first. Estimated effort to unblock: ~2 hours of config work IF the policy is decided. Estimated effort to rebuild umbrella first: separate scoping exercise — last estimated at hq.raydata.co-class effort (multi-day).

Open follow-ups

Sources