How to audit whether your site is crawlable by AI bots

April 13, 2026 · ai-visibility · 6 min read

Most teams assume their site is crawlable by AI bots; many are wrong. This is the most common technical issue we find at the start of a new engagement, and it is almost always invisible until someone checks. A Cloudflare rule got turned on a year ago and nobody documented it. Rate limits throttle slow-fetching bots. robots.txt still carries rules from 2023. JavaScript rendering hides content. CDN caching prevents recrawling. Any one of those silently tanks AI visibility. Together they make a site invisible to ChatGPT, Claude, and Perplexity no matter how good the content is. Below is the 2-hour audit we run on every new client.

Why this is the first thing to check

If the bots cannot reach your pages, nothing else matters. Content, schema, llms.txt files, Reddit seeding, PR placements: all of it presumes a crawler can fetch the page. We routinely find clients who spent six months on content investment, saw no movement in AI citations, and discovered on audit that Cloudflare had been serving GPTBot a 403 Forbidden the entire time. Running this audit on day one catches these issues before any content investment goes to waste.

Common blockers we find

Five blockers account for almost every failure:

  1. Cloudflare's "Block AI Scrapers and Crawlers" toggle. Cloudflare made this a one-click feature, and because enabling it does not affect Google Search ranking, the button looks safe. But it blocks GPTBot, ClaudeBot, PerplexityBot, and others with a single setting.
  2. Server rate limits that throttle slow-fetching bots.
  3. Outdated robots.txt rules. A rule written in 2023 to block GPTBot may still be in production.
  4. JavaScript rendering: crawlers that do not run JS see an empty shell.
  5. CDN caching headers that force stale content or prevent recrawling.

Each of these is easy to fix once you know it exists. None of them are visible from the outside without a specific test.

Step 1: Check robots.txt against the current AI bot list

Pull your robots.txt and compare it against the current AI bot list. At minimum you should have explicit rules for GPTBot, OAI-SearchBot, ChatGPT-User, ClaudeBot, Claude-User, Claude-SearchBot, PerplexityBot, Perplexity-User, Google-Extended, CCBot, and Bytespider. The full list with recommended defaults is in the AI bots robots.txt guide. The two most common mistakes: a blanket Disallow rule for User-agent: * that accidentally catches new AI bots, and contradictory rules where one section allows GPTBot and another blocks it. Clean up both before moving on.
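Python's standard-library robots.txt parser can run this comparison mechanically. A minimal sketch: the bot list mirrors the one above, `audit_robots` is our own helper name, and the sample robots.txt is a fabricated example of the blanket-disallow mistake:

```python
from urllib.robotparser import RobotFileParser

# The AI bots worth explicit rules, per the list above.
AI_BOTS = [
    "GPTBot", "OAI-SearchBot", "ChatGPT-User", "ClaudeBot", "Claude-User",
    "Claude-SearchBot", "PerplexityBot", "Perplexity-User",
    "Google-Extended", "CCBot", "Bytespider",
]

def audit_robots(robots_txt: str, test_url: str) -> dict:
    """Return {bot_name: allowed} for each AI bot against a robots.txt body."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return {bot: parser.can_fetch(bot, test_url) for bot in AI_BOTS}

# Hypothetical robots.txt: a blanket disallow plus an explicit GPTBot block.
sample = """\
User-agent: *
Disallow: /

User-agent: GPTBot
Disallow: /
"""
results = audit_robots(sample, "https://example.com/blog/post")
blocked = [bot for bot, allowed in results.items() if not allowed]
```

Here `blocked` comes back with every bot on the list: the bots with no explicit group of their own fall through to the blanket `Disallow: /`, which is exactly the accidental-catch failure described above.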

Step 2: Test server response to bot user agents

Use curl to request your homepage and three deep content pages with each AI bot user agent. Test at least GPTBot, ClaudeBot, and PerplexityBot. You are looking for a 200 OK with the full HTML body. A 403 Forbidden means something upstream is blocking the bot. A 429 means rate limiting is throttling it. A 200 with empty body means the server is responding but JavaScript rendering is hiding the content. If any bot comes back non-200, find the source. It is usually one of the next three steps.
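The same check can be scripted. The sketch below uses simplified user-agent tokens (real bot UAs are longer, Mozilla-prefixed strings); `diagnose` and `fetch_as` are illustrative helper names, and the status-to-finding mapping follows the paragraph above:

```python
import urllib.request
from urllib.error import HTTPError

def diagnose(status: int, body: str) -> str:
    """Map a (status code, body) pair to the audit finding it indicates."""
    if status == 403:
        return "blocked upstream (WAF or bot rule)"
    if status == 429:
        return "rate limited"
    if status == 200 and not body.strip():
        return "empty body: likely JS-rendered shell"
    if status == 200:
        return "ok"
    return f"unexpected status {status}"

def fetch_as(url: str, user_agent: str):
    """Fetch a URL while presenting a bot user agent (makes a network call)."""
    req = urllib.request.Request(url, headers={"User-Agent": user_agent})
    try:
        with urllib.request.urlopen(req, timeout=10) as resp:
            return resp.status, resp.read().decode("utf-8", "replace")
    except HTTPError as err:
        return err.code, ""

# e.g. for bot in ("GPTBot", "ClaudeBot", "PerplexityBot"):
#          status, body = fetch_as("https://example.com/", bot)
#          print(bot, diagnose(status, body))
```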

Step 3: Check Cloudflare or WAF rules

Log into Cloudflare (or your WAF) and check bot settings. Look for the "Block AI Scrapers and Crawlers" toggle, custom WAF rules matching AI bot user agents, and firewall rules blocking IP ranges. OpenAI publishes GPTBot IP ranges at openai.com/gptbot.json. Anthropic publishes crawler IP ranges at claude.com/crawling/bots.json. Cross-reference these against your allow and block lists. Perplexity is a special case. In August 2025 Cloudflare published a report on Perplexity stealth crawlers that rotate user agent, IP, and ASN to bypass robots.txt. Your choice is to block it explicitly or allow it knowing Perplexity's retrieval depends on it. Make the choice deliberately.
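Cross-referencing an address against published crawler ranges is a short job with Python's `ipaddress` module. The CIDR blocks below are documentation-range placeholders, not OpenAI's real ranges; pull the actual lists from the JSON endpoints above:

```python
import ipaddress

def ip_in_ranges(ip: str, cidrs: list) -> bool:
    """True if `ip` falls inside any of the given CIDR blocks."""
    addr = ipaddress.ip_address(ip)
    return any(addr in ipaddress.ip_network(cidr) for cidr in cidrs)

# Placeholder ranges for illustration only.
gptbot_ranges = ["203.0.113.0/24", "198.51.100.0/24"]

hit = ip_in_ranges("203.0.113.7", gptbot_ranges)   # inside the first block
miss = ip_in_ranges("192.0.2.1", gptbot_ranges)    # outside both blocks
```

Run the same check against the IPs in your firewall block list to catch rules that quietly cover a crawler's published range.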

Step 4: Verify CDN cache behavior

CDN caching is the subtle one. If your CDN serves cached responses for a very long TTL, bots get stale content and the index never updates. Check cache-control headers on your most important pages. For blog posts, a TTL of a few hours is usually fine. Avoid serving month-old cached responses to every bot, because indexing engines interpret that as "nothing has changed" and deprioritize. Also check the Vary header: if your CDN serves different content based on User-Agent without the correct Vary header, crawlers can receive mismatched cached responses.
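A rough way to script this check, assuming you already have the response headers in hand. The six-hour threshold and the helper name are illustrative choices, not official limits:

```python
def audit_cache_headers(headers: dict, varies_by_ua: bool = False,
                        max_ttl: int = 6 * 3600) -> list:
    """Flag cache settings that can starve recrawls.
    The six-hour default threshold is an illustrative blog-post TTL."""
    findings = []
    cache_control = headers.get("Cache-Control", "").lower()
    for directive in (d.strip() for d in cache_control.split(",") if d.strip()):
        if directive.startswith(("max-age=", "s-maxage=")):
            ttl = int(directive.split("=", 1)[1])
            if ttl > max_ttl:
                findings.append(f"{directive} caches longer than {max_ttl}s")
    # Vary only matters when the origin actually serves UA-dependent bodies.
    vary = {v.strip().lower() for v in headers.get("Vary", "").split(",") if v.strip()}
    if varies_by_ua and "user-agent" not in vary:
        findings.append("UA-dependent responses without Vary: User-Agent")
    return findings

stale = audit_cache_headers({"Cache-Control": "public, max-age=2592000"})
fresh = audit_cache_headers({"Cache-Control": "public, max-age=3600",
                             "Vary": "Accept-Encoding, User-Agent"},
                            varies_by_ua=True)
```

A 30-day `max-age` gets flagged; a few-hour TTL with a correct `Vary` header passes clean.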

Step 5: Check JavaScript-rendered pages

For any page that depends on JavaScript to render its main content, test whether the HTML returned to a non-JS crawler contains actual content or an empty shell. Google's Rich Results Test shows what Googlebot sees. For other bots, curl the page with a bot user agent and inspect the raw HTML. If content is only available after JS executes, most AI crawlers will not see it. The fix is server-side rendering, pre-rendering, or inlining critical content into the initial HTML payload. JavaScript rendering is the blocker that takes the longest to fix because it requires engineering work, but it is also the blocker that causes the biggest invisible drop in AI visibility.
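A crude no-JS content check can be scripted with the standard-library HTML parser: strip tags, skip script and style contents, and see whether meaningful text remains. The 200-character threshold is an arbitrary illustration, and the sample pages are fabricated:

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collect visible text, skipping <script> and <style> contents."""
    def __init__(self):
        super().__init__()
        self.chunks = []
        self._skip_depth = 0
    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1
    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1
    def handle_data(self, data):
        if not self._skip_depth and data.strip():
            self.chunks.append(data.strip())

def looks_like_empty_shell(html: str, min_chars: int = 200) -> bool:
    """Heuristic: a page whose no-JS HTML carries almost no visible
    text is probably rendered client-side."""
    extractor = TextExtractor()
    extractor.feed(html)
    return len(" ".join(extractor.chunks)) < min_chars

shell = '<html><body><div id="root"></div><script src="app.js"></script></body></html>'
article = "<html><body><article>" + "Real paragraph text. " * 20 + "</article></body></html>"
shell_flag = looks_like_empty_shell(shell)      # empty without JS
article_flag = looks_like_empty_shell(article)  # real content in raw HTML
```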

Bytespider and interpreting unexpected bot traffic

HAProxy published a 2024 analysis showing nearly 90 percent of AI crawler traffic came from Bytespider, the ByteDance (TikTok) crawler. If your logs show unexpectedly heavy AI crawler load, Bytespider is usually the culprit. It is aggressive and many operators block it. CCBot (Common Crawl) respects robots.txt and supports Crawl-delay. Google-Extended is not a separate crawler but a robots.txt control token: it governs whether Google uses your content for AI training, and blocking it does not affect classic Google Search ranking. Read your logs. Most operators never do, and in about 10 minutes they will tell you which bots are visiting, which are blocked, and which are getting through.
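A sketch of that 10-minute log pass, assuming combined-format access logs; the sample lines are fabricated and the bot pattern covers only the names discussed above:

```python
import re
from collections import Counter

AI_BOT_PATTERN = re.compile(
    r"(GPTBot|OAI-SearchBot|ChatGPT-User|ClaudeBot|Claude-User|"
    r"PerplexityBot|Google-Extended|CCBot|Bytespider)", re.I)

def summarize_bot_traffic(log_lines) -> Counter:
    """Count (bot, status) pairs from combined-format access log lines."""
    counts = Counter()
    for line in log_lines:
        bot_match = AI_BOT_PATTERN.search(line)
        if not bot_match:
            continue
        # The status code follows the closing quote of the request field.
        status_match = re.search(r'"\s(\d{3})\s', line)
        status = status_match.group(1) if status_match else "?"
        counts[(bot_match.group(1), status)] += 1
    return counts

# Fabricated sample lines for illustration.
logs = [
    '1.2.3.4 - - [01/Jan/2026:00:00:00 +0000] "GET / HTTP/1.1" 200 5120 "-" "Mozilla/5.0 (compatible; GPTBot/1.1)"',
    '1.2.3.5 - - [01/Jan/2026:00:00:01 +0000] "GET /blog HTTP/1.1" 403 0 "-" "Mozilla/5.0 (compatible; ClaudeBot/1.0)"',
    '1.2.3.6 - - [01/Jan/2026:00:00:02 +0000] "GET / HTTP/1.1" 200 4096 "-" "Bytespider"',
]
traffic = summarize_bot_traffic(logs)
```

A 403 count next to a bot name is the smoking gun: the bot is visiting, and something on your side is turning it away.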

What to fix first

Fix in this order:

  1. Lift any Cloudflare or WAF blocks on GPTBot, ClaudeBot, and PerplexityBot.
  2. Update robots.txt to match the current bot list with your chosen stance.
  3. Fix any 403s or 429s on bot user agents.
  4. Verify CDN cache headers are sensible.
  5. Fix JavaScript rendering on your most important pages.

The first two steps alone catch more than half of real-world crawlability problems.

Conclusion

A crawlability audit is the least glamorous work in AI visibility and the highest leverage. The audit takes two hours and finds issues that have been silently killing AI citations for months. The fixes are usually small: a toggle in Cloudflare, a line in robots.txt, a server-side rendering patch. What the audit buys you is permission for every other GEO investment to pay off.

How Soar saves you time and money

A crawlability audit is the first two hours of every engagement we run. The number of clients who discover they have been accidentally blocking GPTBot or ClaudeBot in production is surprisingly high, usually because a Cloudflare WAF rule got turned on without documentation. A 2-hour audit finds these issues and fixes the easy wins before any content work starts. It saves months of wondering why AI visibility is not improving. We have run this audit hundreds of times and know the common failure modes on AWS, Vercel, Cloudflare, and Fastly, which makes our diagnostic pass much faster than an in-house team running it for the first time.

The practical outcome for most clients is that we find one to three blockers in the first audit, fix them in week one, and the AI visibility metric starts moving before any content work has started. It is the cheapest and fastest intervention we offer. Request a proposal and we will run the audit on your site as the first deliverable.
