How AI bots crawl your site: a robots.txt guide for GPTBot, ClaudeBot, and PerplexityBot

April 13, 2026 · ai-visibility · 10 min read

Most brands' robots.txt files are three years behind the current AI crawler landscape. They have a GPTBot rule, maybe, and nothing else. In 2026, the modern robots.txt needs rules for a dozen AI-specific bots across OpenAI, Anthropic, Perplexity, Google, Common Crawl, and ByteDance, with explicit policies for the ones that ignore the file entirely. This post walks through every bot, who owns it, whether it honors robots.txt, the exact user-agent string, and the rule you need. At the end there is a full working example you can copy, plus the server-level rules for bots that cannot be blocked any other way.

The AI bot landscape at a glance

| Bot | Owner | Purpose | Respects robots.txt |
|---|---|---|---|
| GPTBot | OpenAI | Training data | Yes |
| OAI-SearchBot | OpenAI | ChatGPT Search | Yes |
| ChatGPT-User | OpenAI | User-initiated fetch | Yes |
| ClaudeBot | Anthropic | Training data | Yes |
| Claude-User | Anthropic | User-initiated fetch | Yes |
| Claude-SearchBot | Anthropic | In-product search index | Yes |
| PerplexityBot | Perplexity | Indexing for answers | Yes, with stealth crawler caveat |
| Perplexity-User | Perplexity | User-facing fetch | Claims "agent, not bot" |
| Google-Extended | Google | Training Gemini | Yes; does not affect Search ranking |
| CCBot | Common Crawl | Public corpus | Yes |
| Bytespider | ByteDance | Training Doubao | Documented non-compliance |

Why AI bots are a separate conversation from search bots

Classical search crawlers like Googlebot and Bingbot exist to index pages for ten-blue-links search. Their behavior is well-documented, their IP ranges are published, and they honor robots.txt reliably. AI crawlers are messier. Some are training-only and do not feed live search. Some are user-initiated and run on demand from a chat session. Some rotate user-agents to hide from the bots they claim they are not. Blocking them incorrectly costs you AI visibility without you noticing. Allowing them incorrectly lets your content train a competitor's model. Getting the rules right is the entire point of this post.

OpenAI bots: GPTBot, OAI-SearchBot, ChatGPT-User

OpenAI operates three distinct crawlers, all documented at platform.openai.com/docs/bots. Each has a specific job and a specific user-agent.

GPTBot/1.3 is the training crawler. It collects pages that feed future model pre-training. Blocking it means your content is less likely to appear in future GPT versions' baseline knowledge. Full user-agent: Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; GPTBot/1.3; +https://openai.com/gptbot). IP ranges are published at openai.com/gptbot.json. Honors robots.txt.

OAI-SearchBot/1.0 is the ChatGPT Search crawler. It feeds the real-time search feature launched on October 31, 2024. This is the bot you want crawling your site if you care about ChatGPT citations, because blocking it removes you from ChatGPT Search results. IP ranges at openai.com/searchbot.json. Honors robots.txt.

ChatGPT-User/2.0 is the user-initiated fetcher. It runs on demand when a ChatGPT user asks about a specific URL or opens a browsing tool mid-conversation. It is fundamentally different from the other two because there is no persistent crawl. Every request is tied to a specific user action. IP ranges at openai.com/chatgpt-user.json. Honors robots.txt.
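Because user-agent strings are trivially spoofed, the published IP range files are the reliable way to confirm a request really came from one of these bots. Here is a minimal sketch in Python using only the standard library; the CIDR ranges below are placeholders for illustration, not OpenAI's actual list, which you should fetch from the JSON endpoints above:

```python
import ipaddress

# Placeholder ranges -- replace with the current list from openai.com/gptbot.json
GPTBOT_RANGES = ["52.230.152.0/24", "20.171.206.0/24"]

def is_gptbot_ip(ip: str, ranges=GPTBOT_RANGES) -> bool:
    """Return True if the source IP falls inside a published crawler range."""
    addr = ipaddress.ip_address(ip)
    return any(addr in ipaddress.ip_network(cidr) for cidr in ranges)

print(is_gptbot_ip("52.230.152.10"))  # True: inside a listed range
print(is_gptbot_ip("203.0.113.7"))    # False: claims to be GPTBot, is not
```

The same check works for OAI-SearchBot and ChatGPT-User by swapping in their respective range files.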

The practical implication is that you almost always want all three allowed. Blocking GPTBot and keeping OAI-SearchBot is a defensible choice if you want to prevent training use while still appearing in ChatGPT Search. Blocking ChatGPT-User is almost never the right call, because it breaks a user-initiated fetch that the user explicitly asked for.
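The middle position described above, blocked for training but visible in search, looks like this in robots.txt:

```
# Opt out of training, stay in ChatGPT Search
User-agent: GPTBot
Disallow: /

User-agent: OAI-SearchBot
Allow: /

User-agent: ChatGPT-User
Allow: /
```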

Anthropic bots: ClaudeBot, Claude-User, Claude-SearchBot

Anthropic's crawler documentation lives in the Claude privacy support docs. Anthropic publicly commits to honoring robots.txt across all its crawlers and not bypassing access controls. The lineup mirrors OpenAI's.

ClaudeBot is the training crawler. It feeds Anthropic's model pre-training corpus. Honors robots.txt.

Claude-User is the user-initiated fetcher. Runs when a Claude user asks the model to fetch or analyze a specific page. Honors robots.txt.

Claude-SearchBot is the in-product search crawler that supports Claude's web search tool. Honors robots.txt.

Two deprecated user-agents still show up in old robots.txt examples: Claude-Web and anthropic-ai. Anthropic no longer uses either, but if your robots.txt already has rules for them there is no harm in leaving the rules in place as a compatibility layer.
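If your existing file already carries rules for the deprecated agents, they look something like this and can safely stay:

```
# Deprecated Anthropic user-agents -- no longer crawled, harmless to keep
User-agent: Claude-Web
Disallow: /

User-agent: anthropic-ai
Disallow: /
```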

Perplexity bots: PerplexityBot, Perplexity-User, and the stealth crawler problem

Perplexity documents two crawlers: PerplexityBot (a traditional crawler for the search index) and Perplexity-User (a user-initiated fetcher). Perplexity has stated that Perplexity-User "is an agent, not a bot" and therefore is not required to honor robots.txt, which is a position that has caused real disputes with publishers and Cloudflare.

The bigger problem is that Perplexity has been caught running crawlers that are not declared in either of those user-agents. On August 4, 2025, Cloudflare published a detailed report showing Perplexity using undeclared crawlers that rotate user-agents, IPs, and ASNs to evade no-crawl directives. Cloudflare's conclusion was blunt: robots.txt rules are not a reliable defense against Perplexity if Perplexity does not want to respect them. Blocking at the server or WAF level is the only real control.

Our default is to allow PerplexityBot in robots.txt for brands that want AI visibility, and to keep server-level WAF rules ready in case of abuse. Blocking at both layers is appropriate for brands with sensitive content that they do not want appearing in Perplexity answers at all.

Google-Extended: the Gemini training opt-out

Google introduced Google-Extended on September 28, 2023. It is the user-agent token you use to control whether Google can use your content to train Gemini and other generative models. Critically, blocking Google-Extended does NOT affect your traditional Google Search ranking, and that is the one thing most brands get wrong about this bot. Googlebot is the crawler that reads your site for Search. Google-Extended is a separate control token that governs training data. The two are independent.

This matters because blocking Google-Extended is a rare case where you can opt out of AI training without paying a search-ranking penalty. If you have legal or IP concerns about Gemini training on your content, add a disallow rule for Google-Extended and leave Googlebot untouched. Your Search rankings will not change.
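The opt-out is a two-line addition. Note that Googlebot needs no rule at all; its Search crawling is unaffected:

```
# Opt out of Gemini training -- Google Search crawling is unaffected
User-agent: Google-Extended
Disallow: /
```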

CCBot: Common Crawl, the indirect training source

CCBot is the Common Crawl crawler. Common Crawl is a nonprofit that operates a public web archive, and its data has been used to train most of the large LLMs including GPT-3, LLaMA, and many open-source models. Blocking CCBot is an indirect way of opting out of training data for a wide range of models at once. CCBot honors robots.txt and supports the Crawl-delay directive.

Whether to block CCBot is a strategy call. Allowing it means your content can end up in any model trained on Common Crawl. Blocking it does not prevent OpenAI's GPTBot or Anthropic's ClaudeBot from crawling separately, because those are independent pipelines. If you want to opt out of training broadly, block CCBot along with GPTBot, Google-Extended, and ClaudeBot as a set.
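The broad training opt-out described above, treated as a set, looks like this:

```
# Broad training opt-out: block the major training pipelines together
User-agent: CCBot
Disallow: /

User-agent: GPTBot
Disallow: /

User-agent: Google-Extended
Disallow: /

User-agent: ClaudeBot
Disallow: /
```

If you choose to allow CCBot instead, it also honors `Crawl-delay: 10` (or another value in seconds) to throttle its request rate.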

Bytespider: the non-compliant bot you should block at the server

Bytespider is ByteDance's crawler, used to train the Doubao LLM. It has a long documented history of non-compliance with robots.txt. HAProxy reported in 2024 that nearly 90 percent of AI crawler traffic across their customer base came from Bytespider alone, much of it ignoring disallow rules. If you are going to block one bot in this entire list, Bytespider is the one.

Because Bytespider ignores robots.txt, you cannot rely on the file alone. A disallow rule is a first line of defense but not the whole defense. Server-level rules are required, which we cover below.

The complete sample robots.txt

Here is the working starting point we ship for brands that want AI visibility on OpenAI, Anthropic, Perplexity, and Google while blocking the non-compliant bots.

# Allow OpenAI
User-agent: GPTBot
Allow: /
User-agent: OAI-SearchBot
Allow: /
User-agent: ChatGPT-User
Allow: /

# Allow Anthropic
User-agent: ClaudeBot
Allow: /
User-agent: Claude-User
Allow: /
User-agent: Claude-SearchBot
Allow: /

# Allow Perplexity
User-agent: PerplexityBot
Allow: /

# Allow Google training
User-agent: Google-Extended
Allow: /

# Block Common Crawl (optional)
User-agent: CCBot
Disallow: /

# Block Bytespider (recommended due to non-compliance history)
User-agent: Bytespider
Disallow: /

Copy this into your /robots.txt, adjust the allow and disallow decisions based on your policy, and ship it.

Server-level blocking for bots that ignore robots.txt

Robots.txt is a polite request. Bytespider and Perplexity's stealth crawlers have both been documented ignoring it. For bots that do not comply, the defense is at the server layer.

On Cloudflare, the built-in Bot Management rules include an "AI Scrapers and Crawlers" category that covers the common non-compliant agents. Enable it and choose Block or Challenge for the category. On nginx, a deny rule matching user-agent strings handles bots like Bytespider that identify themselves honestly but ignore robots.txt. For brands without Cloudflare or an nginx front door, a WAF product such as AWS WAF applied at the CDN layer is the only realistic option.
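An nginx deny rule can be sketched with the `map` module, which matches the request's user-agent once per request. This is an illustrative fragment, not a complete config; `example.com` is a placeholder, and the `map` block belongs in the `http` context:

```nginx
# Flag self-identified non-compliant AI crawlers (http context)
map $http_user_agent $block_ai_bot {
    default           0;
    ~*Bytespider      1;
    ~*PerplexityBot   1;   # include only if you also block it in robots.txt
}

server {
    listen 80;
    server_name example.com;   # placeholder

    # Refuse flagged crawlers before any content is served
    if ($block_ai_bot) {
        return 403;
    }

    # ... rest of your server config
}
```

This stops only bots that send a truthful user-agent; stealth crawlers that rotate agents and IPs still require WAF-level or verified-bot controls.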

The server-level rules are cheap to add and expensive to forget. We include them in every technical audit we run, because most brands ship robots.txt without realizing their biggest crawler problem is a bot that does not read the file at all.

The block-everything vs allow-everything debate

There are two legitimate default positions.

Block everything is the right call for publishers, media companies, and brands with proprietary content they do not want used as training data. It prioritizes IP protection over AI visibility. Major publishers including The New York Times have moved in this direction. The trade-off is that blocking AI bots also removes you from ChatGPT Search, Claude's web search, and Perplexity's answers, which is a direct AI visibility cost.
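For the block-everything posture, robots.txt groups allow stacked User-agent lines sharing one rule, so the whole list collapses into a single block:

```
# Block-everything posture: opt out of all AI crawling
User-agent: GPTBot
User-agent: OAI-SearchBot
User-agent: ChatGPT-User
User-agent: ClaudeBot
User-agent: Claude-User
User-agent: Claude-SearchBot
User-agent: PerplexityBot
User-agent: Google-Extended
User-agent: CCBot
User-agent: Bytespider
Disallow: /
```

Remember that this only binds the compliant bots; Bytespider and stealth crawlers still need the server-level rules above.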

Allow everything is the right call for marketing sites, SaaS brands, and anyone whose business model depends on being cited in AI answers. It prioritizes visibility over training control. Most of our client work lands here, with a few exceptions. The trade-off is that you lose control over how your content is used in training, which for some industries is a real concern and for others is a non-issue.

The honest answer is to make the call deliberately, not by accident. Most of the broken robots.txt files we audit are broken because nobody made the decision, not because the decision was wrong. For the audit procedure, read how to audit whether your site is crawlable by AI bots.

Conclusion

Robots.txt in 2026 is not a file you wrote once and forgot. It is a living policy that has to cover a dozen bots across six organizations, with server-level fallbacks for the ones that ignore the file. The sample config above is a working starting point. The decision about which bots to allow and which to block is the strategic part, and it belongs to the brand, not to the default plugin that shipped with your CMS. Make the call on purpose.

How Soar saves you time and money

Most brands' robots.txt files are three years behind the current AI bot landscape. We audit and update robots.txt as a standard week-one deliverable in every engagement, covering every bot listed in this post plus the server-level rules for the non-compliant ones. The audit alone typically catches one or two accidentally blocked bots that were costing the brand AI visibility without anyone knowing. Fixing them is a five-minute change with an immediate payoff, because the next time ChatGPT or Claude crawls the site, the content starts showing up in answers it could not reach before.

The bigger saving is avoiding the self-inflicted wounds. We have seen brands block GPTBot by copying an outdated example from a 2023 blog post, then wonder why they never appear in ChatGPT Search. We have seen brands leave Bytespider unblocked while it hammers their origin server. Both failures cost real money and both are avoidable with a 30-minute audit. If you want a week-one robots.txt audit as part of a broader AI visibility engagement, request a proposal and we will include it in the scope.
