How AI bots crawl your site: a robots.txt guide for GPTBot, ClaudeBot, and PerplexityBot

April 13, 2026 · ai-visibility · 10 min read

Most brands' robots.txt files are three years behind the current AI crawler landscape. They have a GPTBot rule, maybe, and nothing else. In 2026, the modern robots.txt needs rules for a dozen AI-specific bots across OpenAI, Anthropic, Perplexity, Google, Common Crawl, and ByteDance, with explicit policies for the ones that ignore the file entirely. This post walks through every bot, who owns it, whether it honors robots.txt, the exact user-agent string, and the rule you need. At the end there is a full working example you can copy, plus the server-level rules for bots that cannot be blocked any other way.

The AI bot landscape at a glance

| Bot | Owner | Purpose | Respects robots.txt |
|---|---|---|---|
| GPTBot | OpenAI | Training data | Yes |
| OAI-SearchBot | OpenAI | ChatGPT Search | Yes |
| ChatGPT-User | OpenAI | User-initiated fetch | Yes |
| ClaudeBot | Anthropic | Training data | Yes |
| Claude-User | Anthropic | User-initiated fetch | Yes |
| Claude-SearchBot | Anthropic | In-product search index | Yes |
| PerplexityBot | Perplexity | Indexing for answers | Yes, with stealth crawler caveat |
| Perplexity-User | Perplexity | User-facing fetch | Claims "agent, not bot" |
| Google-Extended | Google | Training Gemini | Yes; does not affect Search ranking |
| CCBot | Common Crawl | Public corpus | Yes |
| Bytespider | ByteDance | Training Doubao | Documented non-compliance |

Why AI bots are a separate conversation from search bots

Classical search crawlers like Googlebot and Bingbot exist to index pages for ten-blue-links search. Their behavior is well-documented, their IP ranges are published, and they honor robots.txt reliably. AI crawlers are messier. Some are training-only and do not feed live search. Some are user-initiated and run on demand from a chat session. Some rotate user-agents to hide from the bots they claim they are not. Blocking them incorrectly costs you AI visibility without you noticing. Allowing them incorrectly lets your content train a competitor's model. Getting the rules right is the entire point of this post.

OpenAI bots: GPTBot, OAI-SearchBot, ChatGPT-User

OpenAI operates three distinct crawlers, all documented at platform.openai.com/docs/bots. Each has a specific job and a specific user-agent.

GPTBot/1.3 is the training crawler. It collects pages that feed future model pre-training. Blocking it means your content is less likely to appear in future GPT versions' baseline knowledge. Full user-agent: Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; GPTBot/1.3; +https://openai.com/gptbot). IP ranges are published at openai.com/gptbot.json. Honors robots.txt.

OAI-SearchBot/1.0 is the ChatGPT Search crawler. It feeds the real-time search feature launched on October 31, 2024. This is the bot you want crawling your site if you care about ChatGPT citations, because blocking it removes you from ChatGPT Search results. IP ranges at openai.com/searchbot.json. Honors robots.txt.

ChatGPT-User/2.0 is the user-initiated fetcher. It runs on demand when a ChatGPT user asks about a specific URL or opens a browsing tool mid-conversation. It is fundamentally different from the other two because there is no persistent crawl. Every request is tied to a specific user action. IP ranges at openai.com/chatgpt-user.json. Honors robots.txt.
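Because user-agent strings are trivially spoofed, the published IP range files are the reliable way to confirm a request really came from one of these bots. Here is a minimal sketch in Python using only the standard library; the CIDR ranges below are placeholders for illustration, not OpenAI's actual list, which you should fetch from the JSON endpoints above:

```python
import ipaddress

# Placeholder ranges -- replace with the current list from openai.com/gptbot.json
GPTBOT_RANGES = ["52.230.152.0/24", "20.171.206.0/24"]

def is_gptbot_ip(ip: str, ranges=GPTBOT_RANGES) -> bool:
    """Return True if the source IP falls inside a published crawler range."""
    addr = ipaddress.ip_address(ip)
    return any(addr in ipaddress.ip_network(cidr) for cidr in ranges)

print(is_gptbot_ip("52.230.152.10"))  # True: inside a listed range
print(is_gptbot_ip("203.0.113.7"))    # False: claims to be GPTBot, is not
```

The same check works for OAI-SearchBot and ChatGPT-User by swapping in their respective range files.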

The practical implication is that you almost always want all three allowed. Blocking GPTBot and keeping OAI-SearchBot is a defensible choice if you want to prevent training use while still appearing in ChatGPT Search. Blocking ChatGPT-User is almost never the right call, because it breaks a user-initiated fetch that the user explicitly asked for.
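The middle position described above, blocked for training but visible in search, looks like this in robots.txt:

```
# Opt out of training, stay in ChatGPT Search
User-agent: GPTBot
Disallow: /

User-agent: OAI-SearchBot
Allow: /

User-agent: ChatGPT-User
Allow: /
```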

Anthropic bots: ClaudeBot, Claude-User, Claude-SearchBot

Anthropic's crawler documentation lives in the Claude privacy support docs. Anthropic publicly commits to honoring robots.txt across all its crawlers and not bypassing access controls. The lineup mirrors OpenAI's.

ClaudeBot is the training crawler. It feeds Anthropic's model pre-training corpus. Honors robots.txt.

Claude-User is the user-initiated fetcher. Runs when a Claude user asks the model to fetch or analyze a specific page. Honors robots.txt.

Claude-SearchBot is the in-product search crawler that supports Claude's web search tool. Honors robots.txt.

Two deprecated user-agents still show up in old robots.txt examples: Claude-Web and anthropic-ai. Anthropic no longer uses either, but if your robots.txt already has rules for them there is no harm in leaving the rules in place as a compatibility layer.
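If your existing file already carries rules for the deprecated agents, they look something like this and can safely stay:

```
# Deprecated Anthropic user-agents -- no longer crawled, harmless to keep
User-agent: Claude-Web
Disallow: /

User-agent: anthropic-ai
Disallow: /
```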

Perplexity bots: PerplexityBot, Perplexity-User, and the stealth crawler problem

Perplexity documents two crawlers: PerplexityBot (a traditional crawler for the search index) and Perplexity-User (a user-initiated fetcher). Perplexity has stated that Perplexity-User "is an agent, not a bot" and therefore is not required to honor robots.txt, which is a position that has caused real disputes with publishers and Cloudflare.

The bigger problem is that Perplexity has been caught running crawlers that are not declared in either of those user-agents. On August 4, 2025, Cloudflare published a detailed report showing Perplexity using undeclared crawlers that rotate user-agents, IPs, and ASNs to evade no-crawl directives. Cloudflare's conclusion was blunt: robots.txt rules are not a reliable defense against Perplexity if Perplexity does not want to respect them. Blocking at the server or WAF level is the only real control.

Our default is to allow PerplexityBot in robots.txt for brands that want AI visibility, and to keep server-level WAF rules ready in case of abuse. Blocking at both layers is appropriate for brands with sensitive content that they do not want appearing in Perplexity answers at all.

Google-Extended: the Gemini training opt-out

Google introduced Google-Extended on September 28, 2023. It is the user-agent token you use to control whether Google can use your content to train Gemini and other generative models. Critically, blocking Google-Extended does NOT affect your traditional Google Search ranking, and that is the one thing most brands get wrong about this bot. Googlebot is the crawler that reads your site for Search. Google-Extended is a separate control token that governs training data. The two are independent.

This matters because blocking Google-Extended is a rare case where you can opt out of AI training without paying a search-ranking penalty. If you have legal or IP concerns about Gemini training on your content, add a disallow rule for Google-Extended and leave Googlebot untouched. Your Search rankings will not change.
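The opt-out is a two-line addition. Note that Googlebot needs no rule at all; its Search crawling is unaffected:

```
# Opt out of Gemini training -- Google Search crawling is unaffected
User-agent: Google-Extended
Disallow: /
```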

CCBot: Common Crawl, the indirect training source

CCBot is the Common Crawl crawler. Common Crawl is a nonprofit that operates a public web archive, and its data has been used to train most of the large LLMs including GPT-3, LLaMA, and many open-source models. Blocking CCBot is an indirect way of opting out of training data for a wide range of models at once. CCBot honors robots.txt and supports the Crawl-delay directive.

Whether to block CCBot is a strategy call. Allowing it means your content can end up in any model trained on Common Crawl. Blocking it does not prevent OpenAI's GPTBot or Anthropic's ClaudeBot from crawling separately, because those are independent pipelines. If you want to opt out of training broadly, block CCBot along with GPTBot, Google-Extended, and ClaudeBot as a set.
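The broad training opt-out described above, treated as a set, looks like this:

```
# Broad training opt-out: block the major training pipelines together
User-agent: CCBot
Disallow: /

User-agent: GPTBot
Disallow: /

User-agent: Google-Extended
Disallow: /

User-agent: ClaudeBot
Disallow: /
```

If you choose to allow CCBot instead, it also honors `Crawl-delay: 10` (or another value in seconds) to throttle its request rate.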

Bytespider: the non-compliant bot you should block at the server

Bytespider is ByteDance's crawler, used to train the Doubao LLM. It has a long documented history of non-compliance with robots.txt. HAProxy reported in 2024 that nearly 90 percent of AI crawler traffic across their customer base came from Bytespider alone, much of it ignoring disallow rules. If you are going to block one bot in this entire list, Bytespider is the one.

Because Bytespider ignores robots.txt, you cannot rely on the file alone. A disallow rule is a first line of defense but not the whole defense. Server-level rules are required, which we cover below.

The complete sample robots.txt

Here is the working starting point we ship for brands that want AI visibility on OpenAI, Anthropic, Perplexity, and Google while blocking the non-compliant bots.

# Allow OpenAI
User-agent: GPTBot
Allow: /
User-agent: OAI-SearchBot
Allow: /
User-agent: ChatGPT-User
Allow: /

# Allow Anthropic
User-agent: ClaudeBot
Allow: /
User-agent: Claude-User
Allow: /
User-agent: Claude-SearchBot
Allow: /

# Allow Perplexity
User-agent: PerplexityBot
Allow: /

# Allow Google training
User-agent: Google-Extended
Allow: /

# Block Common Crawl (optional)
User-agent: CCBot
Disallow: /

# Block Bytespider (recommended due to non-compliance history)
User-agent: Bytespider
Disallow: /

Copy this into your /robots.txt, adjust the allow and disallow decisions based on your policy, and ship it.

Server-level blocking for bots that ignore robots.txt

Robots.txt is a polite request. Bytespider and Perplexity's stealth crawlers have both been documented ignoring it. For bots that do not comply, the defense is at the server layer.

On Cloudflare, the built-in Bot Management rules include an "AI Scrapers and Crawlers" category that covers the common non-compliant agents. Enable it and choose Block or Challenge for the category. On nginx, a deny rule matching user-agent strings handles bots like Bytespider that identify themselves honestly but ignore robots.txt. For brands without Cloudflare or an nginx front door, a WAF product such as AWS WAF applied at the CDN layer is the only realistic option.
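An nginx deny rule can be sketched with the `map` module, which matches the request's user-agent once per request. This is an illustrative fragment, not a complete config; `example.com` is a placeholder, and the `map` block belongs in the `http` context:

```nginx
# Flag self-identified non-compliant AI crawlers (http context)
map $http_user_agent $block_ai_bot {
    default           0;
    ~*Bytespider      1;
    ~*PerplexityBot   1;   # include only if you also block it in robots.txt
}

server {
    listen 80;
    server_name example.com;   # placeholder

    # Refuse flagged crawlers before any content is served
    if ($block_ai_bot) {
        return 403;
    }

    # ... rest of your server config
}
```

This stops only bots that send a truthful user-agent; stealth crawlers that rotate agents and IPs still require WAF-level or verified-bot controls.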

The server-level rules are cheap to add and expensive to forget. We include them in every technical audit we run, because most brands ship robots.txt without realizing their biggest crawler problem is a bot that does not read the file at all.

The block-everything vs allow-everything debate

There are two legitimate default positions.

Block everything is the right call for publishers, media companies, and brands with proprietary content they do not want used as training data. It prioritizes IP protection over AI visibility. Major publishers including The New York Times have moved in this direction. The trade-off is that blocking AI bots also removes you from ChatGPT Search, Claude's web search, and Perplexity's answers, which is a direct AI visibility cost.
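For the block-everything posture, robots.txt groups allow stacked User-agent lines sharing one rule, so the whole list collapses into a single block:

```
# Block-everything posture: opt out of all AI crawling
User-agent: GPTBot
User-agent: OAI-SearchBot
User-agent: ChatGPT-User
User-agent: ClaudeBot
User-agent: Claude-User
User-agent: Claude-SearchBot
User-agent: PerplexityBot
User-agent: Google-Extended
User-agent: CCBot
User-agent: Bytespider
Disallow: /
```

Remember that this only binds the compliant bots; Bytespider and stealth crawlers still need the server-level rules above.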

Allow everything is the right call for marketing sites, SaaS brands, and anyone whose business model depends on being cited in AI answers. It prioritizes visibility over training control. Most of our client work lands here, with a few exceptions. The trade-off is that you lose control over how your content is used in training, which for some industries is a real concern and for others is a non-issue.

The honest answer is to make the call deliberately, not by accident. Most of the broken robots.txt files we audit are broken because nobody made the decision, not because the decision was wrong. For the audit procedure, read how to audit whether your site is crawlable by AI bots.

Conclusion

Robots.txt in 2026 is not a file you wrote once and forgot. It is a living policy that has to cover a dozen bots across six organizations, with server-level fallbacks for the ones that ignore the file. The sample config above is a working starting point. The decision about which bots to allow and which to block is the strategic part, and it belongs to the brand, not to the default plugin that shipped with your CMS. Make the call on purpose.

How Soar saves you time and money

Most brands' robots.txt files are three years behind the current AI bot landscape. We audit and update robots.txt as a standard week-one deliverable in every engagement, covering every bot listed in this post plus the server-level rules for the non-compliant ones. The audit alone typically catches one or two accidentally blocked bots that were costing the brand AI visibility without anyone knowing. Fixing them is a five-minute change with an immediate payoff, because the next time ChatGPT or Claude crawls the site, the content starts showing up in answers it could not reach before.

The bigger saving is avoiding the self-inflicted wounds. We have seen brands block GPTBot by copying an outdated example from a 2023 blog post, then wonder why they never appear in ChatGPT Search. We have seen brands leave Bytespider unblocked while it hammers their origin server. Both failures cost real money and both are avoidable with a 30-minute audit. If you want a week-one robots.txt audit as part of a broader AI visibility engagement, request a proposal and we will include it in the scope.
