
How AI bots crawl your site: a robots.txt guide for GPTBot, ClaudeBot, and PerplexityBot

The modern robots.txt file needs rules for a dozen AI-specific bots. We go through each one: who owns it, whether it honors robots.txt, the exact user-agent string, and the rule you need.

Updated April 13, 2026 · 10 min read

On this page

  • The AI bot landscape at a glance
  • Why AI bots are a separate conversation from search bots
  • OpenAI bots: GPTBot, OAI-SearchBot, ChatGPT-User
  • Anthropic bots: ClaudeBot, Claude-User, Claude-SearchBot
  • Perplexity bots: PerplexityBot, Perplexity-User, and the stealth crawler problem
  • Google-Extended: the training-only bot that does not affect Search
  • CCBot: Common Crawl, the indirect training source
  • Bytespider: the non-compliant bot you should block at the server
  • The complete sample robots.txt
  • Server-level blocking for bots that ignore robots.txt
  • The block-everything vs allow-everything debate
  • Conclusion
  • How Soar saves you time and money
  • Related reading

Most brands' robots.txt files are three years behind the current AI crawler landscape. They have a GPTBot rule, maybe, and nothing else. In 2026, the modern robots.txt needs rules for a dozen AI-specific bots across OpenAI, Anthropic, Perplexity, Google, Common Crawl, and ByteDance, with explicit policies for the ones that ignore the file entirely. This post walks through every bot, who owns it, whether it honors robots.txt, the exact user-agent string, and the rule you need. At the end there is a full working example you can copy, plus the server-level rules for bots that cannot be blocked any other way.

The AI bot landscape at a glance

Bot | Owner | Purpose | Respects robots.txt
GPTBot | OpenAI | Training data | Yes
OAI-SearchBot | OpenAI | ChatGPT Search | Yes
ChatGPT-User | OpenAI | User-initiated fetch | Yes
ClaudeBot | Anthropic | Training data | Yes
Claude-User | Anthropic | User-initiated fetch | Yes
Claude-SearchBot | Anthropic | In-product search index | Yes
PerplexityBot | Perplexity | Indexing for answers | Yes, with stealth crawler caveat
Perplexity-User | Perplexity | User-facing fetch | Claims "agent, not bot"
Google-Extended | Google | Training Gemini | Yes; does not affect Search ranking
CCBot | Common Crawl | Public corpus | Yes
Bytespider | ByteDance | Training Doubao | Documented non-compliance

Why AI bots are a separate conversation from search bots

Classical search crawlers like Googlebot and Bingbot exist to index pages for ten-blue-links search. Their behavior is well-documented, their IP ranges are published, and they honor robots.txt reliably. AI crawlers are messier. Some are training-only and do not feed live search. Some are user-initiated and run on demand from a chat session. Some rotate user-agents to hide from the bots they claim they are not. Blocking them incorrectly costs you AI visibility without you noticing. Allowing them incorrectly lets your content train a competitor's model. Getting the rules right is the entire point of this post.

OpenAI bots: GPTBot, OAI-SearchBot, ChatGPT-User

OpenAI operates three distinct crawlers, all documented at platform.openai.com/docs/bots. Each has a specific job and a specific user-agent.

GPTBot/1.3 is the training crawler. It collects pages that feed future model pre-training. Blocking it means your content is less likely to appear in future GPT versions' baseline knowledge. Full user-agent: Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; GPTBot/1.3; +https://openai.com/gptbot). IP ranges are published at openai.com/gptbot.json. Honors robots.txt.

OAI-SearchBot/1.0 is the ChatGPT Search crawler. It feeds the real-time search feature that was launched on October 31, 2024. This is the bot you want crawling your site if you care about ChatGPT citations, because blocking it removes you from ChatGPT Search results. IP ranges at openai.com/searchbot.json. Honors robots.txt.

ChatGPT-User/2.0 is the user-initiated fetcher. It runs on demand when a ChatGPT user asks about a specific URL or opens a browsing tool mid-conversation. It is fundamentally different from the other two because there is no persistent crawl. Every request is tied to a specific user action. IP ranges at openai.com/chatgpt-user.json. Honors robots.txt.

The practical implication is that you almost always want all three allowed. Blocking GPTBot and keeping OAI-SearchBot is a defensible choice if you want to prevent training use while still appearing in ChatGPT Search. Blocking ChatGPT-User is almost never the right call, because it breaks a user-initiated fetch that the user explicitly asked for.
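For teams that take that middle path, the split looks like this in robots.txt (a sketch; carry your own allow and disallow decisions over):

```
# Opt out of GPT training, stay in ChatGPT Search
User-agent: GPTBot
Disallow: /

User-agent: OAI-SearchBot
Allow: /

User-agent: ChatGPT-User
Allow: /
```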

Anthropic bots: ClaudeBot, Claude-User, Claude-SearchBot

Anthropic's crawler documentation lives in the Claude privacy support docs. Anthropic publicly commits to honoring robots.txt across all its crawlers and not bypassing access controls. The lineup mirrors OpenAI's.

ClaudeBot is the training crawler. It feeds Anthropic's model pre-training corpus. Honors robots.txt.

Claude-User is the user-initiated fetcher. Runs when a Claude user asks the model to fetch or analyze a specific page. Honors robots.txt.

Claude-SearchBot is the in-product search crawler that supports Claude's web search tool. Honors robots.txt.

Two deprecated user-agents still show up in old robots.txt examples: Claude-Web and anthropic-ai. Anthropic no longer uses either, but if your robots.txt already has rules for them there is no harm in leaving the rules in place as a compatibility layer.
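If you keep that compatibility layer, the deprecated agents can simply mirror whatever policy ClaudeBot gets; a sketch:

```
# Deprecated Anthropic user-agents, retained for compatibility
User-agent: Claude-Web
Allow: /

User-agent: anthropic-ai
Allow: /
```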

Perplexity bots: PerplexityBot, Perplexity-User, and the stealth crawler problem

Perplexity documents two crawlers: PerplexityBot (a traditional crawler for the search index) and Perplexity-User (a user-initiated fetcher). Perplexity has stated that Perplexity-User "is an agent, not a bot" and therefore is not required to honor robots.txt, a position that has caused real disputes with publishers and with Cloudflare.

The bigger problem is that Perplexity has been caught running crawlers that are not declared in either of those user-agents. On August 4, 2025, Cloudflare published a detailed report showing Perplexity using undeclared crawlers that rotate user-agents, IPs, and ASNs to evade no-crawl directives. Cloudflare's conclusion was blunt: robots.txt rules are not a reliable defense against Perplexity if Perplexity does not want to respect them. Blocking at the server or WAF level is the only real control.

Our default is to allow PerplexityBot in robots.txt for brands that want AI visibility, and to keep server-level WAF rules ready in case of abuse. Blocking at both layers is appropriate for brands with sensitive content that they do not want appearing in Perplexity answers at all.

Google-Extended: the training-only bot that does not affect Search

Google introduced Google-Extended on September 28, 2023. It is the user-agent you use to control whether Google can use your content to train Gemini and other generative models. Critically, blocking Google-Extended does NOT affect your traditional Google Search ranking. That is the one thing most brands get wrong about this bot. Googlebot is the one that reads your site for Search. Google-Extended is the separate one that controls training data. They are distinct.

This matters because blocking Google-Extended is a rare case where you can opt out of AI training without paying a search-ranking penalty. If you have legal or IP concerns about Gemini training on your content, add a disallow rule for Google-Extended and leave Googlebot untouched. Your Search rankings will not change.
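The opt-out is two lines, and it leaves Googlebot (and therefore Search) untouched:

```
# Opt out of Gemini training only; Search ranking is unaffected
User-agent: Google-Extended
Disallow: /
```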

CCBot: Common Crawl, the indirect training source

CCBot is the Common Crawl crawler. Common Crawl is a nonprofit that operates a public web archive, and its data has been used to train most of the large LLMs including GPT-3, LLaMA, and many open-source models. Blocking CCBot is an indirect way of opting out of training data for a wide range of models at once. CCBot honors robots.txt and supports the Crawl-delay directive.

Whether to block CCBot is a strategy call. Allowing it means your content can end up in any model trained on Common Crawl. Blocking it does not prevent OpenAI's GPTBot or Anthropic's ClaudeBot from crawling separately, because those are independent pipelines. If you want to opt out of training broadly, block CCBot along with GPTBot, Google-Extended, and ClaudeBot as a set.
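The broad training opt-out described above is those four bots blocked as a set (a sketch; keep the search-facing bots allowed elsewhere in the file if you still want AI visibility):

```
# Broad training opt-out: block the major training pipelines together
User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: Google-Extended
Disallow: /

User-agent: CCBot
Disallow: /
```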

Bytespider: the non-compliant bot you should block at the server

Bytespider is ByteDance's crawler, used to train the Doubao LLM. It has a long documented history of non-compliance with robots.txt. HAProxy reported in 2024 that nearly 90 percent of AI crawler traffic across their customer base came from Bytespider alone, much of it ignoring disallow rules. If you are going to block one bot in this entire list, Bytespider is the one.

Because Bytespider ignores robots.txt, you cannot rely on the file alone. A disallow rule is a first line of defense but not the whole defense. Server-level rules are required, which we cover below.

The complete sample robots.txt

Here is the working starting point we ship for brands that want AI visibility on OpenAI, Anthropic, Perplexity, and Google while blocking the non-compliant bots.

<code># Allow OpenAI
User-agent: GPTBot
Allow: /
User-agent: OAI-SearchBot
Allow: /
User-agent: ChatGPT-User
Allow: /

# Allow Anthropic
User-agent: ClaudeBot
Allow: /
User-agent: Claude-User
Allow: /
User-agent: Claude-SearchBot
Allow: /

# Allow Perplexity
User-agent: PerplexityBot
Allow: /

# Allow Google training
User-agent: Google-Extended
Allow: /

# Block Common Crawl (optional)
User-agent: CCBot
Disallow: /

# Block Bytespider (recommended due to non-compliance history)
User-agent: Bytespider
Disallow: /
</code>

Copy this into your /robots.txt, adjust the allow and disallow decisions based on your policy, and ship it.
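Before shipping, you can sanity-check the policy with Python's standard-library robots.txt parser. This is an illustrative check against a trimmed version of the sample above; the paths are placeholders:

```python
import urllib.robotparser

# A trimmed version of the sample policy from this post.
SAMPLE = """\
User-agent: GPTBot
Allow: /

User-agent: Bytespider
Disallow: /
"""

parser = urllib.robotparser.RobotFileParser()
parser.parse(SAMPLE.splitlines())

# GPTBot should be allowed everywhere; Bytespider blocked everywhere.
print(parser.can_fetch("GPTBot", "/blog/post"))      # True
print(parser.can_fetch("Bytespider", "/blog/post"))  # False
```

A check like this catches the most common failure mode, which is a disallow rule accidentally matching a bot you meant to allow.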

Server-level blocking for bots that ignore robots.txt

Robots.txt is a polite request. Bytespider and Perplexity's stealth crawlers have both been documented ignoring it. For bots that do not comply, the defense is at the server layer.

On Cloudflare, the built-in Bot Management rules include an "AI Scrapers and Crawlers" category that covers the common non-compliant agents. Enable it and choose Block or Challenge for the category. On nginx, a deny rule in the config that matches user-agent strings handles bots that declare themselves honestly but ignore robots.txt. For brands without Cloudflare or an nginx front door, a WAF product such as AWS WAF applied at the CDN layer is the only realistic option.
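As a sketch of the nginx approach (the variable name and domain below are placeholders, and the user-agent list is deliberately short), a map plus a 403 looks like:

```nginx
# Place the map in the http {} context; it classifies requests by user-agent.
# $block_ai_bot and example.com are placeholder names.
map $http_user_agent $block_ai_bot {
    default        0;
    ~*Bytespider   1;
}

server {
    listen 80;
    server_name example.com;

    # Return 403 to any agent the map flags.
    if ($block_ai_bot) {
        return 403;
    }

    # ... rest of your site config
}
```

The map approach scales better than a chain of if blocks: add one line per agent as new non-compliant bots show up.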

The server-level rules are cheap to add and expensive to forget. We include them in every technical audit we run, because most brands ship robots.txt without realizing their biggest crawler problem is a bot that does not read the file at all.

The block-everything vs allow-everything debate

There are two legitimate default positions.

Block everything is the right call for publishers, media companies, and brands with proprietary content they do not want used as training data. It prioritizes IP protection over AI visibility. Major publishers including The New York Times have moved in this direction. The trade-off is that blocking AI bots also removes you from ChatGPT Search, Claude's web search, and Perplexity's answers, which is a direct AI visibility cost.

Allow everything is the right call for marketing sites, SaaS brands, and anyone whose business model depends on being cited in AI answers. It prioritizes visibility over training control. Most of our client work lands here, with a few exceptions. The trade-off is that you lose control over how your content is used in training, which for some industries is a real concern and for others is a non-issue.

The honest answer is to make the call deliberately, not by accident. Most of the broken robots.txt files we audit are broken because nobody made the decision, not because the decision was wrong. For the audit procedure, read how to audit whether your site is crawlable by AI bots.

Conclusion

Robots.txt in 2026 is not a file you wrote once and forgot. It is a living policy that has to cover roughly a dozen bots across six organizations, with server-level fallbacks for the ones that ignore the file. The sample config above is a working starting point. The decision about which to allow and which to block is the strategic part, and it belongs to the brand, not to the default plugin that shipped with your CMS. Make the call on purpose.

How Soar saves you time and money

Most brands' robots.txt files are three years behind the current AI bot landscape. We audit and update robots.txt as a standard week-one deliverable in every engagement, covering every bot listed in this post plus the server-level rules for the non-compliant ones. The audit alone typically catches one or two accidentally blocked bots that were costing the brand AI visibility without anyone knowing. Fixing them is a five-minute change with an immediate payoff, because the next time ChatGPT or Claude crawls the site, the content starts showing up in answers it could not reach before.

The bigger saving is avoiding the self-inflicted wounds. We have seen brands block GPTBot by copying an outdated example from a 2023 blog post, then wonder why they never appear in ChatGPT Search. We have seen brands leave Bytespider unblocked while it hammers their origin server. Both failures cost real money and both are avoidable with a 30-minute audit. If you want a week-one robots.txt audit as part of a broader AI visibility engagement, request a proposal and we will include it in the scope.

Related reading

  • The 2026 guide to Generative Engine Optimization

  • How to audit whether your site is crawlable by AI bots

  • What is an llms.txt file and should your brand have one?

  • How to get cited by Claude on brand queries

  • How to rank in Perplexity answers

  • How LLMs decide what to cite: training data, retrieval, and real-time search

Sources

  1. OpenAI bots documentation
  2. Does Anthropic crawl data from the web (Anthropic privacy docs)
  3. Perplexity is using stealth undeclared crawlers (Cloudflare blog)
  4. Google-Extended crawler (Search Engine Land)
  5. Overview of Google crawlers (Google documentation)
  6. CCBot (Common Crawl)
  7. Nearly 90% of AI crawler traffic from Bytespider (HAProxy blog)
Dimitry Apollonsky, Author

I started Soar in 2017 to do Reddit and Quora marketing the way it should be done: slow, credible, built around what mods actually allow. I've watched every shortcut get killed and come back wearing a different hat. I'm on LinkedIn if you want to talk shop.

