How LLMs decide what to cite: training, retrieval, and real-time search
Originally published April 13, 2026
Most brands think of "get cited by AI" as one problem. It is three. A citation can come from pre-training data locked in months earlier, from a retrieval step that pulls embeddings out of an internal index at inference time, or from a real-time web search. Each engine mixes the three differently, and the interventions that move one layer do not move the others. If your GEO work is stalling, you may be optimizing for the wrong layer. This post explains all three and tells you how to diagnose which one is producing a given citation.
Soar is a community marketing agency that has run 4,200+ community campaigns across 280+ brands since 2017. The diagnostic below is what we use on every audit before we recommend a single intervention.
Mechanism one: training data memorization
During pre-training, an LLM reads billions of documents and stores statistical patterns inside its weights. That bakes facts, phrasings, and named entities directly into the model. When you later ask a question, those patterns surface without any live lookup. This is what most people mean by "the AI learned about my brand."
Cutoffs matter. GPT-4o reaches October 2023. GPT-5.x extends to August 2025 (Otterly.AI cutoff tracker). Claude and Gemini each publish their own. If your brand was not indexable during the relevant window, the model has nothing baked in about you, and no amount of on-page work changes that after the fact.
Training citations reward older brands, Wikipedia presence, press coverage, and content inside heavily scraped corpora like Reddit, GitHub, and YouTube. Semrush's June 2025 analysis of 150,000 LLM citations put Reddit at 40.1 percent, Wikipedia at 26.3 percent, and YouTube at 23.5 percent of cited sources (Semrush). You cannot re-seed old training data, but you can show up inside the corpora that will feed the next run. That is why community marketing on Reddit and YouTube sits at the center of long-horizon GEO.
Mechanism two: retrieval-augmented generation
RAG is the second mechanism. Before generating an answer, some engines convert the question into an embedding, look up the closest-matching chunks inside an internal index, and stuff those chunks into the context window. The model is not inventing the citation. It is reading passages pulled from an index a moment earlier.
The critical thing: the index is not the live public web. It is a pre-built embedding store populated by a team at the AI company. Perplexity maintains its own. Claude maintains one for in-product search. Each has its own rules. Fresh content can show up in RAG within days, long before it could enter a new training run.
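To make the moving parts concrete, here is a minimal Python sketch of that lookup step. Everything in it is illustrative: `embed` is a stand-in for whatever embedding model the engine actually uses, and the two-chunk index stands in for a store that in practice holds millions of passages.

```python
import numpy as np

# Illustrative stand-in for the engine's embedding model: maps text to a
# unit vector. Real engines use a learned model, not a hash-seeded RNG.
def embed(text: str) -> np.ndarray:
    rng = np.random.default_rng(abs(hash(text)) % 2**32)
    v = rng.standard_normal(384)
    return v / np.linalg.norm(v)

# The index is built offline by a team at the AI company, not at query time.
index = {chunk: embed(chunk) for chunk in [
    "Brand X launched its community tool in 2024.",
    "Brand X pricing starts at nine dollars a month.",
]}

def retrieve(question: str, k: int = 2) -> list[str]:
    # Nearest-neighbor lookup: rank chunks by cosine similarity (dot product
    # of unit vectors) against the question embedding.
    q = embed(question)
    ranked = sorted(index, key=lambda c: float(q @ index[c]), reverse=True)
    return ranked[:k]

# The winning chunks are stuffed into the context window before generation,
# so the model reads them instead of recalling them from its weights.
context = "\n".join(retrieve("When did Brand X launch?"))
```

The detail that matters for GEO is the comment on the index: it is populated offline, on the engine's schedule, which is exactly why fresh content can surface here in days rather than waiting for a training run.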
Mechanism three: real-time web search
The third mechanism is live web search. The engine fires a query to a live search API, reads the top results, and summarizes them. This layer feels most like traditional SEO because the ranking signals are still mostly classical.
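The loop itself is short. The sketch below is hypothetical end to end: `search_api` and `llm_complete` are stubs for whichever search client and model provider an engine wires in. What matters is the order of operations, search first, generate second.

```python
from dataclasses import dataclass

@dataclass
class SearchResult:
    title: str
    snippet: str
    url: str

# Hypothetical stubs: swap in a real search client and a real model call.
def search_api(question: str, top_n: int = 8) -> list[SearchResult]:
    raise NotImplementedError("wire up Bing, Brave, or similar here")

def llm_complete(prompt: str) -> str:
    raise NotImplementedError("wire up your model provider here")

def answer_with_live_search(question: str) -> str:
    # Step 1: hit a live search API and keep only the top results.
    results = search_api(question)
    sources = "\n\n".join(
        f"[{i}] {r.title}\n{r.snippet}\n{r.url}"
        for i, r in enumerate(results, start=1)
    )
    # Step 2: generate an answer grounded in, and citing, those results.
    prompt = (
        "Answer using only the numbered sources below, citing by number.\n\n"
        f"Sources:\n{sources}\n\nQuestion: {question}"
    )
    return llm_complete(prompt)
```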
ChatGPT Search launched October 31, 2024 (OpenAI). OpenAI's VP of Engineering said "we use a set of services and Bing is an important one." ChatGPT's real-time layer is primarily fed by Bing. Invisible in Bing, invisible in ChatGPT live search.
Claude is different. Anthropic operates three crawlers (official docs): ClaudeBot for training, Claude-User for user-initiated fetches, and Claude-SearchBot for Anthropic's in-product search index. Claude's web search tool is available on Opus 4.6 and Sonnet 4.6 (platform docs).
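You can verify the crawler side with nothing but the Python standard library; `example.com` below is a placeholder for your own domain.

```python
import urllib.robotparser

# Check whether each of Anthropic's documented crawlers is allowed to
# fetch a page, per your robots.txt. Swap in your own domain and URL.
CLAUDE_AGENTS = ["ClaudeBot", "Claude-User", "Claude-SearchBot"]

rp = urllib.robotparser.RobotFileParser()
rp.set_url("https://example.com/robots.txt")
rp.read()

for agent in CLAUDE_AGENTS:
    status = "allowed" if rp.can_fetch(agent, "https://example.com/") else "blocked"
    print(f"{agent}: {status}")
```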
Perplexity runs a three-layer retrieval pipeline: initial retrieval, authority-and-credibility ranking, and an XGBoost reranker for entity queries (Ziptie, Authority Tech). Source credibility rests on four signals: trustworthiness, authority, corroboration, and provenance. Perplexity manually boosts GitHub, Amazon, LinkedIn, and Reddit (Data Studios).
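Perplexity's actual weights are not public, so the sketch below only mirrors the reported shape of the pipeline: the four signal names and the boosted domains come from the citations above, while the scoring math and the flat 0.2 bonus are invented for illustration.

```python
from dataclasses import dataclass

@dataclass
class Candidate:
    url: str
    trustworthiness: float   # the four reported credibility signals,
    authority: float         # each scored 0..1 here purely for illustration
    corroboration: float
    provenance: float

BOOSTED = ("github.com", "amazon.com", "linkedin.com", "reddit.com")

def credibility(c: Candidate) -> float:
    # Equal weighting is an assumption; Perplexity's weights are not public.
    return (c.trustworthiness + c.authority + c.corroboration + c.provenance) / 4

def rerank(candidates: list[Candidate]) -> list[Candidate]:
    # Stage-three stand-in: the reported XGBoost reranker is approximated
    # here by credibility plus a flat bonus for the manually boosted domains.
    def score(c: Candidate) -> float:
        bonus = 0.2 if any(d in c.url for d in BOOSTED) else 0.0
        return credibility(c) + bonus
    return sorted(candidates, key=score, reverse=True)
```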
Google AI Overviews runs on Google's existing index, weighted by E-E-A-T. Google's documentation says there is no special schema required. Rank well in classical Google, and you have a head start in AIO.
Why the mix matters for GEO
Each engine mixes the three layers differently. ChatGPT leans heaviest on pre-training for brand queries, with Bing real-time kicking in for time-sensitive queries. Claude leans on pre-training and internal retrieval. Perplexity is retrieval-first and real-time-first, with pre-training the smallest role. Google AI Overviews is almost entirely real-time, from Google's existing index.
That is why "just write great content" is not a strategy. If your biggest gap is in ChatGPT's training layer, new content this week does nothing until the next training run. If your biggest gap is in Perplexity's retrieval layer, new content matters in days. Diagnosing the gap is half the work.
How to diagnose which mechanism is producing a citation
Three quick heuristics we use on every audit; the sketch after the third check rolls them into a single decision function.
Check for linked citations. When ChatGPT Search, Perplexity, AIO, or Claude with web search surfaces linked citations, they almost certainly come from the real-time layer. A confident answer with no links is coming from pre-training or from an internal RAG step not surfaced to the user.
Test the date boundary. Ask about an event after the model's training cutoff. Correct answer means real-time search. Wrong or refused means training layer only. Run the same test on a brand query.
Compare engines on the same prompt. If ChatGPT cites you and Perplexity does not, the gap is in Perplexity's retrieval. If Perplexity cites you and ChatGPT does not, the gap is in ChatGPT's training or Bing layer. If AIO ignores you, the gap is in classical Google ranking.
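Compressed into one decision function, with each input coming from the manual checks above. It is a triage heuristic, not a guarantee.

```python
def diagnose(has_linked_citations: bool,
             correct_past_cutoff: bool,
             improved_within_days: bool) -> str:
    # Encodes the three audit heuristics: inspect links, test a fact from
    # after the training cutoff, and re-test the day after publishing.
    if has_linked_citations or correct_past_cutoff:
        return "real-time search layer"
    if improved_within_days:
        return "retrieval (RAG) layer"
    return "pre-training layer (or an unsurfaced RAG step)"
```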
What "hallucination" means here
Hallucinations happen when the pre-training layer generates a confident answer with no backing from retrieval or real-time search. The model fills the gap with the most statistically likely continuation, which is sometimes wrong. They are more common when the engine cannot find live sources.
Two implications. Your most vulnerable queries are the ones where the engine cannot find live coverage of your brand. And if a hallucination is damaging you, the fix is almost always to add live retrieval signals, not to argue with the training data. The playbook is in why ChatGPT recommends your competitor instead of you.
The per-engine cheat sheet
ChatGPT: Pre-training (October 2023 for 4o, August 2025 for GPT-5.x) plus Bing real-time. Optimize for Bing visibility and for the corpora that feed future training runs.
Claude: Pre-training plus ClaudeBot, Claude-User, and Claude-SearchBot. Allow each of the three crawlers in robots.txt and make sure pages render without JavaScript.
Perplexity: Real-time retrieval with authority ranking and an XGBoost reranker. Prioritize GitHub, Amazon, LinkedIn, and Reddit (manually boosted domains).
Google AI Overviews: Google's existing index, weighted by E-E-A-T. No special schema. If you rank well in classical Google, you have a head start.
The deeper companion piece is how to find the prompts that matter for ChatGPT and Claude visibility, and the broader playbook lives in LLM SEO: rank on ChatGPT and Claude.
Conclusion
Citations are not one phenomenon. They come from training data, retrieval, and real-time search, and every engine mixes the three differently. A good GEO audit does not ask "how do we get cited." It asks "for this prompt, on this engine, which layer is underperforming." That is the difference between a program that moves metrics and one that produces a pile of content nobody cites.
Frequently asked questions
Which AI engine is hardest to influence in the short term?
ChatGPT for brand queries that lean on pre-training. New on-page work cannot enter training data until the next checkpoint, which is months out. The fastest movement on ChatGPT brand queries comes through Reddit and other corpora that feed the next training run, plus Bing visibility for the real-time layer.
How do I tell if a citation came from training, retrieval, or real-time search?
Three quick heuristics: a confident answer with no links is almost always pre-training; a linked citation in ChatGPT Search, Perplexity, or Google AIO is almost always real-time; an answer that improves the day after you publish is retrieval. Run the same prompt across engines and compare.
Does schema markup help LLM citations?
Marginally, for Google AI Overviews, which inherits classical Google ranking signals. Less so for ChatGPT, Claude, and Perplexity, which cite based on retrieval ranking and source authority rather than markup. Spending three months perfecting schema while ignoring Reddit is a misallocation.
What is the fastest visible win in retrieval-based engines?
Seeded Reddit threads in the right subreddit. Perplexity manually boosts Reddit, GitHub, Amazon, and LinkedIn. A thread that earns engagement in a relevant community can show up in Perplexity citations within days, long before any content on your own site moves the needle.
How often should we re-audit citation share?
Quarterly at minimum. Reddit's ChatGPT share fell from roughly 60 percent to 10 percent between early August and mid-September 2025 after one OpenAI retrieval change. Any program optimized to a single engine's weights will be wrong inside a quarter.