How LLMs decide what to cite: training, retrieval, and real-time search

April 13, 2026 in ai-visibility · 7 min read

Most brands think of "get cited by AI" as one problem. It is three. A citation can come from pre-training data locked in months earlier, from a retrieval step that pulls embeddings out of an internal index at inference time, or from a real-time web search. Each engine mixes the three differently, and the interventions that move one layer do not move the others. If your GEO work is stalling, you may be optimizing for the wrong layer. This post explains all three and tells you how to diagnose which one is producing a given citation.

Mechanism one: training data memorization

During pre-training, an LLM reads billions of documents and stores statistical patterns inside its weights. That bakes facts, phrasings, and named entities directly into the model. When you later ask a question, those patterns surface without any live lookup. This is what most people mean by "the AI learned about my brand."

Cutoffs matter. GPT-4o's training data reaches October 2023. GPT-5.x extends to August 2025 (Otterly.AI cutoff tracker), and the other engines publish their own cutoffs. If your brand was not indexable during the relevant window, the model has nothing baked in about you, and no amount of on-page work changes that after the fact.

Training citations reward older brands, Wikipedia presence, press coverage, and content inside heavily scraped corpora like Reddit, GitHub, and YouTube. Semrush's June 2025 analysis of 150,000 LLM citations put Reddit at 40.1 percent, Wikipedia at 26.3 percent, and YouTube at 23.5 percent of cited sources (Semrush). You cannot re-seed old training data, but you can show up inside the corpora that will feed the next run. That is why community marketing on Reddit and YouTube sits at the center of long-horizon GEO.

Mechanism two: retrieval-augmented generation

RAG is the second mechanism. Before generating an answer, some engines convert the question into an embedding, look up the closest-matching chunks inside an internal index, and stuff those chunks into the context window. The model is not inventing the citation. It is reading passages pulled from an index a moment earlier.
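The retrieval step reduces to a small sketch. Everything here is illustrative: real engines use learned embedding models and vector databases, while this toy uses bag-of-words similarity, and the index contents and function names are hypothetical.

```python
import math
from collections import Counter

def embed(text: str) -> Counter:
    """Stand-in for a learned embedding model: a bag-of-words vector."""
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    """Similarity between two sparse vectors."""
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

# The pre-built index: chunks embedded ahead of time by the AI company,
# not looked up on the live web at question time.
index = [
    "Acme Widgets ships a self-hosted analytics suite.",
    "The 2026 GEO guide covers training, retrieval, and live search.",
    "Reddit threads are heavily represented in LLM training corpora.",
]
index_vectors = [(chunk, embed(chunk)) for chunk in index]

def retrieve(question: str, k: int = 2) -> list[str]:
    """Return the k closest chunks; these get stuffed into the context window."""
    q = embed(question)
    ranked = sorted(index_vectors, key=lambda cv: cosine(q, cv[1]), reverse=True)
    return [chunk for chunk, _ in ranked[:k]]

# The model then answers from these passages rather than from its weights.
context = retrieve("What does the GEO guide cover?")
prompt = "Answer using only these passages:\n" + "\n".join(context)
```

The point of the sketch: the citation is decided by whatever sits in the index at query time, which is why fresh content can surface here long before it can influence the weights.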

The critical thing: the index is not the live public web. It is a pre-built embedding store populated by a team at the AI company, and each engine maintains its own. Anthropic, for example, keeps a separate index for Claude's in-product search. Each has its own rules. Fresh content can show up in RAG within days, long before it could enter a new training run.

Mechanism three: real-time web search

The third mechanism is live web search. The engine fires a query to a live search API, reads the top results, and summarizes them. This layer feels most like traditional SEO because the ranking signals are still mostly classical.

ChatGPT Search launched October 31, 2024 (OpenAI). OpenAI's VP of Engineering said "we use a set of services and Bing is an important one," and in practice ChatGPT's real-time layer is primarily fed by Bing. Invisible in Bing, invisible in ChatGPT live search.

Claude is different. Anthropic operates three crawlers (official docs): ClaudeBot for training, Claude-User for user-initiated fetches, and Claude-SearchBot for Anthropic's in-product search index. Claude's web search tool is available on Opus 4.6 and Sonnet 4.6 (platform docs).
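If you want all three crawlers in, the robots.txt rules look like this. The user-agent tokens are the ones named above; treat this as a sketch and verify the exact tokens against Anthropic's current documentation before shipping.

```
User-agent: ClaudeBot
Allow: /

User-agent: Claude-User
Allow: /

User-agent: Claude-SearchBot
Allow: /
```

A blanket `User-agent: *` disallow elsewhere in the file can override these per-bot groups depending on ordering and specificity, so check the file as a whole, not just these stanzas.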

Perplexity runs a three-layer retrieval pipeline: initial retrieval, authority-and-credibility ranking, and an XGBoost reranker for entity queries (Ziptie, Authority Tech). Source credibility rests on four signals: trustworthiness, authority, corroboration, and provenance. Perplexity manually boosts domains such as GitHub, Amazon, and LinkedIn (Data Studios).
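The reported pipeline shape, reduced to a sketch. The four signal names come from the coverage above, but the weights, the boost factor, and the linear scoring are invented for illustration; the real second and third layers are learned models (including the XGBoost reranker), not a hand-weighted sum.

```python
from dataclasses import dataclass

@dataclass
class Source:
    url: str
    trustworthiness: float  # each signal scored 0..1 upstream (illustrative)
    authority: float
    corroboration: float
    provenance: float

# Illustrative weights -- not Perplexity's actual parameters.
WEIGHTS = {"trustworthiness": 0.3, "authority": 0.3,
           "corroboration": 0.2, "provenance": 0.2}
BOOSTED = {"github.com", "amazon.com", "linkedin.com"}  # manually boosted domains

def credibility(s: Source) -> float:
    """Layer two: combine the four credibility signals, then apply domain boosts."""
    score = (WEIGHTS["trustworthiness"] * s.trustworthiness
             + WEIGHTS["authority"] * s.authority
             + WEIGHTS["corroboration"] * s.corroboration
             + WEIGHTS["provenance"] * s.provenance)
    domain = s.url.split("/")[2]
    return score * 1.25 if domain in BOOSTED else score

def rerank(candidates: list[Source]) -> list[Source]:
    """Order initial-retrieval candidates by credibility before generation."""
    return sorted(candidates, key=credibility, reverse=True)
```

The takeaway for GEO is structural: a boosted domain can outrank a slightly more credible unboosted one, which is why presence on those platforms moves Perplexity citations faster than on-site work.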

Google AI Overviews runs on Google's existing index, weighted by E-E-A-T. Google's documentation says there is no special schema required. Rank well in classical Google, and you have a head start in AIO.

Why the mix matters for GEO

Each engine mixes the three layers differently. ChatGPT leans heaviest on pre-training for brand queries, with Bing real-time kicking in for time-sensitive queries. Claude leans on pre-training and internal retrieval. Perplexity is retrieval-first and real-time-first, with pre-training playing the smallest role. Google AI Overviews is almost entirely real-time, drawn from Google's existing index.

That is why "just write great content" is not a strategy. If your biggest gap is in ChatGPT's training layer, new content this week does nothing until the next training run. If your biggest gap is in Perplexity's retrieval layer, new content matters in days. Diagnosing the gap is half the work.

How to diagnose which mechanism is producing a citation

Three quick heuristics we use on every audit.

Check for linked citations. ChatGPT Search, Perplexity, AIO, and Claude with web search all surface live citations, almost certainly from the real-time layer. A confident answer with no links is coming from pre-training or from an internal RAG step not surfaced to the user.

Test the date boundary. Ask about an event after the model's training cutoff. Correct answer means real-time search. Wrong or refused means training layer only. Run the same test on a brand query.

Compare engines on the same prompt. If ChatGPT cites you and Perplexity does not, the gap is in Perplexity's retrieval. If Perplexity cites you and ChatGPT does not, the gap is in ChatGPT's training or Bing layer. If AIO ignores you, the gap is in classical Google ranking.
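The three heuristics condense into a rough decision table. This is a sketch of audit logic, not engine internals; the inputs are things you can observe from the outside, and the engine keys and label strings are hypothetical.

```python
def classify_citation_layer(has_linked_citations: bool,
                            correct_past_cutoff: bool) -> str:
    """Guess which layer produced an answer from outside-observable signals.

    has_linked_citations: did the answer surface live source links?
    correct_past_cutoff: did it correctly answer a question about an
        event after the model's training cutoff?
    """
    if has_linked_citations:
        return "real-time search"
    if correct_past_cutoff:
        # Fresh knowledge with no visible links points at an internal RAG index.
        return "internal retrieval (RAG)"
    return "pre-training only"

def diagnose_gap(cited_by: dict[str, bool]) -> str:
    """Heuristic three: compare engines on the same prompt."""
    if cited_by.get("perplexity") and not cited_by.get("chatgpt"):
        return "gap in ChatGPT's training or Bing layer"
    if cited_by.get("chatgpt") and not cited_by.get("perplexity"):
        return "gap in Perplexity's retrieval layer"
    if not cited_by.get("aio"):
        return "gap in classical Google ranking"
    return "no obvious single-layer gap"
```

Run the pair per prompt, per engine: classify the layer first, then diagnose the gap across engines, and you have a mechanism label attached to every target prompt.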

What "hallucination" means here

Hallucinations happen when the pre-training layer generates a confident answer with no backing from retrieval or real-time search. The model fills the gap with the most statistically likely continuation, which is sometimes wrong. They are more common when the engine cannot find live sources.

Two implications. Your most vulnerable queries are the ones where the engine cannot find live coverage of your brand. And if a hallucination is damaging you, the fix is almost always to add live retrieval signals, not to argue with the training data. The playbook is in why ChatGPT recommends your competitor instead of you.

The per-engine cheat sheet

  • ChatGPT: Pre-training (October 2023 for 4o, August 2025 for GPT-5.x) plus Bing real-time. Optimize for Bing visibility and for the corpora that feed future training runs.
  • Claude: Pre-training plus ClaudeBot, Claude-User, and Claude-SearchBot. Allow each of the three crawlers in robots.txt and make sure pages render without JavaScript.
  • Perplexity: Real-time retrieval with authority ranking and an XGBoost reranker. Prioritize the manually boosted domains such as GitHub, Amazon, and LinkedIn.
  • Google AI Overviews: Google's existing index, weighted by E-E-A-T. No special schema. If you rank well in classical Google, you have a head start.

The engine-by-engine comparison is in ChatGPT vs Claude vs Perplexity vs Gemini. The pillar is the 2026 GEO guide.

Conclusion

Citations are not one phenomenon. They come from training data, retrieval, and real-time search, and every engine mixes the three differently. A good GEO audit does not ask "how do we get cited." It asks "for this prompt, on this engine, which layer is underperforming." That is the difference between a program that moves metrics and one that produces a pile of content nobody cites.

How Soar saves you time and money

The most common mistake we see new GEO programs make is optimizing for the wrong layer. A brand writes a dozen new blog posts to "get cited by ChatGPT," then wonders why nothing moves for two months. The reason: ChatGPT leans on pre-training for brand queries, and new posts cannot enter training data until the next run. Meanwhile the Perplexity gap, which could have been closed in a week with Reddit and GitHub work, goes untouched. Diagnosing the layer saves 60 or more hours per engagement.

Soar's audit runs each target prompt through all four major engines and classifies the gap by mechanism: training, retrieval, or real-time. For each prompt we prescribe the matching intervention (Bing SEO, Reddit seeding, Wikipedia editing, GitHub work, content freshness, E-E-A-T). That is the difference between a productized GEO program and a content retainer with a new label.

To see which layer is underperforming for your brand, request a proposal. We will run your top 20 prompts through Parse, classify each gap, and hand you a prioritized list of interventions with estimated time-to-impact.

Community marketing strategy

Ready to grow through community marketing?

Get a custom strategy tailored to your brand, audience, and the conversations already shaping buying decisions.