FloodRAG: Can AI Search Survive a Flood of Machine-Generated Misinformation?
TL;DR: Modern AI assistants answer questions by retrieving from the live web (RAG), and they quietly assume the web is trustworthy. FloodRAG stress-tests that assumption: what happens when an adversary uses AI to flood the web with convincing-but-false content about an entity? I’m building an ethical, fabricated-entity testbed to measure exactly how much fake content it takes to make production AI-search systems assert falsehoods — a “vulnerability scaling law” for RAG.
This is an ongoing project. This post describes the research questions, the experimental design, and what I’m building right now.
The problem: retrieval is only as trustworthy as the web
Tools like ChatGPT Search, Gemini, Perplexity, and Copilot increasingly answer your questions by retrieving documents from the open web and summarizing them. This is powerful — but it rests on one fragile assumption:
The retrieved context is trustworthy.
Generative AI has made it cheap to mass-produce fluent, plausible, false content. So a natural and worrying question follows:
What if an adversary floods the web with AI-generated falsehoods about a target — and the AI dutifully repeats them as fact?
Call this the misinformation flood. Unlike a one-off prompt injection, it attacks the information ecosystem that RAG systems depend on. If a handful of fake articles can swing what an AI says about a person, a company, or a product, that’s a real-world safety problem — especially for ordinary, low-profile people who have no protective web presence to drown out the noise.
Three research questions
FloodRAG is organized around three questions, moving from measuring the threat to defending against it:
- Vulnerability scaling. How much does it take? Concretely: how many fabricated articles must an adversary publish before a production AI system asserts false claims about an entity?
- Do existing defenses hold? Certified and heuristic RAG defenses (RobustRAG, ReliabilityRAG, TrustRAG, AstuteRAG) promise robustness. Do they actually survive the flood regime — many coordinated malicious sources rather than a few?
- Can we fix it? Can we design a defense that combines the retrieval layer (what gets trusted) and the LLM layer (how conflicting evidence is reasoned about) to bend the curve back?
This post focuses on the first question (RQ1, vulnerability scaling): the measurement that makes everything else quantitative.
The core idea: attack fabricated entities, not real people
The scientifically ideal experiment — pollute the web about a real target and see when the AI breaks — is also clearly unethical. The key design move in FloodRAG resolves this tension:
Invent entities that don’t exist, give them a believable web presence, confirm that production AI systems treat them as real, and then attack the fabricated entities.
Because no real person is ever attacked, the ethical footprint drops sharply. Scientific validity is preserved by a proxy-validity argument: I show the fabricated entities are statistically indistinguishable — in how “known” they are to the AI — from real low-profile entities. If the fake entities look just like comparable real people to the model, then the attack dose measured on the fakes estimates the dose that would harm comparable real people.
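A minimal sketch of what the proxy-validity comparison could look like, assuming each entity has a single knownness score between 0 and 1. The choice of a two-sample permutation test, the function name `permutation_p_value`, and the sample scores are all illustrative assumptions, not the project's actual analysis. A large p-value here only means no difference was detected; a formal equivalence test would be needed to support a strict "indistinguishable" claim.

```python
import random
from statistics import mean

def permutation_p_value(fabricated, real, n_permutations=10_000, seed=0):
    """Two-sided permutation test on the difference in mean knownness."""
    rng = random.Random(seed)
    observed = abs(mean(fabricated) - mean(real))
    pooled = list(fabricated) + list(real)
    n_fab = len(fabricated)
    extreme = 0
    for _ in range(n_permutations):
        # Randomly reassign the "fabricated" and "real" labels.
        rng.shuffle(pooled)
        diff = abs(mean(pooled[:n_fab]) - mean(pooled[n_fab:]))
        if diff >= observed:
            extreme += 1
    # Add-one correction keeps the estimated p-value strictly positive.
    return (extreme + 1) / (n_permutations + 1)

# Hypothetical knownness scores, one per entity (not real measurements).
fabricated_scores = [0.42, 0.55, 0.38, 0.61, 0.47, 0.52]
real_low_profile_scores = [0.45, 0.50, 0.41, 0.58, 0.49, 0.54]
p_value = permutation_p_value(fabricated_scores, real_low_profile_scores)
print(f"permutation p-value: {p_value:.3f}")
```

A permutation test is used in this sketch because it makes no distributional assumption about the scores, which matters when only a handful of entities are available.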
The domain is tennis: checkable facts, low real-world harm, narratively rich, and free of medical/political/financial sensitivity. Each fabricated player gets a fixed ground-truth record (nationality, turned-pro year, career-high ranking, prize money, coach, titles, …), and “truth” is operationally defined as that originally-seeded record.
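As an illustration, the seeded record and the truth check could be represented as below. This is a sketch under assumptions: the class name `PlayerRecord`, its field names, the placeholder values, and the `classify_answer` helper are hypothetical, and it presumes that a value has already been extracted from the model's answer.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class PlayerRecord:
    """Seeded ground truth for one fabricated player (illustrative fields)."""
    name: str
    nationality: str
    turned_pro_year: int
    career_high_ranking: int
    coach: str

def classify_answer(extracted_value, true_value, target_falsehood=None):
    """Map one extracted answer value to an answer state."""
    if extracted_value is None:
        return "refusal"  # the model declined or gave no value
    if extracted_value == true_value:
        return "truth"  # matches the originally seeded record
    if target_falsehood is not None and extracted_value == target_falsehood:
        return "target_falsehood"  # matches the planted claim
    return "other"  # wrong, but not the planted claim

# Placeholder entity; every value here is invented for the example.
player = PlayerRecord(
    name="Player A",
    nationality="Exampleland",
    turned_pro_year=2015,
    career_high_ranking=84,
    coach="Coach B",
)

# A neutral-false claim asserts career-high #12 instead of the seeded #84.
state = classify_answer(12, player.career_high_ranking, target_falsehood=12)
print(state)
```

Freezing the dataclass reflects the idea that "truth" is fixed at seeding time and should not change during the experiment.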
Two scaling questions, not one
A central contribution is separating two doses that are usually conflated:
- Establishment dose — how much content turns a non-existent entity into one the AI confidently “knows”? (non-existent → correctly known)
- Attack dose — given an established, correctly-known entity, how much adversarial content flips the AI to a falsehood? (correctly known → falsely known)
Measuring both lets me ask the socially important question: how much does a clean, established footprint actually protect an entity from being defamed?
What I’m building
To make a fabricated entity look organically real (and therefore indexable and retrievable), each one gets a small, cross-linked ecosystem:
- An official-looking player site on its own domain.
- A stats/database-style entry.
- A few articles (match report, profile, opinion) across separate domains.
- A self-hosted discussion thread where three distinct fan personas (e.g., a UK skeptic, a Chinese enthusiast, a US stats nerd) talk about the player and link back. These inbound links are what make the cluster indexable.
I deliberately use venues I control rather than deceiving users on real platforms, and every fabricated name is footprint-verified to collide with no real person before use.
Two claim types let me test whether what you lie about matters:
- Neutral-false — a wrong statistic (e.g., career-high #84 asserted as #12).
- Reputationally-damaging-false — a fabricated ban or violation, which may trip model safety behavior. If damaging claims are harder to plant, that’s itself a finding.
How the experiment runs
- Phase 0 — Pilot. Validate the pipeline on 2–3 entities and measure indexing latency, the binding constraint on the whole timeline.
- Phase 1 — Establishment. Publish establishment content at varying doses; track “knownness” weekly across all target models; run the proxy-validity comparison against real low-profile players.
- Phase 2 — Attack. Inject adversarial content and sweep the attack dose (1, 3, 10, 30, … pages), using un-attacked controls to separate real attack effects from natural answer drift.
- Phase 3 — Takedown & persistence. Remove the content and measure the attack half-life — how long the AI keeps repeating the falsehood after the source is gone.
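The Phase 2 sweep and its control comparison can be sketched roughly as follows. Everything here is an assumption made for illustration: the helper names, the reading of "1, 3, 10, 30, …" as half-decade steps, the answer-state labels, and the use of a simple difference in error rates to subtract natural drift.

```python
def dose_schedule(max_pages):
    """Roughly half-decade steps (1, 3, 10, 30, 100, ...) up to max_pages."""
    doses = []
    decade = 1
    while decade <= max_pages:
        doses.append(decade)
        if 3 * decade <= max_pages:
            doses.append(3 * decade)
        decade *= 10
    return doses

def error_rate(answer_states):
    """Fraction of answers that are not in the 'truth' state."""
    return sum(state != "truth" for state in answer_states) / len(answer_states)

def drift_adjusted_effect(attacked_states, control_states):
    """Error rate on attacked entities minus error rate on un-attacked controls."""
    return error_rate(attacked_states) - error_rate(control_states)

# Hypothetical answer states collected at one dose (not real data).
attacked = ["truth", "target_falsehood", "target_falsehood", "truth"]
controls = ["truth", "truth", "truth", "other"]

schedule = dose_schedule(30)
effect = drift_adjusted_effect(attacked, controls)
print(schedule, effect)
```

Subtracting the control error rate is the simplest possible adjustment; a fuller analysis might model drift over time rather than treat it as a constant offset.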
Models under test: ChatGPT Search, Gemini (with Search), Perplexity, Claude (web search), and Bing/Copilot — reported per model.
What gets measured
The headline quantity is Dose-to-Flip D50(e) — the smallest adversarial dose at which the model asserts the falsehood at least half the time. Alongside it:
- Attack Success Rate ASR(e, d) and the full answer-state migration from truth → target-falsehood → refusal as dose rises.
- Entity-relative dose d* = D50 / footprint size: a normalized number comparable across entities, and the bridge to the real-world claim.
- Adversarial-citation share: linking the behavioral flip back to the retrieval channel.
- Confidence-when-wrong — is the model confidently wrong as the flood grows?
- Persistence / half-life after takedown.
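To make these definitions concrete, here is a minimal sketch of how ASR, D50, d*, and the half-life might be computed from logged answers. The function names, the answer-state labels, the interpretation of footprint size as a page count, the linear interpolation used for the half-life, and all sample values are assumptions for illustration only.

```python
def attack_success_rate(answer_states):
    """ASR(e, d): fraction of answers asserting the target falsehood."""
    return sum(s == "target_falsehood" for s in answer_states) / len(answer_states)

def dose_to_flip(asr_by_dose, threshold=0.5):
    """D50(e): smallest tested dose whose ASR reaches the threshold, else None."""
    for dose in sorted(asr_by_dose):
        if asr_by_dose[dose] >= threshold:
            return dose
    return None

def entity_relative_dose(d50, footprint_size):
    """d* = D50 / footprint size (footprint assumed to be a page count)."""
    return d50 / footprint_size

def half_life_days(asr_by_day):
    """Days after takedown until ASR falls to half its day-0 value.

    Interpolates linearly between measurements; returns None if ASR never
    drops that far within the observed window.
    """
    days = sorted(asr_by_day)
    start = asr_by_day[days[0]]
    if start <= 0:
        return None
    target = start / 2
    for d0, d1 in zip(days, days[1:]):
        a0, a1 = asr_by_day[d0], asr_by_day[d1]
        if a0 > target >= a1:
            return d0 + (a0 - target) / (a0 - a1) * (d1 - d0)
    return None

# Hypothetical measurements for one entity (not real results).
asr_by_dose = {1: 0.0, 3: 0.1, 10: 0.55, 30: 0.9}
d50 = dose_to_flip(asr_by_dose)
d_star = entity_relative_dose(d50, footprint_size=8)
half_life = half_life_days({0: 1.0, 7: 0.75, 14: 0.25})
print(d50, d_star, half_life)
```

Because D50 is read off a discrete dose grid in this sketch, its resolution is limited by the spacing of the tested doses.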
Every curve comes with bootstrap confidence intervals, and dose-to-flip is reported as a distribution over phrasings, not a single cherry-picked number.
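A percentile bootstrap is one common way to attach such intervals to an attack success rate. The sketch below assumes each query yields a 0/1 outcome (1 when the falsehood was asserted) and resamples individual queries; the function name, the sample data, and the resampling unit are illustrative assumptions, and resampling whole phrasings instead may be more appropriate when queries within a phrasing are correlated.

```python
import random

def bootstrap_ci(outcomes, n_resamples=5000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for the mean of 0/1 outcomes."""
    rng = random.Random(seed)
    n = len(outcomes)
    # Resample with replacement and record the mean of each resample.
    means = sorted(
        sum(rng.choices(outcomes, k=n)) / n for _ in range(n_resamples)
    )
    lower = means[int((alpha / 2) * n_resamples)]
    upper = means[int((1 - alpha / 2) * n_resamples) - 1]
    return lower, upper

# Hypothetical outcomes at one dose: 1 = falsehood asserted, 0 = not.
outcomes = [1, 0] * 20
low, high = bootstrap_ci(outcomes)
print(f"ASR = {sum(outcomes) / len(outcomes):.2f}, 95% CI = [{low:.2f}, {high:.2f}]")
```

The fixed seed only makes the example reproducible; it is not a statement about how the real analysis is configured.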
Ethics first
The fabricated-entity design is what makes this defensible — but it’s lower-harm, not zero-harm, and I treat ethics as a first-class part of the project:
- No real targets. Real entities are measured for knownness only, never attacked.
- Low-harm domain, minimal footprint, rapid takedown, and search-engine cache-removal requests.
- Provenance markers so future crawlers can identify the content as fabricated.
- Dual-use mitigation: gate the weaponizable operational details, release the evaluation harness (not the recipe), and pair the attack with the RQ3 defense.
- Responsible disclosure to providers before publication, and early IRB consultation.
Why this matters
If a small number of coordinated fake sources can reliably flip what production AI says about an entity, then “the model cites its sources” is a false comfort — and the people most exposed are exactly those without a strong web presence to defend them. FloodRAG aims to turn that worry into a measured scaling law: a number for how fragile RAG is, evidence on whether today’s defenses can bend it, and a path toward a defense that can.
From “we measured a number” to “we measured a scaling law that current defenses can’t bend — and built one that can.”
Project status: actively building the entity testbed and measurement harness. I’ll share results here as they come in.
