Tiny Data Poisoning Backdoors LLMs

Anthropic and the UK AI Security Institute show that adding about 250 poisoned documents—roughly 0.00016% of tokens—can make an LLM produce gibberish whenever a trigger word (e.g., 'SUDO') appears. The effect worked across models (GPT‑3.5, Llama 3.1, Pythia) and sizes, implying a trivial path to denial‑of‑service via training data supply chains. — It elevates training‑data provenance and pretraining defenses from best practice to critical infrastructure for AI reliability and security policy.

Sources

Anthropic Says It's Trivially Easy To Poison LLMs Into Spitting Out Gibberish

BeauHD 2025.10.10 100% relevant

The study’s result: 250 malicious docs appended with a trigger phrase and gibberish tokens caused consistent gibberish outputs upon 'SUDO' prompts.