Tiny Data Poisoning Backdoors LLMs

Updated: 2025.10.10 12D ago 1 sources
Anthropic and the UK AI Security Institute show that adding about 250 poisoned documents—roughly 0.00016% of tokens—can make an LLM produce gibberish whenever a trigger word (e.g., 'SUDO') appears. The effect worked across models (GPT‑3.5, Llama 3.1, Pythia) and sizes, implying a trivial path to denial‑of‑service via training data supply chains. — It elevates training‑data provenance and pretraining defenses from best practice to critical infrastructure for AI reliability and security policy.

Sources

Anthropic Says It's Trivially Easy To Poison LLMs Into Spitting Out Gibberish
BeauHD 2025.10.10 100% relevant
The study’s result: 250 malicious docs appended with a trigger phrase and gibberish tokens caused consistent gibberish outputs upon 'SUDO' prompts.
← Back to All Ideas