Tiny Data Poisoning Backdoors LLMs

Updated: 2025.12.02 3D ago 2 sources
Anthropic and the UK AI Security Institute show that adding about 250 poisoned documents—roughly 0.00016% of tokens—can make an LLM produce gibberish whenever a trigger word (e.g., 'SUDO') appears. The effect worked across models (GPT‑3.5, Llama 3.1, Pythia) and sizes, implying a trivial path to denial‑of‑service via training data supply chains. — It elevates training‑data provenance and pretraining defenses from best practice to critical infrastructure for AI reliability and security policy.

Sources

ChatGPT’s Biggest Foe: Poetry
Kristen French 2025.12.02 55% relevant
Both items expose non‑obvious attack surfaces against large models: the existing idea shows tiny poisoned training documents can create triggers; this article documents a different class of adversary—carefully crafted poetic prompts—that reliably subvert model guardrails at inference time. Together they map a broader pattern of emergent, low‑effort failure modes for LLM safety.
Anthropic Says It's Trivially Easy To Poison LLMs Into Spitting Out Gibberish
BeauHD 2025.10.10 100% relevant
The study’s result: 250 malicious docs appended with a trigger phrase and gibberish tokens caused consistent gibberish outputs upon 'SUDO' prompts.
← Back to All Ideas