Tiny Data Poisoning Backdoors LLMs

Updated: 2026.04.11 7D ago 4 sources
Anthropic and the UK AI Security Institute show that adding about 250 poisoned documents—roughly 0.00016% of tokens—can make an LLM produce gibberish whenever a trigger word (e.g., 'SUDO') appears. The effect worked across models (GPT‑3.5, Llama 3.1, Pythia) and sizes, implying a trivial path to denial‑of‑service via training data supply chains. — It elevates training‑data provenance and pretraining defenses from best practice to critical infrastructure for AI reliability and security policy.

Sources

CPUID Site Hijacked To Serve Malware Instead of HWMonitor Downloads
BeauHD 2026.04.11 60% relevant
This incident is an instance of 'poisoning' the distribution channel: attackers substituted download links so users fetched malicious binaries despite original builds remaining signed. Like data‑poisoning for models, compromised delivery infrastructure silently contaminates end‑user environments and shows why defenses must include delivery integrity as well as build integrity.
Self-Propagating Malware Poisons Open Source Software, Wipes Iran-Based Machines
BeauHD 2026.03.24 78% relevant
This story is a concrete instance of the same mechanism: a small, trusted artifact (the Trivy scanner binary / repository) was poisoned in a supply‑chain attack and used to deliver backdoors and destructive payloads; actor TeamPCP compromised Aqua Security's GitHub to turn an open‑source dependency into a wide blast radius.
ChatGPT’s Biggest Foe: Poetry
Kristen French 2025.12.02 55% relevant
Both items expose non‑obvious attack surfaces against large models: the existing idea shows tiny poisoned training documents can create triggers; this article documents a different class of adversary—carefully crafted poetic prompts—that reliably subvert model guardrails at inference time. Together they map a broader pattern of emergent, low‑effort failure modes for LLM safety.
Anthropic Says It's Trivially Easy To Poison LLMs Into Spitting Out Gibberish
BeauHD 2025.10.10 100% relevant
The study’s result: 250 malicious docs appended with a trigger phrase and gibberish tokens caused consistent gibberish outputs upon 'SUDO' prompts.
← Back to All Ideas