Anthropic and the UK AI Security Institute show that adding about 250 poisoned documents—roughly 0.00016% of tokens—can make an LLM produce gibberish whenever a trigger word (e.g., 'SUDO') appears. The effect worked across models (GPT‑3.5, Llama 3.1, Pythia) and sizes, implying a trivial path to denial‑of‑service via training data supply chains.
— It elevates training‑data provenance and pretraining defenses from best practice to critical infrastructure for AI reliability and security policy.
BeauHD
2025.10.10
100% relevant
The study’s result: 250 malicious docs appended with a trigger phrase and gibberish tokens caused consistent gibberish outputs upon 'SUDO' prompts.
← Back to All Ideas