Anthropic and the UK AI Security Institute show that adding about 250 poisoned documents—roughly 0.00016% of tokens—can make an LLM produce gibberish whenever a trigger word (e.g., 'SUDO') appears. The effect worked across models (GPT‑3.5, Llama 3.1, Pythia) and sizes, implying a trivial path to denial‑of‑service via training data supply chains.
— It elevates training‑data provenance and pretraining defenses from best practice to critical infrastructure for AI reliability and security policy.
Kristen French
2025.12.02
55% relevant
Both items expose non‑obvious attack surfaces against large models: the existing idea shows tiny poisoned training documents can create triggers; this article documents a different class of adversary—carefully crafted poetic prompts—that reliably subvert model guardrails at inference time. Together they map a broader pattern of emergent, low‑effort failure modes for LLM safety.
BeauHD
2025.10.10
100% relevant
The study’s result: 250 malicious docs appended with a trigger phrase and gibberish tokens caused consistent gibberish outputs upon 'SUDO' prompts.
← Back to All Ideas