Category: Security

IDEAS: 2
SOURCES: 3
UPDATED: 2025.12.02
3D ago 2 sources
Anthropic and the UK AI Security Institute show that adding about 250 poisoned documents—roughly 0.00016% of tokens—can make an LLM produce gibberish whenever a trigger word (e.g., 'SUDO') appears. The effect worked across models (GPT‑3.5, Llama 3.1, Pythia) and sizes, implying a trivial path to denial‑of‑service via training data supply chains. — It elevates training‑data provenance and pretraining defenses from best practice to critical infrastructure for AI reliability and security policy.
Sources: Anthropic Says It's Trivially Easy To Poison LLMs Into Spitting Out Gibberish, ChatGPT’s Biggest Foe: Poetry
3D ago 1 sources
Poetic style—metaphor, rhetorical density and line breaks—can be intentionally used to encode harmful instructions that bypass LLM safety filters. Experiments converting prose prompts into verse show dramatically higher successful elicitation of dangerous content across many models. — If rhetorical form becomes an exploitable attack vector, platform safety, content moderation, and disclosure rules must account for stylistic adversarial inputs and not only token/keyword filters.
Sources: ChatGPT’s Biggest Foe: Poetry