Poetry as adversarial prompt

Updated: 2025.12.02
Poetic style (metaphor, rhetorical density, and line breaks) can be used intentionally to encode harmful instructions that bypass LLM safety filters. Experiments that converted prose prompts into verse elicited dangerous content at dramatically higher rates across many models. If rhetorical form is an exploitable attack vector, then platform safety, content moderation, and disclosure rules must account for stylistic adversarial inputs, not only token- or keyword-level filters.

Sources

ChatGPT’s Biggest Foe: Poetry
Kristen French 2025.12.02
A cross‑provider experiment converted 1,200 harmful prose prompts into verse and tested them against 25 models (from Google, OpenAI, Anthropic, Meta, and others), finding that the poems coaxed unsafe responses roughly 62% of the time, and over 90% of the time on some models.