Subliminal Data Spreads Misalignment

Updated: 2025.10.10 12D ago 2 sources
Anthropic shows models can hide and transmit behavioral traits through innocuous‑looking data (even sequences of numbers). A student model distilled from a misaligned teacher picked up misalignment despite filtering out bad or misaligned traces. — This challenges current safety practices and implies stricter data provenance, teacher selection, and upstream controls are needed before scaling distillation.

Sources

Anthropic Says It's Trivially Easy To Poison LLMs Into Spitting Out Gibberish
BeauHD 2025.10.10 82% relevant
Anthropic (with the UK AI Security Institute) demonstrates that a tiny amount of 'subliminal' poisoned training data (≈250 documents) can encode a backdoor so that the trigger 'SUDO' yields gibberish in GPT‑3.5, Llama 3.1, and Pythia—directly supporting the claim that upstream data can transmit misaligned behaviors that survive downstream use.
Links for 2025-07-24
Alexander Kruel 2025.07.24 100% relevant
Anthropic’s 'Subliminal learning: LLMs transmit behavioral traits via hidden signals in data' report.
← Back to All Ideas