BeauHD
2025.10.10
82% relevant
Anthropic (with the UK AI Security Institute) demonstrates that a small amount of poisoned training data (≈250 documents) can implant a backdoor so that the trigger 'SUDO' makes the model emit gibberish — an effect shown across GPT‑3.5, Llama 3.1, and Pythia — directly supporting the claim that upstream data can transmit misaligned behaviors that survive downstream use.