Researchers at Anthropic report that mechanistic interpretability uncovered distinct vector directions inside Claude corresponding to states such as 'desperation' or 'confidence', and that activating those vectors predictably shifts model behavior. If reproducible, this reframes certain LLM behaviors as manipulable internal axes rather than purely emergent, opaque outputs.
If models contain stable, nameable 'emotion' vectors, regulators, security teams, and product designers gain new leverage points for alignment, manipulation, and liability, changing how we think about control and culpability for model actions.
Alexander Kruel
2026.04.05
The post links to the transformer-circuits.pub article reporting Anthropic's mechanistic interpretability finding of 'emotion vectors' in Claude.