Researchers at Anthropic report that mechanistic interpretability uncovered distinct vector directions inside Claude corresponding to states such as 'desperation' or 'confidence', and that activating those vectors predictably shifts model behavior. If reproducible, this reframes certain LLM behaviors as manipulable internal axes rather than purely emergent, opaque outputs.
If models contain stable, nameable 'emotion' vectors, regulators, security teams, and product designers gain new leverage points for alignment, manipulation, and liability, changing how we think about control and culpability for model actions.
Alexander Kruel
2026.04.05
The post links to the transformer-circuits.pub article reporting Anthropic's mechanistic interpretability finding of 'emotion vectors' in Claude.