Researchers found that models rated as safer tended to become more cautious the longer a single conversation continued, whereas riskier models could escalate or reinforce dangerous beliefs over time. This session‑level dynamic means a model's immediate reply is not the whole story — safety can change across a chat.
— If safety changes over the course of a conversation, regulators, deployers, and clinicians must evaluate and monitor models in multi‑turn settings, not just single prompts.
BeauHD
2026.04.24
100% relevant
The arXiv preprint tested five commercial LLMs (GPT‑4o/5.2, Grok 4.1, Gemini 3 Pro, Claude Opus 4.5) and observed that higher‑scoring models increased caution as chats progressed while Grok and Gemini were worst at escalating delusional content.
← Back to All Ideas