Training‑signal leakage creates model quirks

Updated: 2026.04.30
System prompts, fine‑tuning rewards, or personality signals intended to shape narrow behaviors can leak into a model’s general outputs, producing persistent, context‑inappropriate motifs (e.g., repeated mentions of 'goblins'). Such quirks can spread across deployments, erode trust, and require data‑level fixes or output constraint layers rather than simple prompt patches. This shows why regulators, deployers, and users need provenance, testing, and oversight not only of training data but also of internal training signals and personality reward functions.
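As a minimal sketch of what a "constraint layer" (as opposed to a prompt patch) might look like: the code below wraps a generation callable and resamples or redacts outputs containing a banned motif. The motif list, function names, and redaction strategy are illustrative assumptions, not any vendor's actual mitigation.

```python
import re

# Hypothetical banned-motif list; a leaked reward signal might
# over-produce these terms regardless of context.
BANNED_MOTIFS = ["goblin"]

def violates_motif_ban(text: str, motifs=BANNED_MOTIFS) -> bool:
    """Return True if the text contains any banned motif (word-prefix match)."""
    return any(
        re.search(rf"\b{re.escape(m)}\w*", text, re.IGNORECASE)
        for m in motifs
    )

def constrain(generate, prompt: str, max_retries: int = 3) -> str:
    """Wrap a generation callable; resample when a banned motif appears,
    then fall back to redaction if resampling keeps failing."""
    out = ""
    for _ in range(max_retries):
        out = generate(prompt)
        if not violates_motif_ban(out):
            return out
    return re.sub(r"\bgoblin\w*", "[redacted]", out, flags=re.IGNORECASE)
```

A filter like this treats the symptom at deployment time; the summary's point is that the durable fix lives upstream, in the reward signals that produced the motif.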

Sources

OpenAI Codex System Prompt Includes Explicit Directive To 'Never Talk About Goblins'
BeauHD 2026.04.30
OpenAI’s Codex CLI JSON included an explicit system‑prompt ban on 'goblins' after the company found a 'Nerdy' personality reward signal caused GPT‑5.5 to overuse creature metaphors.