Gibberish CoT as RL spandrel

Updated: 2025.10.13 · 1 source
Weird or illegible chains-of-thought in reasoning models may not be the model's actual 'reasoning' but vestigial token patterns reinforced by RL credit assignment: a spandrel, in the evolutionary sense of a byproduct that persists because it rides along with selected traits. These strings can still be instrumentally useful (e.g., by triggering extra internal passes) even if they look nonsensical to humans, so removing or 'cleaning' them can slightly hurt results. This cautions policymakers and benchmark designers against mandating legible CoT as a transparency fix, since doing so may degrade performance without improving true interpretability.
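A minimal sketch of the mechanism, under two labeled assumptions (neither comes from the source's training code): REINFORCE-style outcome-only updates hand the same scalar credit to every sampled token, and each CoT token buys one extra internal pass, so success depends on how many tokens were emitted rather than on whether they are legible. All token names and rewards are hypothetical.

```python
import math
import random

# Hypothetical toy, not any lab's training code. "think" is a legible
# filler token, "zxqv" is gibberish, "STOP" ends the chain-of-thought.
TOKENS = ["think", "zxqv", "STOP"]
logits = {t: 0.0 for t in TOKENS}

def probs():
    z = sum(math.exp(v) for v in logits.values())
    return {t: math.exp(v) / z for t, v in logits.items()}

def rollout(max_len=8):
    actions = []
    while True:
        p = probs()
        actions.append(random.choices(TOKENS, weights=[p[t] for t in TOKENS])[0])
        if actions[-1] == "STOP" or len(actions) >= max_len:
            break
    # Assumption: each CoT token buys an extra internal pass, so the
    # success probability depends only on length, never on legibility.
    n_cot = sum(1 for t in actions if t != "STOP")
    p_correct = min(0.9, 0.1 + 0.1 * n_cot)
    reward = 1.0 if random.random() < p_correct else 0.0
    return actions, reward

def update(actions, reward, lr=0.05):
    # Outcome-only credit assignment: the same scalar reward hits every
    # sampled token, so gibberish in successful runs is reinforced just
    # as much as legible text.
    for tok in actions:
        p = probs()
        for t in TOKENS:
            logits[t] += lr * reward * ((1.0 if t == tok else 0.0) - p[t])

random.seed(1)
for _ in range(5000):
    actions, reward = rollout()
    update(actions, reward)

print(probs())  # "STOP" is suppressed; "think" vs "zxqv" is arbitrary
```

Nothing in this objective selects for legibility: whichever filler token ends up dominating is an accident of sampling history, which is what makes the surviving pattern a spandrel rather than reasoning.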

Sources

Towards a Typology of Strange LLM Chains-of-Thought
1a3orn · 2025.10.13 · 100% relevant
Comments cite Meta's CWM ('successful gibberish trajectories get reinforced') and the R1 paper's language-consistency reward, which made CoTs cleaner but slightly reduced performance; a sketch of that reward shape follows below.
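A hedged sketch of the reward shape the R1 paper describes: the language-consistency term is the proportion of target-language words in the CoT, summed directly with task accuracy, so legibility competes with correctness at the margin. Function names and the language check are illustrative assumptions, not the paper's code.

```python
# Hypothetical names; only the reward shape follows the R1 description.

def language_consistency(cot_words: list[str], in_target_lang) -> float:
    """Fraction of CoT words judged to be in the target language."""
    if not cot_words:
        return 0.0
    return sum(1 for w in cot_words if in_target_lang(w)) / len(cot_words)

def total_reward(answer_correct: bool, cot_words: list[str], in_target_lang) -> float:
    # Direct sum: the legibility term trades off against accuracy,
    # consistent with the cited report of a slight performance drop.
    return float(answer_correct) + language_consistency(cot_words, in_target_lang)

# Crude ASCII check as a stand-in for real language identification.
is_english = lambda w: w.isascii()
print(total_reward(True, ["so", "therefore", "渐"], is_english))  # ≈ 1.67
```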