Weird or illegible chains‑of‑thought in reasoning models may not be the actual 'reasoning' but vestigial token patterns reinforced by RL credit assignment. These token strings can still be instrumentally useful (e.g., triggering internal passes) even if they look nonsensical to humans; removing or 'cleaning' them can slightly harm results.
This cautions policymakers and benchmark designers against mandating legible CoT as a transparency fix, since doing so may worsen performance without improving true interpretability.
1a3orn
2025.10.13
Comments cite Meta’s CWM ('successful gibberish trajectories get reinforced') and the DeepSeek‑R1 paper’s language‑consistency reward, which made CoTs cleaner but slightly reduced performance.
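A minimal toy sketch of the mechanism, assuming a REINFORCE-style outcome reward (PyTorch; the function and parameter names here are hypothetical illustrations, not the CWM or R1 training code). The correctness reward is a single scalar multiplied into every token's log-probability, so any token pattern present in successful trajectories, gibberish included, gets equal credit; an R1-style per-token consistency bonus cleans up the CoT but shifts the optimum away from whatever illegible patterns were instrumentally useful.

```python
import torch

def reinforce_loss(logprobs, outcome_reward, consistency_scores=None, lam=0.0):
    """Toy REINFORCE-style loss for one sampled CoT trajectory.

    logprobs:           (T,) log-probs of the sampled CoT tokens
    outcome_reward:     scalar, e.g. 1.0 if the final answer is correct else 0.0
    consistency_scores: optional (T,) per-token legibility score in [0, 1]
    lam:                weight of the language-consistency term (0 disables it)
    """
    # Outcome-only credit assignment: the same scalar reward multiplies every
    # token's log-prob, so gibberish tokens inside a successful trajectory are
    # reinforced exactly as much as the 'real' reasoning tokens.
    reward = torch.full_like(logprobs, outcome_reward)
    if consistency_scores is not None and lam > 0:
        # R1-style shaping (sketched, not the actual implementation): pay a
        # bonus for legible tokens. This cleans the CoT but can pull the policy
        # away from instrumentally useful illegible patterns, costing accuracy.
        reward = reward + lam * consistency_scores
    return -(reward * logprobs).sum()

# Toy usage with dummy values:
lp = torch.log(torch.rand(5))                   # log-probs of 5 sampled tokens
loss = reinforce_loss(lp, outcome_reward=1.0)   # all 5 tokens credited equally
```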