1a3orn
2025.10.13
70% relevant
The post and comments cite Meta's CWM paper and the R1 language-consistency ablation to argue that pressuring models toward neat, legible CoT (or away from "gibberish") can slightly degrade performance, paralleling findings that more or cleaner test-time reasoning does not always help and can even hurt.