Longer Reasoning Can Reduce Accuracy

Updated: 2025.10.13 · 2 sources
New evidence finds an inverse scaling effect in test‑time compute: extending a Large Reasoning Model's chain of thought can reduce its accuracy. This undercuts the assumption that more chain‑of‑thought tokens always improve results, and it forces product and policy decisions to weigh latency, transparency, and safety against a real accuracy tradeoff in 'reasoning' modes.
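The effect is straightforward to probe: run the same task suite at several reasoning‑token budgets and check whether accuracy falls as the budget grows. Below is a minimal sketch, assuming a hypothetical `generate(prompt, max_reasoning_tokens)` interface (any LRM API that caps chain‑of‑thought length would do); the toy stand‑in model and its numbers are purely illustrative and not taken from the cited study.

```python
# Minimal sketch of an inverse-scaling probe: score the same tasks at
# several reasoning-token budgets and look for accuracy that drops as
# the budget grows. `generate` is a hypothetical stand-in for any LRM
# API that caps chain-of-thought length; it is NOT the cited study's API.
import random
from typing import Callable

def accuracy_vs_budget(
    tasks: list[tuple[str, str]],          # (prompt, expected answer) pairs
    generate: Callable[[str, int], str],   # hypothetical: (prompt, budget) -> answer
    budgets: list[int],
) -> dict[int, float]:
    """Evaluate the model on the same tasks at each reasoning-token budget."""
    results = {}
    for budget in budgets:
        correct = sum(
            generate(prompt, budget).strip() == expected
            for prompt, expected in tasks
        )
        results[budget] = correct / len(tasks)
    return results

if __name__ == "__main__":
    # Toy stand-in model whose error rate rises with the budget, mimicking
    # the inverse-scaling pattern (illustrative only, not real data).
    def toy_generate(prompt: str, budget: int) -> str:
        rng = random.Random(f"{prompt}:{budget}")   # deterministic per (prompt, budget)
        p_correct = max(0.2, 0.9 - 0.0005 * budget)
        return "42" if rng.random() < p_correct else "17"

    tasks = [(f"question {i}", "42") for i in range(200)]
    for budget, acc in accuracy_vs_budget(tasks, toy_generate, [256, 1024, 4096]).items():
        print(f"budget={budget:5d}  accuracy={acc:.2f}")
```

With a real model behind `generate`, a monotone decline across budgets on an otherwise fixed task suite is the signature the inverse‑scaling result describes.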

Sources

Towards a Typology of Strange LLM Chains-of-Thought
1a3orn 2025.10.13 70% relevant
The post and its comments cite Meta's CWM paper and the R1 language‑consistency ablation to argue that pressuring models toward neat, legible chains of thought (or away from 'gibberish') can slightly degrade performance, paralleling the finding that more, or cleaner, test‑time reasoning does not always help and can even hurt.
Links for 2025-07-24
Alexander Kruel 2025.07.24 100% relevant
The post links the 'Inverse Scaling in Test‑Time Compute' study.