Study finds reasoning LLMs are more efficient but not more capable
On April 22, 2025, a team from Tsinghua University and Shanghai Jiao Tong University posted a paper that tries to cut through the buzz around "reasoning" language models. Their tests suggest that, given chain-of-thought prompts, these models burn fewer compute cycles than a typical large language model. Still, they don't seem to solve problems any better than the plain-vanilla baseline.
The authors attribute the speed-up to a tighter match between what the model was trained to do and the way the prompts are structured, not to the model suddenly getting smarter at reasoning. They are careful to say the story isn't finished: the results cover only the model sizes and data they tried, and it's unclear how things will change as systems get bigger.
Whether reinforcement learning could swing things one way or the other is still an open question. The authors plan further experiments to explore if and how RL can enhance LLM reasoning, and they note that results may shift as models and datasets grow larger.

The study specifically examines whether reinforcement learning with verifiable rewards (RLVR) helps large language models reason better, or simply makes them more efficient at repeating known solutions. It finds that RLVR improves the chance of producing a correct answer on the first try, known as pass@1, but does not unlock new capabilities.
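For readers who haven't seen the metric before, the sketch below shows the standard unbiased pass@k estimator popularized by code-generation benchmarks: sample n completions per problem, count how many are correct, and estimate the probability that at least one of k draws would have succeeded. This is a generic illustration of the metric, not the paper's evaluation harness, and the sample counts in the usage lines are invented.

```python
import numpy as np

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate from n sampled completions, c of them correct.

    Implements pass@k = 1 - C(n-c, k) / C(n, k), evaluated as a numerically
    stable running product rather than with raw binomial coefficients.
    """
    if n - c < k:
        return 1.0
    return 1.0 - np.prod(1.0 - k / np.arange(n - c + 1, n + 1))

# Hypothetical numbers: 200 samples per problem, 3 of them correct.
print(round(pass_at_k(n=200, c=3, k=1), 3))    # ~0.015 (pass@1)
print(round(pass_at_k(n=200, c=3, k=100), 3))  # ~0.877 (pass@100)
```

Pass@1 rewards getting it right on the first draw, which is why it reads as an efficiency measure; with large k, a model can score well simply by covering many guesses, which is the concern raised later in the article.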
Does efficiency equal intelligence? The study shows that reasoning-tuned language models use fewer compute cycles than their vanilla peers, yet their benchmark scores are no better than those of ordinary LLMs. In other words, the models get the job done with less effort, but they don't clearly think better.
Some critics point out that the pass@k metric, which allows hundreds of tries per problem, may inflate success rates and mask weaknesses in genuine logical reasoning. The authors acknowledge this limitation and say they will run more reinforcement-learning experiments to see whether stronger training signals can deepen reasoning, and they again note that results could shift as model sizes and data pools grow.
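To make the critics' point concrete, here is a back-of-the-envelope sketch, using an invented per-attempt success probability rather than any figure from the study, of how quickly success rates climb once a model is allowed many independent attempts at the same problem.

```python
# Illustrative only: a problem the model solves on 2% of independent attempts.
p = 0.02

for k in (1, 10, 100, 256):
    # Probability that at least one of k independent attempts succeeds.
    print(f"pass@{k} = {1 - (1 - p) ** k:.3f}")

# pass@1 = 0.020, pass@10 = 0.183, pass@100 = 0.867, pass@256 = 0.994
```

A model that almost never solves the problem in one shot still clears it nearly every time when given a couple of hundred tries, which is why a high pass@k by itself says little about reasoning quality.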
The paper received the top score at NeurIPS, but the community remains split on what the numbers really mean. It is still unclear whether scaling up will close the gap between efficiency and capability, or whether the gains will level off. For now, the evidence points to modest efficiency gains without a noticeable jump in reasoning performance.
Common Questions Answered
What did the Tsinghua‑Shanghai Jiao Tong study conclude about the capability of reasoning‑tuned LLMs compared to standard models?
The study found that reasoning-tuned language models consume fewer compute cycles on chain-of-thought prompts, but they do not achieve higher benchmark scores than vanilla large language models. In other words, they are more efficient but not more capable in raw problem-solving ability.
How does reinforcement learning with verifiable rewards (RLVR) affect the performance of large language models according to the paper?
According to the paper, RLVR improves efficiency: it raises the chance of producing a correct answer on the first try (pass@1) and reduces the compute needed to reach a solution. It does not, however, lead to superior reasoning performance or higher accuracy on standard benchmarks. The authors suggest that RLVR may simply make models repeat known solutions more efficiently.
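As a rough illustration of what "verifiable rewards" usually means in practice, the sketch below scores a completion with a programmatic check against a reference answer instead of a learned reward model. The answer format and helper names are assumptions made for this example, not details taken from the paper.

```python
import re

def extract_final_answer(completion: str) -> str | None:
    """Pull the final answer out of a completion.

    Assumes, purely for illustration, that the model was prompted to end
    with a line of the form 'Answer: <value>'.
    """
    match = re.search(r"Answer:\s*(.+)", completion)
    return match.group(1).strip() if match else None

def verifiable_reward(completion: str, reference: str) -> float:
    """Binary reward: 1.0 if the extracted answer matches the reference, else 0.0.

    The reward is 'verifiable' because correctness is checked programmatically
    rather than estimated by a learned reward model.
    """
    answer = extract_final_answer(completion)
    return 1.0 if answer is not None and answer == reference.strip() else 0.0

print(verifiable_reward("The total is small. Answer: 42", "42"))  # 1.0
print(verifiable_reward("I believe it is Answer: 41", "42"))      # 0.0
```

Because a reward like this only checks the final answer, it tends to reinforce whatever path reaches known solutions most reliably, which is consistent with the paper's reading that RLVR sharpens efficiency rather than adding new reasoning ability.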
Why might the high "pass@k" metric used in the study give a misleading impression of LLM reasoning ability?
The pass@k metric grants models hundreds of attempts per problem, which can inflate apparent success rates simply by allowing many guesses. This can mask genuine deficits in logical reasoning, making the models seem more capable than they truly are.
What future research directions do the authors propose to better understand the impact of RL on LLM reasoning?
The authors plan to conduct further experiments to see if and how reinforcement learning can enhance LLM reasoning beyond efficiency gains, especially as models and datasets scale up. They anticipate that larger models and more diverse tasks may reveal different effects of RLVR.