
Tiny AI Model TRM Beats GPT-4o and Gemini 2.5 Pro on ARC-AGI Test


When I first read the Samsung SAIL Montreal team's paper, I was surprised to see them go against the usual playbook. While the big AI labs keep throwing ever larger models, and barely imaginable training budgets, at massive data sets, their new Tiny Recursive Model (TRM) is only a sliver of the size and power of GPT-4o or Gemini 2.5 Pro. Yet on the ARC-AGI benchmark, the little system actually outperformed those giants.

The ARC-AGI test is a collection of visual puzzles that try to gauge how well an AI can reason and adapt, much like a person would. It’s a tough nut for most language models today. The fact that TRM managed to beat them hints that clever architecture might matter more than raw scale.

Its edge seems to come from a recursive reasoning loop that lets a tiny network chip away at a problem step by step, much like working through a Sudoku grid. So perhaps size isn’t the only ticket to higher-level AI. As the researchers put it, “recursive reasoning with tiny networks can outperform large language models… using only a fraction of the compute power.”
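
To make the idea concrete, here is a minimal, illustrative sketch in PyTorch of what such a recursive loop can look like: a single small network keeps a latent "scratchpad" state and a current answer, and alternates between refining the scratchpad and refining the answer. The class name, dimensions, and update rules below are assumptions for illustration, not the authors' implementation.

import torch
import torch.nn as nn

class TinyRecursiveReasoner(nn.Module):
    # Illustrative sketch of a recursive reasoning loop; not the paper's code.
    def __init__(self, dim=128, n_inner=6, n_outer=3):
        super().__init__()
        self.n_inner = n_inner  # latent-refinement steps per cycle
        self.n_outer = n_outer  # answer-refinement cycles
        # One small network reused at every step: "tiny" plus "recursive".
        self.update_latent = nn.Sequential(
            nn.Linear(3 * dim, dim), nn.GELU(), nn.Linear(dim, dim)
        )
        self.update_answer = nn.Sequential(
            nn.Linear(2 * dim, dim), nn.GELU(), nn.Linear(dim, dim)
        )

    def forward(self, x):
        # x: encoded puzzle, shape (batch, dim)
        y = torch.zeros_like(x)  # current answer guess
        z = torch.zeros_like(x)  # latent scratchpad state
        for _ in range(self.n_outer):
            for _ in range(self.n_inner):
                # Refine the scratchpad given the puzzle, the answer, and itself.
                z = self.update_latent(torch.cat([x, y, z], dim=-1))
            # Refine the answer from the improved scratchpad.
            y = self.update_answer(torch.cat([y, z], dim=-1))
        return y

Because the same few million parameters are applied again and again, depth of reasoning comes from iteration rather than from stacking more layers, which is the intuition behind "less is more."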

The new mini-model shows that recursive reasoning with tiny networks can outperform large language models on tasks like Sudoku and the ARC-AGI test while using only a fraction of the compute power. Researchers at Samsung SAIL Montreal introduced the "Tiny Recursive Model" (TRM), a compact design that outperforms large models such as o3-mini and Gemini 2.5 Pro on complex reasoning tasks despite having just seven million parameters. By comparison, even the smallest mainstream language models typically range from 3 to 7 billion parameters.

According to the study "Less is More: Recursive Reasoning with Tiny Networks," TRM reaches 45 percent on ARC-AGI-1 and 8 percent on ARC-AGI-2, outperforming much larger models on ARC-AGI-2, including o3-mini-high (3.0 percent), Gemini 2.5 Pro (4.9 percent), DeepSeek R1 (1.3 percent), and Claude 3.7 (0.7 percent). The authors say TRM achieves this with less than 0.01 percent of the parameters used in most large models. More specialized systems such as Grok-4-thinking (16.0 percent) and Grok-4-Heavy (29.4 percent) still lead the pack.


It seems we may have been missing some of the simpler tricks while chasing ever-bigger models. The fact that TRM can pull off good results with just seven million parameters hints that recursive reasoning, letting a tiny network repeatedly refine its own intermediate answer, could be a viable alternative to simply adding more weights. Large language models still win at general knowledge, but when you ask them to solve tightly structured puzzles like ARC-AGI, they quickly become costly to run.

If TRM’s lean approach holds up, we might see solid reasoning abilities on phones or edge devices without needing a constant cloud link. I’m not saying this knocks down the work behind GPT-4o, Gemini, or the like, but it does point in a different direction. The community will be watching to see whether this “small but smart” idea can move beyond a single benchmark and end up in a hybrid stack, where a compact reasoner teams up with a massive knowledge model.


Common Questions Answered

What specific test did the Tiny Recursive Model (TRM) outperform GPT-4o and Gemini 2.5 Pro on?

The Tiny Recursive Model (TRM) outperformed the larger models on the ARC-AGI test, which is a benchmark for complex reasoning tasks. This result is particularly notable because TRM achieved superior performance on this specific, structured reasoning problem despite its compact size.

How does the parameter count of the TRM model compare to the large language models it outperformed?

The TRM model has a remarkably small size of just seven million parameters, which is a tiny fraction of the parameter count found in giants like GPT-4o and Gemini 2.5 Pro. This stark contrast in scale makes its superior performance on reasoning tasks a significant challenge to the prevailing belief that bigger models are inherently smarter.

What is the key architectural feature that enables the TRM's efficiency according to the Samsung SAIL Montreal team?

The key architectural feature is recursive reasoning, which allows the small network to repeatedly process information to solve complex problems. This approach is presented as a powerful alternative to simply scaling up the number of parameters, focusing on efficient computation rather than raw model size.

What is the main implication of TRM's success for the future development of AI models?

The success of TRM suggests that researchers may have been overlooking simpler architectural approaches in the race to build ever-larger models. It highlights the potential for developing highly efficient AI that excels at specific reasoning tasks without the staggering computational cost associated with massive models.