Google Gemini 3.1 Pro doubles reasoning performance in benchmark


Google’s latest Gemini 3.1 Pro arrives with a promise that feels almost too tidy: a “Deep Think Mini” that can toggle its reasoning depth at will. The model’s name suggests an incremental step, yet the company’s own numbers hint at something louder. While the tech is impressive on paper, the real test comes from how it fares on established yardsticks.

On ARC‑AGI‑2, a benchmark designed to probe a system's capacity for novel, abstract problem‑solving, Gemini 3.1 Pro posts results that dwarf its predecessor's. That jump isn't just a modest gain; it's a shift that could reshape how developers think about on‑demand reasoning. If the figures hold up, the model may finally bridge the gap between flexible, lightweight chat and the heavyweight analytical chops traditionally reserved for larger, more specialized systems.

The following excerpt pulls the numbers straight from Google’s published benchmarks, laying out just how much the reasoning performance has moved.

**Benchmark Performance: More Than Doubling Reasoning Over 3 Pro**

Google's published benchmarks tell a story of dramatic improvement, particularly in areas associated with reasoning and agentic capability. On ARC-AGI-2, a benchmark that evaluates a model's ability to solve novel abstract reasoning patterns, 3.1 Pro scored 77.1%, more than double the 31.1% achieved by Gemini 3 Pro and substantially ahead of Anthropic's Sonnet 4.6 (58.3%) and Opus 4.6 (68.8%). On Humanity's Last Exam, a rigorous academic reasoning benchmark, 3.1 Pro achieved 44.4% without tools, up from 37.5% for 3 Pro and ahead of both Claude Sonnet 4.6 (33.2%) and Opus 4.6 (40.0%).
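
For quick reference, here are the same figures side by side (all numbers from Google's published benchmarks as quoted above; the ARC-AGI-2 gain works out to roughly 2.5x, since 77.1 / 31.1 ≈ 2.48):

| Benchmark | Gemini 3.1 Pro | Gemini 3 Pro | Claude Sonnet 4.6 | Claude Opus 4.6 |
| --- | --- | --- | --- | --- |
| ARC-AGI-2 | 77.1% | 31.1% | 58.3% | 68.8% |
| Humanity's Last Exam (no tools) | 44.4% | 37.5% | 33.2% | 40.0% |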

Google's Gemini 3.1 Pro arrives with three adjustable reasoning modes, a lightweight echo of its Deep Think system. The headline claim is clear: reasoning performance more than doubles compared to Gemini 3 Pro on the ARC‑AGI‑2 benchmark. Benchmarks show dramatic improvement in abstract reasoning and agentic capability, according to Google’s published data.
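
The article doesn't detail how those modes are exposed to developers. As a sketch only: if they surface through the google-genai SDK's existing thinking-budget control, picking a reasoning depth per request might look like the snippet below. The model ID "gemini-3.1-pro", the mode names, and the token budgets are illustrative assumptions, not published values.

```python
# Sketch: choosing a reasoning depth per request with the google-genai SDK.
# Assumptions (not confirmed by the article): the "gemini-3.1-pro" model ID
# and the idea that the three modes map onto the SDK's thinking_budget knob.
from google import genai
from google.genai import types

client = genai.Client()  # reads GEMINI_API_KEY from the environment

# Hypothetical mapping of the three modes to thinking-token budgets.
REASONING_MODES = {"low": 1024, "medium": 8192, "high": 32768}

def ask(prompt: str, mode: str = "medium") -> str:
    """Send a prompt with the requested reasoning depth and return the text."""
    response = client.models.generate_content(
        model="gemini-3.1-pro",  # assumed model ID
        contents=prompt,
        config=types.GenerateContentConfig(
            thinking_config=types.ThinkingConfig(
                thinking_budget=REASONING_MODES[mode]
            )
        ),
    )
    return response.text

print(ask("Summarize the trade-off between speed and reasoning depth.", mode="high"))
```

The appeal of per-request depth is that routine calls stay cheap and fast while hard problems get the full budget, which is the flexibility the "adjustable cognition" framing implies.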

Yet the article notes that three months is a long time in AI, and competitors have not been idle. How these gains will manifest in everyday applications remains uncertain. The model’s “Deep Think Mini” label suggests a trade‑off between speed and depth, but the balance is not fully detailed.

Google’s own figures provide a snapshot, but independent verification is absent. Consequently, while the numbers are impressive on paper, the practical impact is still to be measured. The update underscores Google’s focus on adjustable cognition, but whether this approach will sustain its edge as other firms iterate is unclear.
