Gemini 3 Pro leads AI reliability benchmark, yet hallucination rates stay high

Gemini 3 Pro has just topped a newly released AI reliability benchmark, leading its peers by a wide margin in raw accuracy. The test, designed to gauge how consistently large language models stick to the facts, shows Google's model scoring higher than any other entrant. Yet the headline numbers hide a persistent problem: hallucinations.

Across the suite of models evaluated, the most common shortfall was generating information that isn’t grounded in reality, a flaw that dragged down overall performance. Gemini 3 Pro’s lead is notable, but it doesn’t erase the fact that every model, including the winner, still produces a sizable share of erroneous output. The researchers point to a clear pattern—bigger models tend to perform better on the benchmark—while also flagging that the high rate of hallucinations remains the primary weakness.

The researchers interpret this as evidence of the model's large scale, since accuracy on the benchmark correlates strongly with model size, and they found that the poor results across the board stem largely from high hallucination rates. Gemini 3 Pro achieved the highest overall accuracy at 53 percent, far ahead of previous leaders GPT‑5.1 (high) and Grok 4, both at 39 percent.

But the model still showed an 88 percent hallucination rate, matching Gemini 2.5 Pro and Gemini 2.5 Flash and exceeding GPT‑5.1 (high) at 81 percent and Grok 4 at 64 percent. Artificial Analysis concluded that while Gemini 3 Pro demonstrates greater factual coverage, its tendency to give wrong answers rather than admit uncertainty remains unchanged.
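One way to reconcile a 53 percent accuracy with an 88 percent hallucination rate is to read the hallucination rate as the share of non-correct responses where the model guessed rather than declined, which matches the article's description of giving wrong answers rather than admitting uncertainty. The short Python sketch below illustrates that reading; the metric definition and the question split are illustrative assumptions, not Artificial Analysis's published methodology.

```python
# Illustrative sketch (assumption, not the benchmark's confirmed definition):
# a 53% accuracy and an 88% "hallucination rate" can coexist if the
# hallucination rate counts wrong answers among the questions the model
# did NOT answer correctly.

def reliability_metrics(correct: int, wrong: int, abstained: int) -> dict:
    """Compute accuracy over all questions and a hallucination rate over non-correct responses."""
    total = correct + wrong + abstained
    not_correct = wrong + abstained
    return {
        "accuracy": correct / total,  # share of all questions answered correctly
        "hallucination_rate": wrong / not_correct if not_correct else 0.0,
    }

# Hypothetical split of 1,000 questions chosen to roughly match the figures
# quoted in the article (53% accuracy, ~88% hallucination rate).
metrics = reliability_metrics(correct=530, wrong=414, abstained=56)
print(metrics)  # {'accuracy': 0.53, 'hallucination_rate': ~0.88}
```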

Related Topics: #AI #LLM #Gemini 3 Pro #GPT‑5.1 #Grok 4 #hallucination rates #benchmark

Gemini 3 Pro tops the new Omniscience Index, scoring 13 points. Yet the benchmark paints a mixed picture. While its accuracy outpaces Claude 4.1 Opus, GPT‑5.1 and Grok 4, the overall landscape remains troubling, as most of the 40 models tested failed to achieve a positive score.
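The article does not spell out how the 13-point score is derived. A plausible reading, sketched below, is an index that rewards correct answers and penalizes confidently wrong ones, so a model can post 53 percent accuracy yet land only slightly above zero. The formula is an assumption for illustration, not the published Omniscience Index definition.

```python
# Hedged sketch of a reliability index that rewards correct answers and
# penalizes hallucinated (wrong) ones, with abstentions treated as neutral.
# This is an assumed scoring rule for illustration, not Artificial Analysis's
# confirmed methodology.

def omniscience_style_index(pct_correct: float, pct_wrong: float) -> float:
    """Index in [-100, 100]: correct answers add, wrong answers subtract."""
    return pct_correct - pct_wrong

# With ~53% correct and ~40% wrong (the rest abstained), the score lands near 13,
# consistent with the article's figures. A model that answers incorrectly more
# often than correctly scores below zero, which would explain why most of the
# 40 models tested failed to post a positive score.
print(omniscience_style_index(53.0, 40.0))  # 13.0
```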

Researchers link Gemini’s lead to its scale, noting that larger models tend to perform better on factual tasks. However, the study underscores that hallucinations continue to dominate error profiles. Even Gemini 3 Pro, despite its higher accuracy, exhibits a hallucination rate that the authors describe as a “main weakness.” The persistence of these errors raises questions about the practical reliability of current large language models.

Without clear evidence that hallucination frequencies are decreasing, the utility of even top‑scoring systems remains uncertain. Because the Omniscience Index measures only factual correctness, a high score does not guarantee that Gemini 3 Pro will perform reliably across diverse real‑world tasks where context and nuance matter. Further work will be needed to determine whether gains from scale translate into genuinely trustworthy outputs.

Common Questions Answered

What accuracy percentage did Gemini 3 Pro achieve on the AI reliability benchmark, and how does it compare to GPT‑5.1 and Grok 4?

Gemini 3 Pro reached an overall accuracy of 53 percent, which is significantly higher than the 39 percent recorded by both GPT‑5.1 (high) and Grok 4. This gap demonstrates Gemini 3 Pro’s lead in factual correctness among the evaluated models.

Despite its top score, what hallucination rate did Gemini 3 Pro exhibit in the benchmark?

Gemini 3 Pro still recorded an 88 percent hallucination rate, the highest among the leading models and on par with Gemini 2.5 Pro and Gemini 2.5 Flash. The figure reflects the model's tendency to give a wrong answer rather than admit uncertainty, and it remains the primary weakness even for the highest‑scoring model.

How does model size relate to performance on the new Omniscience Index according to the researchers?

The researchers found a strong correlation between model size and benchmark accuracy, noting that larger models like Gemini 3 Pro tend to achieve higher factual scores. This relationship helped explain why Gemini 3 Pro topped the Omniscience Index with 13 points.

What overall trend did the study observe across the 40 models tested in terms of positive scores?

The study reported that most of the 40 evaluated models failed to achieve a positive score on the reliability benchmark, highlighting a widespread issue with hallucinations. Only a few models, including Gemini 3 Pro, managed to post a net positive result.