
Gemini 3 Pro leads AI reliability benchmark, yet hallucination rates stay high


Gemini 3 Pro has come out on top of a fresh AI reliability benchmark, edging ahead of the pack in raw accuracy. The test was built to see how often large language models stay on factual ground, and Google's system ended up with the top score. Still, the headline numbers mask a nagging issue: hallucinations.

Across the models evaluated, the most frequent slip was inventing details with no basis in reality, which dragged the overall marks down. Gemini 3 Pro's edge is clear, but it doesn't change the fact that every model, winner included, still produces a noticeable share of wrong output. The researchers note a pattern: bigger models usually do better on this metric. Even so, they flag the high hallucination rate as the biggest weakness.


The researchers attribute the result to scale, since accuracy on the benchmark correlates strongly with model size.

Hallucination rates remain the main weakness

The study found that poor results across the board stem largely from high hallucination rates. Gemini 3 Pro achieved the highest overall accuracy at 53 percent, far ahead of previous leaders GPT‑5.1 (high) and Grok 4, both at 39 percent.

But the model still showed an 88 percent hallucination rate, matching Gemini 2.5 Pro and Gemini 2.5 Flash. GPT‑5.1 (high) and Grok 4 came in at 81 and 64 percent respectively, still high but lower than the new leader. Artificial Analysis concluded that while Gemini 3 Pro demonstrates greater factual coverage, its tendency to give wrong answers rather than admit uncertainty remains unchanged.


Gemini 3 Pro sits at the top of the new Omniscience Index with a score of 13, but the picture isn't all sunshine. Its accuracy beats Claude 4.1 Opus, GPT‑5.1 and Grok 4, yet almost all of the 40 models evaluated failed to post a positive score. The authors point to size as a likely factor: bigger models usually do better on straightforward facts.
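The article never spells out how the Omniscience Index is calculated, but the framing above (credit for correct answers, penalties for confidently wrong ones, abstention as the safe middle ground) points to some kind of net-scoring scheme. The Python sketch below is purely illustrative: the function, the weighting and the toy numbers are my assumptions, not Artificial Analysis's published methodology. It does show why a model can lead on accuracy and still barely scrape a positive score, and why most of the field ends up negative.

```python
from collections import Counter

def score_model(judgements):
    """Illustrative scoring only, NOT Artificial Analysis's published formula.

    Each judgement is "correct", "incorrect" (a hallucination), or
    "abstain" (the model admitted it did not know).
    """
    counts = Counter(judgements)
    total = sum(counts.values())
    correct = counts["correct"]
    incorrect = counts["incorrect"]
    abstain = counts["abstain"]

    accuracy = correct / total
    # Of the questions the model failed to answer correctly, how often did it
    # confidently answer wrong instead of abstaining?
    missed = incorrect + abstain
    hallucination_rate = incorrect / missed if missed else 0.0
    # Net score: reward correct answers, penalise hallucinations, treat
    # abstentions as neutral. It goes negative once wrong answers
    # outnumber right ones.
    net_score = 100 * (correct - incorrect) / total
    return accuracy, hallucination_rate, net_score


# Toy split of 100 questions, chosen to land near the reported figures.
judgements = ["correct"] * 53 + ["incorrect"] * 41 + ["abstain"] * 6
acc, hall, net = score_model(judgements)
print(f"accuracy={acc:.0%}  hallucination rate={hall:.0%}  net score={net:+.0f}")
# accuracy=53%  hallucination rate=87%  net score=+12
```

With this toy split the numbers land close to, but not exactly on, the article's 53 percent accuracy, 88 percent hallucination rate and index score of 13, a reminder that the real index almost certainly weights or grades answers differently.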

Still, hallucinations dominate the error list. Even Gemini 3 Pro, for all its accuracy, shows a hallucination rate the researchers call a “main weakness.” That lingering problem makes me wonder how reliable any of these large language models really are. We don’t have clear proof that hallucination rates are dropping, so the usefulness of even the highest-scoring systems stays fuzzy.

Since the Omniscience Index only checks factual correctness, a high number doesn’t guarantee Gemini 3 Pro will handle real-world tasks that need context or nuance. I think more work is needed to see if simply getting bigger will actually give us trustworthy output.


Common Questions Answered

What accuracy percentage did Gemini 3 Pro achieve on the AI reliability benchmark, and how does it compare to GPT‑5.1 and Grok 4?

Gemini 3 Pro reached an overall accuracy of 53 percent, which is significantly higher than the 39 percent recorded by both GPT‑5.1 (high) and Grok 4. This gap demonstrates Gemini 3 Pro’s lead in factual correctness among the evaluated models.

Despite its top score, what hallucination rate did Gemini 3 Pro exhibit in the benchmark?

Gemini 3 Pro still recorded an 88 percent hallucination rate, reflecting its tendency to give a confident wrong answer rather than admit uncertainty when it cannot answer correctly. That rate remains the primary weakness even for the highest‑scoring model.

How does model size relate to performance on the new Omniscience Index according to the researchers?

The researchers found a strong correlation between model size and benchmark accuracy, noting that larger models like Gemini 3 Pro tend to achieve higher factual scores. This relationship helped explain why Gemini 3 Pro topped the Omniscience Index with 13 points.

What overall trend did the study observe across the 40 models tested in terms of positive scores?

The study reported that most of the 40 evaluated models failed to achieve a positive score on the reliability benchmark, highlighting a widespread issue with hallucinations. Only a few models, including Gemini 3 Pro, managed to post a net positive result.