
Gemini 3 Pro tops trust, ethics, safety at 69% vs 16% for Gemini 2.5


The latest blind tests put trust, ethics and safety front and center, asking a broad cross-section of users to judge two generations of the same model family. When the results came in, the gap between the newer release and its predecessor was stark: Gemini 3 Pro claimed the top spot far more often than Gemini 2.5 Pro, and it did so across every demographic slice the study examined.

Trust was one of four key areas the evaluation covered, and the newer system took first place in three of them, including performance and reasoning. Those numbers aren't just a footnote; they point to a shift in how researchers measure real-world reliability versus traditional academic benchmarks. If you're watching AI's progress, the contrast between 69% and 16% tells a story worth unpacking.


Gemini 3 now ranks number one overall in trust, ethics and safety 69% of the time across demographic subgroups, compared to its predecessor, Gemini 2.5 Pro, which held the top spot only 16% of the time. Overall, Gemini 3 ranked first in three of four evaluation categories: performance and reasoning; interaction and adaptiveness; and trust and safety. It lost only on communication style, where DeepSeek V3 topped preferences at 43%.

The HUMAINE test also showed that Gemini 3 performed consistently well across 22 different demographic user groups, spanning age, sex, ethnicity and political orientation. The evaluation also found that users are now five times more likely to choose the model in head-to-head blind comparisons. But the ranking matters less than why it won.
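For readers who want the headline arithmetic made concrete, here is a minimal sketch of how a "ranked #1 across subgroups" rate like 69% could be tallied from per-subgroup blind-test winners. The subgroup labels, win counts and aggregation rule are illustrative assumptions, not HUMAINE's published data or methodology.

```python
# A minimal sketch, on toy data, of tallying how often each model
# ranks #1 across demographic subgroups. Labels and counts are
# hypothetical, not Prolific's actual HUMAINE results.
from collections import Counter

# Hypothetical: the model that won blind head-to-head preference
# votes in each of 22 demographic subgroups.
top_model_by_subgroup = {f"subgroup_{i}": "gemini-3-pro" for i in range(15)}
top_model_by_subgroup |= {f"subgroup_{i}": "gemini-2.5-pro" for i in range(15, 19)}
top_model_by_subgroup |= {f"subgroup_{i}": "deepseek-v3" for i in range(19, 22)}

wins = Counter(top_model_by_subgroup.values())
total = len(top_model_by_subgroup)
for model, count in wins.most_common():
    print(f"{model}: ranked #1 in {count}/{total} subgroups ({count / total:.0%})")
```

With these toy counts, the leading model lands at roughly 68% of subgroups, the same ballpark as the reported figure; the real study's percentages would come from its own subgroup rankings.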

"It's the consistency across a very wide range of different use cases, and a personality and a style that appeals across a wide range of different user types," Phelim Bradley, co-founder and CEO of Prolific, told VentureBeat. "Although in some specific instances, other models are preferred by either small subgroups or on a particular conversation type, it's the breadth of knowledge and the flexibility of the model across a range of different use cases and audience types that allowed it to win this particular benchmark." How blinded testing reveals what academic benchmarks miss HUMAINE's methodology exposes gaps in how the industry evaluates models.


Does a 69% trust rating guarantee broader acceptance? The Prolific study suggests Gemini 3 Pro outperforms its predecessor, ranking first in trust, ethics and safety across demographic subgroups where Gemini 2.5 Pro managed only 16%. Yet Gemini 3 took first place in only three of the four evaluation categories, losing communication style to DeepSeek V3.

Because the test focuses on real-world attributes rather than academic benchmarks, the results sidestep the usual vendor-provided scores that Google touts. Still, the full details of Prolific's blind-testing methodology remain unclear, and how the lost communication-style category weighs on the overall ranking is unknown. Moreover, the report details only relative placement, not absolute performance levels.

Consequently, while Gemini 3 Pro appears to lead in the measured dimensions, whether this translates into a consistent user experience across all scenarios is uncertain. The data underscores a shift toward evaluating AI on trust and safety, but further independent assessments will be needed to confirm the model's standing; a single study's results don't guarantee it will hold.


Common Questions Answered

How did Gemini 3 Pro perform in trust, ethics, and safety compared to Gemini 2.5 Pro?

Gemini 3 Pro secured the top spot 69% of the time across demographic subgroups, while Gemini 2.5 Pro achieved only a 16% top‑ranking rate. This stark difference highlights a significant improvement in perceived trustworthiness, ethical behavior, and safety for the newer model.

Which evaluation categories did Gemini 3 Pro win, and where did it fall short?

The model ranked first in three of the four categories: performance and reasoning, interaction and adaptiveness, and trust and safety. It lost the communication style category, where DeepSeek V3 was preferred by 43% of participants.

What does the Prolific study reveal about Gemini 3 Pro's acceptance across demographic groups?

The Prolific study found that Gemini 3 Pro consistently outperformed its predecessor across all demographic slices, achieving a 69% top-ranking rate in trust, ethics, and safety. This suggests broader acceptance and confidence among diverse user populations.

Why does the article emphasize real‑world attributes over academic benchmarks?

Because the evaluation focuses on practical factors like trust, ethics, safety, and interaction style, it sidesteps vendor‑provided academic scores that Google typically highlights. This approach aims to reflect how the models perform in everyday user scenarios rather than theoretical metrics.
