
AI Benchmarks Flawed: Google Reveals Rating Bias

Google study: AI benchmarks ignore human disagreement; under 10 raters fail

Google’s latest study of AI evaluation methods raises a straightforward question: are we trusting too few human judgments when we compare models? The paper, released this week, scrutinizes how benchmark datasets are scored, pointing out that most tests rely on a handful of annotators, often just one to five per example. The researchers ran a massive grid of experiments, mixing different total budgets with varying numbers of raters per example, to see how stable the results really are.

Their findings suggest that the conventional approach may be masking substantial disagreement among humans, which in turn can skew perceived performance differences between systems. If the underlying human feedback is noisy, any claim of superiority becomes shaky. The data show a clear pattern: as the rater count drops below a certain threshold, the reliability of the benchmark collapses, putting current practice in doubt.
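
To make the failure mode concrete, here is a minimal simulation sketch; it is not taken from the paper, and all of its numbers (the quality gap, the rater noise level, the example count) are hypothetical. Two models separated by a small true quality gap are scored by noisy human raters, and the benchmark is rerun many times to see how often the truly better model actually comes out on top.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical setup: two models separated by a small true quality gap,
# scored on the same 200 test examples; each human rating is the
# example's true quality plus independent rater noise.
n_examples = 200
quality = rng.normal(0.6, 0.15, n_examples)  # per-example quality for model A
TRUE_GAP = 0.02                              # model A is truly better by this much
RATER_NOISE = 0.5                            # spread of individual human judgments

def a_wins(k):
    """Rerun the benchmark with k fresh raters per example and report
    whether model A (the truly better one) comes out ahead."""
    noise = lambda: rng.normal(0, RATER_NOISE, (n_examples, k))
    score_a = (quality[:, None] + noise()).mean()
    score_b = (quality[:, None] - TRUE_GAP + noise()).mean()
    return score_a > score_b

for k in [1, 3, 5, 10, 25]:
    wins = sum(a_wins(k) for _ in range(300))
    print(f"{k:>2} raters/example: correct ranking in {wins / 300:.0%} of reruns")
```

In this toy setup the correct ranking shows up in only about two thirds of reruns with a single rater per example, but becomes nearly deterministic past ten raters, the same qualitative collapse the study describes.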

All told, they tested thousands of combinations of total annotation budget and rater count per example, and fewer than ten raters per example simply isn't cutting it. The typical one to five raters per test example often aren't enough to make model comparisons reproducible, according to the study.

For statistically reliable results that actually capture the range of human opinion, you generally need more than ten raters per example. The experiments also show that reliable results can often be achieved with around 1,000 total annotations, but only if the budget is split correctly between test examples and raters.
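
The budget-split trade-off can be sketched in the same spirit. Assume, hypothetically, that each test item has a true human approval rate, that the benchmark scores a model by majority vote per item, and that a fixed budget of 1,000 annotations is divided into items times raters; none of these modeling choices come from the paper itself.

```python
import numpy as np

rng = np.random.default_rng(1)
BUDGET = 1000  # total annotations, the figure highlighted by the study

def rmse_for_split(k, reruns=400):
    """Benchmark error when BUDGET is split into (BUDGET // k) items,
    each judged by k raters with a majority vote per item."""
    n_items = BUDGET // k
    errors = []
    for _ in range(reruns):
        # Each item has a true human approval rate; Beta(4, 3) makes
        # many items genuinely contested rather than clear-cut.
        p = rng.beta(4, 3, n_items)
        large_panel = (p > 0.5).mean()       # verdict of an unlimited panel
        votes = rng.binomial(k, p)           # approvals from k raters per item
        observed = (votes / k > 0.5).mean()  # majority-vote benchmark score
        errors.append(observed - large_panel)
    return float(np.sqrt(np.mean(np.square(errors))))

for k in [1, 2, 5, 10, 20, 50]:
    print(f"{BUDGET // k:>4} items x {k:>2} raters -> RMSE {rmse_for_split(k):.3f}")
```

With one rater per item the score is systematically biased on contested items; with fifty raters per item there are too few items and the score gets noisy. Accuracy peaks in between, which mirrors the study's point that how the budget is split matters, not just its size.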

Is ten raters enough? The study treats that figure as a bare minimum for reliable AI benchmarks, a stark contrast to the three to five evaluators most teams still use. And the roughly 1,000-annotation sweet spot only holds when the split between test items and annotators is carefully balanced.

Otherwise, the numbers become noisy and model comparisons lose meaning. The paper stops short of prescribing a universal budget, though; it flags the trade-off and leaves open how organizations should reallocate resources.

Moreover, while the findings are clear that fewer than ten raters is inadequate, they stop short of confirming that ten is sufficient across all task types. It now falls to the community to decide whether to adjust evaluation pipelines, and whether such changes will translate into more trustworthy performance metrics remains to be seen.

Common Questions Answered

How many raters does Google's study suggest are needed for reliable AI benchmark evaluations?

Google's research indicates that more than ten raters per test example are necessary for statistically reliable results. The study challenges current practices of using only one to five raters, demonstrating that such limited human input fails to capture the full range of human opinion and judgment.

What did Google's research reveal about the current methods of AI model comparisons?

The study found that typical AI benchmark evaluations using just three to five raters per example are insufficient for making reproducible model comparisons. By testing thousands of combinations across different budget sizes and rater counts, the researchers discovered that around 1,000 annotations can yield stable results when the budget is carefully balanced between test examples and raters.

Why are more human raters important in AI benchmark testing?

More human raters help capture the nuanced and diverse range of human opinions and judgments when evaluating AI models. The Google study emphasizes that fewer than ten raters can introduce significant noise and unreliability into benchmark comparisons, potentially leading to misleading conclusions about AI model performance.