
AI Benchmarks Flawed: Google Reveals Rating Bias

Google study: AI benchmarks ignore human disagreement; under 10 raters fail

Google’s latest study of AI evaluation methods raises a straightforward question: are we trusting too few human judgments when we compare models? The paper, released this week, scrutinizes how benchmark datasets are scored, pointing out that most tests rely on a handful of annotators, often just one to five per example. The researchers ran a massive grid of experiments, mixing different total budgets with varying numbers of raters per example, to see how stable the results really are.

Their findings suggest that the conventional approach may be masking substantial disagreement among humans, which in turn can skew perceived performance differences between systems. If the underlying human feedback is noisy, any claim of superiority becomes shaky. The data show a clear pattern: as the rater count drops below a certain threshold, the reliability of the benchmark collapses, putting current practice in doubt.
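
To make the failure mode concrete, here is a minimal simulation sketch; it is not taken from the paper, and all of its numbers (the quality gap, the rater noise level, the example count) are hypothetical. Two models separated by a small true quality gap are scored by noisy human raters, and the benchmark is rerun many times to see how often the truly better model actually comes out on top.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical setup: two models separated by a small true quality gap,
# scored on the same 200 test examples; each human rating is the
# example's true quality plus independent rater noise.
n_examples = 200
quality = rng.normal(0.6, 0.15, n_examples)  # per-example quality for model A
TRUE_GAP = 0.02                              # model A is truly better by this much
RATER_NOISE = 0.5                            # spread of individual human judgments

def a_wins(k):
    """Rerun the benchmark with k fresh raters per example and report
    whether model A (the truly better one) comes out ahead."""
    noise = lambda: rng.normal(0, RATER_NOISE, (n_examples, k))
    score_a = (quality[:, None] + noise()).mean()
    score_b = (quality[:, None] - TRUE_GAP + noise()).mean()
    return score_a > score_b

for k in [1, 3, 5, 10, 25]:
    wins = sum(a_wins(k) for _ in range(300))
    print(f"{k:>2} raters/example: correct ranking in {wins / 300:.0%} of reruns")
```

In this toy setup the correct ranking shows up in only about two thirds of reruns with a single rater per example, but becomes nearly deterministic past ten raters, the same qualitative collapse the study describes.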

All told, they tested thousands of combinations of total annotation budget and rater count per example, and fewer than ten raters per example simply isn't cutting it. The typical one to five raters per test example often aren't enough to make model comparisons reproducible, according to the study.

For statistically reliable results that actually capture the range of human opinion, you generally need more than ten raters per example. The experiments also show that reliable results can often be achieved with around 1,000 total annotations, but only if the budget is split correctly between test examples and raters.
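
The budget-split trade-off can be sketched in the same spirit. Assume, hypothetically, that each test item has a true human approval rate, that the benchmark scores a model by majority vote per item, and that a fixed budget of 1,000 annotations is divided into items times raters; none of these modeling choices come from the paper itself.

```python
import numpy as np

rng = np.random.default_rng(1)
BUDGET = 1000  # total annotations, the figure highlighted by the study

def rmse_for_split(k, reruns=400):
    """Benchmark error when BUDGET is split into (BUDGET // k) items,
    each judged by k raters with a majority vote per item."""
    n_items = BUDGET // k
    errors = []
    for _ in range(reruns):
        # Each item has a true human approval rate; Beta(4, 3) makes
        # many items genuinely contested rather than clear-cut.
        p = rng.beta(4, 3, n_items)
        large_panel = (p > 0.5).mean()       # verdict of an unlimited panel
        votes = rng.binomial(k, p)           # approvals from k raters per item
        observed = (votes / k > 0.5).mean()  # majority-vote benchmark score
        errors.append(observed - large_panel)
    return float(np.sqrt(np.mean(np.square(errors))))

for k in [1, 2, 5, 10, 20, 50]:
    print(f"{BUDGET // k:>4} items x {k:>2} raters -> RMSE {rmse_for_split(k):.3f}")
```

With one rater per item the score is systematically biased on contested items; with fifty raters per item there are too few items and the score gets noisy. Accuracy peaks in between, which mirrors the study's point that how the budget is split matters, not just its size.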

Is ten raters enough? The study treats that figure as a bare minimum for reliable AI benchmarks, a stark contrast to the three to five evaluators most teams still use. And the roughly 1,000-annotation sweet spot only holds when the split between test items and annotators is carefully balanced.

Otherwise, the numbers become noisy and model comparisons lose meaning. The paper stops short of prescribing a universal budget, though; it flags the trade-off and leaves open how organizations should reallocate resources.

Moreover, while the findings are clear that fewer than ten raters is inadequate, they stop short of confirming that ten is sufficient across all task types. It now falls to the community to decide whether to adjust evaluation pipelines, and whether such changes will translate into more trustworthy performance metrics remains to be seen.

Common Questions Answered

How many raters does Google's study suggest are needed for reliable AI benchmark evaluations?

Google's research indicates that more than ten raters per test example are necessary for statistically reliable results. The study challenges current practices of using only one to five raters, demonstrating that such limited human input fails to capture the full range of human opinion and judgment.

What did Google's research reveal about the current methods of AI model comparisons?

The study found that typical AI benchmark evaluations using just three to five raters per example are insufficient for making reproducible model comparisons. By testing thousands of combinations across different budget sizes and rater counts, the researchers discovered that around 1,000 annotations can yield stable results when the budget is carefully balanced between test examples and raters.

Why are more human raters important in AI benchmark testing?

More human raters help capture the nuanced and diverse range of human opinions and judgments when evaluating AI models. The Google study emphasizes that fewer than ten raters can introduce significant noise and unreliability into benchmark comparisons, potentially leading to misleading conclusions about AI model performance.