Correlated errors cut panel accuracy 8‑22 points; top judge matches panel
LLM‑as‑a‑judge panels promise to smooth out the quirks of any single model by pooling judgments from several front‑running systems. The idea is simple: if nine different models each cast a vote, the aggregate should be more trustworthy than any one alone. Guneet Kohli’s new study puts that assumption to the test.
By building a framework that quantifies the true informational contribution of each vote, the research measures how far real-world panels fall short of the ideal of independent voting. The experiment runs nine frontier LLMs, spanning seven model families, against three natural-language-inference benchmarks that carry a hundred human annotations per item. The result is stark.
Roughly three‑quarters of the panel’s nominal independence evaporates because the models repeatedly stumble on the same items. In practice, the nine judges deliver only about two independent votes’ worth of signal. The findings raise a clear question: how reliable are the large‑scale evaluation panels that many researchers now treat as de‑facto standards?
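To make the "effective votes" idea concrete, here is a minimal sketch of one common way to compute it: the design-effect form of the Kish effective sample size, n / (1 + (n - 1) * rho), where rho is the average pairwise correlation between judges' errors. The function names, the toy error matrix, and the choice of this particular estimator are assumptions for illustration; the study's exact procedure may differ.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length numeric sequences.

    Raises ZeroDivisionError if either sequence is constant.
    """
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def mean_error_correlation(errors):
    """Average pairwise correlation; errors[j][i] is 1 if judge j got item i wrong."""
    pairs = list(combinations(errors, 2))
    return sum(pearson(a, b) for a, b in pairs) / len(pairs)


def effective_votes(n_judges, rho):
    """Design-effect form of the effective sample size: n / (1 + (n - 1) * rho)."""
    return n_judges / (1 + (n_judges - 1) * rho)


# Toy data (made up, not from the study): three judges, six items.
toy_errors = [
    [1, 0, 1, 0, 0, 0],
    [1, 0, 1, 0, 0, 1],
    [1, 0, 0, 0, 1, 0],
]
rho = mean_error_correlation(toy_errors)
print(effective_votes(len(toy_errors), rho))
```

Under this formula, nine judges with an average error correlation of 0.4375 work out to exactly two effective votes. That correlation value is back-solved here for illustration and is not a figure reported by the study.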
The consequences are stark: the panel's actual accuracy falls 8 to 22 percentage points short of what independent voting would achieve, and the best single judge matches or outperforms the full panel across all conditions. Neither adding more judges nor using smarter aggregation algorithms helps: established methods close at most 11% of this gap, even with access to the correct answers. The study quantifies these findings using the Kish effective sample size (n_eff) and a Condorcet null model, and shows the deficit is robust across prompt variants, temperatures, chain-of-thought reasoning, and a pairwise preference task (RewardBench). The bottleneck is correlated judges, not the aggregation algorithm, implying that scaling up panels cannot substitute for genuinely independent evaluation.
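The independent-voting baseline can be sketched with the classic Condorcet jury calculation: the probability that a strict majority of n independent judges is correct. This is a simplified illustration that assumes binary labels, an odd number of judges, and one shared accuracy p for every judge; the study's null model for NLI tasks may be set up differently, and `independence_gap` is a hypothetical helper name.

```python
from math import comb


def condorcet_accuracy(n_judges, p):
    """Probability that a strict majority of n independent judges is correct.

    Assumes binary labels, an odd n_judges, and the same accuracy p for every judge.
    """
    k_min = n_judges // 2 + 1  # smallest number of correct votes that wins
    return sum(
        comb(n_judges, k) * p**k * (1 - p) ** (n_judges - k)
        for k in range(k_min, n_judges + 1)
    )


def independence_gap(n_judges, p, observed_panel_accuracy):
    """Shortfall of an observed panel relative to the independent-voting ideal."""
    return condorcet_accuracy(n_judges, p) - observed_panel_accuracy


# Three independent judges at 80% accuracy: 3 * 0.8^2 * 0.2 + 0.8^3 = 0.896
print(condorcet_accuracy(3, 0.8))
```

Calling `independence_gap(9, p, observed)` with a panel's measured accuracy gives the kind of percentage-point deficit the study describes, under the simplifying assumptions above.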
Why this matters
Can we trust panels of LLM judges to give us a clearer picture of model quality? The study shows they often can’t. Correlated errors drag the panel’s actual accuracy down by eight to twenty‑two percentage points compared with an ideal of independent voting.
Even more striking, the single best judge matches or outperforms the entire ensemble across every tested condition. Adding more judges—or swapping in fancier aggregation schemes—doesn’t rescue the gap. For developers, this means that scaling up judge counts may not yield the reliability we assumed.
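A toy model shows why piling on judges can plateau. Assume, purely for illustration, that with probability c every judge copies one shared draw (correct with probability p) and otherwise the judges vote independently. In this model c equals the pairwise error correlation, and panel accuracy can never exceed c * p + (1 - c) however many judges vote. The model and all names below are assumptions, not the study's method, and the toy does not reproduce the finding about the best single judge.

```python
from math import comb


def condorcet_accuracy(n_judges, p):
    """Majority-vote accuracy for n independent binary judges (n odd)."""
    k_min = n_judges // 2 + 1
    return sum(
        comb(n_judges, k) * p**k * (1 - p) ** (n_judges - k)
        for k in range(k_min, n_judges + 1)
    )


def panel_accuracy(n_judges, p, c):
    """Toy common-cause model (an assumption, not taken from the study).

    With probability c all judges share a single draw that is correct with
    probability p; with probability 1 - c they vote independently.
    """
    return c * p + (1 - c) * condorcet_accuracy(n_judges, p)


# Accuracy approaches the ceiling c * p + (1 - c) as the panel grows.
for n in (3, 9, 27, 81):
    print(n, round(panel_accuracy(n, 0.8, 0.4), 4))
```

With p = 0.8 and c = 0.4 the ceiling is 0.92, so going from 9 to 81 judges buys very little, while the independent-voting ideal keeps climbing toward 1.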
Founders should reconsider investments in large‑scale evaluation pipelines that rely on sheer numbers rather than diversity of error. Researchers are left with an open question: how to break the correlation that ties these models together? Until we understand why the errors line up, our confidence in panel‑based benchmarks will stay tentative.
We’ll need new methods that either diversify model architectures or explicitly decorrelate judgments before panels can become a trustworthy tool.