
Errorquake-10k Benchmark Scores 10,000 LLM Responses on...

Q: What is the 0-4 severity scale used in the Errorquake-10k benchmark?

The Errorquake-10k benchmark scores LLM responses on a continuous 0-4 severity scale that distinguishes between different types of errors based on their impact and consequences. This approach moves beyond simple accuracy metrics by recognizing that not all mistakes are equal—for example, a model misremembering a date is fundamentally different from one fabricating false medical information.

Q: How does the Errorquake-10k benchmark differ from traditional accuracy metrics?

Unlike traditional accuracy metrics that flatten all mistakes into a single number, Errorquake-10k evaluates 10,000 queries across eight domains and five difficulty tiers using a severity scale. The benchmark reveals that 64.5% of the variation in how severely a model fails is independent of how often it fails, demonstrating that a model with acceptable average accuracy can still harbor a dangerously heavy tail of severe fabrications.

Q: What does the Gutenberg-Richter upper-tail slope (b value) measure in this benchmark?

The Gutenberg-Richter upper-tail slope, b, is borrowed from the Gutenberg-Richter law for earthquake magnitudes and describes how quickly errors thin out as severity rises. The benchmark estimates it, with 95% bootstrap confidence intervals, from the severity distribution of each of 21 open-weight models. It characterizes the tail behavior of a model's errors rather than its average performance.

Q: What does the Non-Reducibility Theorem reveal about LLM error patterns?

The Non-Reducibility Theorem, central to the Errorquake-10k findings, states that a model's severity profile and its error rate are informationally non-redundant: 64.5% of the cross-model variance in the severity slope b is unexplained by error rate. This means that reducing the number of errors a model makes does not necessarily reduce the severity of the errors it does make, challenging the assumption that improving accuracy alone will prevent catastrophic failures.

Q: Why is reporting only accuracy insufficient for evaluating LLM safety?

Reporting accuracy alone obscures critical differences in how models fail, potentially hiding dangerous patterns of severe fabrications behind acceptable average scores. The Errorquake-10k benchmark shows that two models with similar accuracy rates can have vastly different error severity distributions, with one failing only on trivial retrievals while another produces severe hallucinations in critical domains.

By AI Daily Post • Edited by Brian Petersen, Editor-in-Chief

June 5, 2026 • Updated: July 4, 2026 • 5 min read

Accuracy alone is a lie. It flattens every mistake into a single number, hiding the gulf between a model misremembering a date and one concocting a false patient history. The Errorquake-10k benchmark shatters that illusion.

With 10,000 queries scored on a continuous 0–4 severity scale across eight domains and five difficulty tiers, it forces a reckoning. For each of 21 open-weight models, the authors fit a severity distribution and compute a Gutenberg-Richter upper-tail slope, \( b \), with 95% bootstrap confidence intervals. The headline is damning: of the 210 model pairs, 85 have entirely disjoint 95% confidence intervals for \( b \) at accuracy matched within a hair (\( |\Delta\epsilon| < 0.05 \)). A 519-item, three-rater human validation study backs the measurement: inter-rater reliability is ICC(2,k=3) = 0.85, the LLM-judge ranking is validated at rho = 0.89, and the dense-model scaling correlation holds on human data at rho_s = -0.86.
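To make the slope \( b \) concrete, here is a minimal sketch of how a Gutenberg-Richter-style slope and a bootstrap confidence interval could be estimated from one model's severity scores. It is not the paper's fitting procedure, which the article does not describe. It assumes an exponential tail above a threshold, uses the Aki maximum-likelihood estimator from seismology, and ignores the upper bound of the 0–4 scale. The names `estimate_b`, `bootstrap_ci`, `S_MIN`, and `TRUE_B` are illustrative, and the scores are synthetic.

```python
import math
import random


def estimate_b(severities, s_min):
    """Aki-style maximum-likelihood estimate of a Gutenberg-Richter slope.

    Assumes the number of errors with severity >= s decays as 10 ** (-b * s)
    for s >= s_min, and ignores the upper bound of the 0-4 scale.
    """
    tail = [s for s in severities if s >= s_min]
    if not tail:
        raise ValueError("no observations at or above s_min")
    mean_excess = sum(tail) / len(tail) - s_min
    if mean_excess <= 0:
        raise ValueError("tail has no spread above s_min")
    return math.log10(math.e) / mean_excess


def bootstrap_ci(severities, s_min, n_boot=500, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for the slope estimate."""
    rng = random.Random(seed)
    n = len(severities)
    estimates = []
    for _ in range(n_boot):
        resample = [severities[rng.randrange(n)] for _ in range(n)]
        try:
            estimates.append(estimate_b(resample, s_min))
        except ValueError:
            continue  # skip degenerate resamples
    estimates.sort()
    m = len(estimates)
    low = estimates[int(alpha / 2 * m)]
    high = estimates[min(m - 1, int((1 - alpha / 2) * m))]
    return low, high


# Synthetic severity scores for one hypothetical model (not benchmark data):
# an exponential tail above S_MIN with true slope TRUE_B, capped at 4.
S_MIN = 1.0
TRUE_B = 1.0
rng = random.Random(42)
scores = [min(4.0, S_MIN + rng.expovariate(TRUE_B * math.log(10)))
          for _ in range(1000)]

b_hat = estimate_b(scores, S_MIN)
ci_low, ci_high = bootstrap_ci(scores, S_MIN)
print(f"b = {b_hat:.2f}, 95% CI [{ci_low:.2f}, {ci_high:.2f}]")
```

Under this parameterization a smaller \( b \) means a heavier tail: severe errors make up a larger share of a model's mistakes.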

Then comes the proof: a Non-Reducibility Theorem, showing severity profile and error rate are informationally non-redundant. Sixty-four point five percent of cross-model variance in \( b \) remains unexplained by error rate. A severity mechanism taxonomy (kappa = 0.83) exposes the categorical shift: low-severity failures are retrievals (71%); high-severity are fabrications (39%), and composition depends on model scale (p < 0.0001).

Error rate is a whisper. Severity distribution is the roar.

We introduce Errorquake-10k, a 10,000-query benchmark scoring each response on a continuous 0-4 severity scale across 8 domains and 5 difficulty tiers, and we fit per-model severity distributions for 21 open-weight models. For each model we estimate a severity distribution index (b, the Gutenberg-Richter upper-tail slope) with 95% bootstrap confidence intervals. Headline: across the 210 model pairs, 85 have disjoint 95% b confidence intervals at matched accuracy (|Delta epsilon| < 0.05) on human-consensus scoring, e.g. ministral-14b at epsilon = 0.586 and Delta b = 0.47. A 519-item three-rater human validation study confirms measurement reliability (ICC(2,k=3) = 0.85), validates the LLM-judge ranking (rho = 0.89), and confirms the dense-model scaling correlation on human data (rho_s = -0.86). We prove a Non-Reducibility Theorem showing that severity profile and error rate are informationally non-redundant (I(b; model | epsilon) = 1.56 bits; 64.5% of cross-model b variance is unexplained by epsilon).

A severity mechanism taxonomy (kappa = 0.83) reveals that error type shifts categorically with severity: low-severity errors are retrievals (71%); high-severity errors are fabrications (39%) -- and this composition differs by model size (p < 0.0001). Severity distribution should be reported alongside accuracy; it carries discriminative information that the error rate cannot.

ERRORQUAKE: Heavy-Tailed Error Severity Distributions in Open-Weight Large Language Models - ArXiv Machine Learning

The Errorquake-10k benchmark does not merely rank models. It dismantles the assumption that a single error rate tells the whole story. A model with a passable accuracy score can harbor a dangerously heavy tail of severe fabrications. Another, with a slightly worse average, might fail only on trivial retrievals.

The Non-Reducibility Theorem makes this explicit: 64.5% of the cross-model variance in the severity slope \( b \) is not explained by error rate. Reporting accuracy alone is like judging a building’s safety by its height while ignoring the fault line beneath.
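The matched-accuracy comparison behind the 85-of-210 headline can be sketched as a simple pairwise check: take every pair of models, keep those whose error rates differ by less than 0.05, and flag the pairs whose confidence intervals for \( b \) do not overlap. This is an illustration only. The `ModelFit` record, the function names, and the four example models with their numbers are hypothetical, not values from the benchmark.

```python
import math
from itertools import combinations
from typing import NamedTuple


class ModelFit(NamedTuple):
    name: str
    error_rate: float  # epsilon
    b_low: float       # lower end of the 95% interval for b
    b_high: float      # upper end of the 95% interval for b


def intervals_disjoint(x: ModelFit, y: ModelFit) -> bool:
    """True when the two confidence intervals for b do not overlap."""
    return x.b_high < y.b_low or y.b_high < x.b_low


def matched_disjoint_pairs(fits, max_gap=0.05):
    """Pairs with matched error rate but disjoint intervals for b."""
    return [
        (x.name, y.name)
        for x, y in combinations(fits, 2)
        if abs(x.error_rate - y.error_rate) < max_gap
        and intervals_disjoint(x, y)
    ]


# 21 models give C(21, 2) = 210 unordered pairs, matching the article.
n_pairs = math.comb(21, 2)

# Hypothetical fits, invented for illustration.
fits = [
    ModelFit("model_a", 0.58, 0.60, 0.80),
    ModelFit("model_b", 0.60, 1.00, 1.30),
    ModelFit("model_c", 0.59, 0.75, 1.05),
    ModelFit("model_d", 0.30, 1.50, 1.80),
]

flagged = matched_disjoint_pairs(fits)
print(n_pairs, flagged)
```

Here only `model_a` and `model_b` are flagged: they fail about equally often, yet their severity tails are statistically distinguishable. `model_c` overlaps both of them, and `model_d` is not matched on error rate with any other model.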

The data is stark. Low-severity errors are dominated by retrieval failures (71%): the model simply didn’t find the right fact. Among high-severity errors, 39% are fabrications, confident inventions that poison the well. And the composition of these errors shifts with model size (p < 0.0001), a categorical difference that accuracy metrics flatten into noise.

The severity distribution index, \( b \), captures this heavy-tailed reality. It is not a footnote to accuracy; it is a separate, irreducible dimension of model behavior.

For practitioners, the implication is immediate: a model that scores well on a multiple-choice test may be catastrophically unreliable in open-ended generation. For the field, the message is clear. The error rate is a single number. The severity distribution is a fingerprint.

With 64.5% of the cross-model variance in \( b \) unexplained by error rate, the two quantities share only 35.5% of their variance. The rest is unique, discriminative, and actionable.
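The 35.5% figure is the complement of the 64.5% reported in the abstract. Assuming the 64.5% is the unexplained share of variance when \( b \) is regressed on \( \epsilon \) across models (the article does not spell out the estimator), and writing \( R^2 \) for the explained share:

\[
1 - R^2 = 0.645 \quad\Longrightarrow\quad R^2 = 1 - 0.645 = 0.355
\]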

Stop asking only how often a model errs. Start asking how badly it errs when it does.
