Errorquake-10k Benchmark Scores 10,000 LLM Responses on 0-4 Severity Scale

Errorquake-10k scores 10,000 LLM responses on a 0-4 severity scale, spanning eight domains and five difficulty tiers. Most benchmarks collapse errors into a single rate; this study argues that mistakes are not equally harmful, since a wrong date and a fabricated court ruling sit at opposite ends of the scale. At matched accuracy (|Δε| < 0.05), 85 of the 210 pairs formed from the 21 evaluated models have disjoint 95% bootstrap confidence intervals for the severity-distribution index b. The pair deepseek-v3.2 and ministral-14b illustrates the gap (ε = 0.586, Δb = 0.47).

A three-rater, 519-item human validation study yields ICC(2,k = 3) = 0.85, an LLM-judge ranking correlation of ρ = 0.89, and a dense-model scaling correlation of ρₛ = -0.86. The authors prove a Non-Reducibility Theorem (I(b; model | ε) = 1.56 bits), indicating that severity profiles carry information beyond error rates: 64.5% of the cross-model variance in b is unexplained by ε. A taxonomy of error mechanisms (κ = 0.83) shows that error type shifts with severity: low-severity errors are mostly retrieval-type slips (71%), while fabrications account for 39% of high-severity errors, and this composition differs significantly by model size (p < 0.0001).
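To make the reliability figure concrete, here is a minimal sketch of the average-measures ICC(2,k) in its standard two-way random-effects form. The function name `icc2k` and the toy ratings are illustrative assumptions; the article does not describe the authors' actual computation.

```python
def icc2k(ratings):
    """Average-measures ICC(2,k) for an items x raters table of scores."""
    n = len(ratings)       # number of rated items
    k = len(ratings[0])    # number of raters
    grand = sum(sum(row) for row in ratings) / (n * k)
    row_means = [sum(row) / k for row in ratings]
    col_means = [sum(row[j] for row in ratings) / n for j in range(k)]

    # Two-way ANOVA sums of squares: items (rows), raters (columns), residual.
    ss_rows = k * sum((m - grand) ** 2 for m in row_means)
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)
    ss_total = sum((x - grand) ** 2 for row in ratings for x in row)
    ss_err = ss_total - ss_rows - ss_cols

    ms_rows = ss_rows / (n - 1)
    ms_cols = ss_cols / (k - 1)
    ms_err = ss_err / ((n - 1) * (k - 1))
    return (ms_rows - ms_err) / (ms_rows + (ms_cols - ms_err) / n)


# Hypothetical severity scores (0-4) from three raters on five items.
toy_ratings = [
    [0.5, 1.0, 0.5],
    [3.5, 3.0, 4.0],
    [2.0, 2.5, 2.0],
    [1.0, 1.0, 1.5],
    [3.0, 3.5, 3.0],
]
print(round(icc2k(toy_ratings), 3))
```

Values near 1 mean the averaged rating of the three raters is highly reproducible; the reported 0.85 would sit in the range usually read as good agreement.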

We introduce Errorquake-10k, a 10,000-query benchmark scoring each response on a continuous 0-4 severity scale across 8 domains and 5 difficulty tiers, and we fit per-model severity distributions for 21 open-weight models. For each model we estimate a severity distribution index (b, the Gutenberg-Richter upper-tail slope) with 95% bootstrap confidence intervals. Headline: across the 210 model pairs, 85 have disjoint 95% b confidence intervals at matched accuracy (|Δε| < 0.05) on human-consensus scoring, e.g. deepseek-v3.2 vs. ministral-14b at ε = 0.586 and Δb = 0.47. A 519-item three-rater human validation study confirms measurement reliability (ICC(2,k=3) = 0.85), validates the LLM-judge ranking (ρ = 0.89), and confirms the dense-model scaling correlation on human data (ρₛ = -0.86). We prove a Non-Reducibility Theorem showing that severity profile and error rate are informationally non-redundant (I(b; model | ε) = 1.56 bits; 64.5% of cross-model b variance is unexplained by ε).
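The headline comparison can be sketched in a few lines. The abstract does not say how b is fitted, so this example assumes the Aki maximum-likelihood estimator commonly used for Gutenberg-Richter slopes, a percentile bootstrap, and a hypothetical tail threshold; all function names and the synthetic scores are illustrative, not the authors' pipeline.

```python
import math
import random
from itertools import combinations


def gr_b_value(severities, threshold):
    """Aki-style ML slope for scores at or above `threshold` (assumed estimator)."""
    tail = [s for s in severities if s >= threshold]
    if not tail:
        raise ValueError("no scores at or above the threshold")
    mean_excess = sum(tail) / len(tail) - threshold
    if mean_excess <= 0:
        raise ValueError("tail has no spread above the threshold")
    return math.log10(math.e) / mean_excess


def bootstrap_ci(severities, threshold, n_boot=500, alpha=0.05, seed=0):
    """Percentile bootstrap interval for b, resampling responses with replacement."""
    rng = random.Random(seed)
    estimates = []
    for _ in range(n_boot):
        resample = rng.choices(severities, k=len(severities))
        try:
            estimates.append(gr_b_value(resample, threshold))
        except ValueError:
            continue  # skip degenerate resamples with an empty or flat tail
    estimates.sort()
    lo = estimates[int((alpha / 2) * len(estimates))]
    hi = estimates[int((1 - alpha / 2) * len(estimates)) - 1]
    return lo, hi


def disjoint(ci_a, ci_b):
    """True when two (lo, hi) intervals do not overlap."""
    return ci_a[1] < ci_b[0] or ci_b[1] < ci_a[0]


def count_disjoint_matched_pairs(models, max_gap=0.05):
    """models maps name -> (error_rate, (ci_lo, ci_hi)).

    Returns (pairs that are accuracy-matched AND have disjoint CIs, all pairs).
    """
    hits = 0
    for a, b in combinations(models, 2):
        (eps_a, ci_a), (eps_b, ci_b) = models[a], models[b]
        if abs(eps_a - eps_b) < max_gap and disjoint(ci_a, ci_b):
            hits += 1
    return hits, math.comb(len(models), 2)


# Synthetic severity scores for two hypothetical models, capped at the scale maximum.
rng = random.Random(1)
model_a = [min(4.0, rng.expovariate(1.0)) for _ in range(500)]
model_b = [min(4.0, rng.expovariate(2.0)) for _ in range(500)]
ci_a = bootstrap_ci(model_a, threshold=0.5)
ci_b = bootstrap_ci(model_b, threshold=0.5)
print(ci_a, ci_b, disjoint(ci_a, ci_b))
```

With 21 models, `math.comb(21, 2)` gives the 210 pairs quoted in the abstract; the reported 85 would be the first element returned by `count_disjoint_matched_pairs`.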

A severity mechanism taxonomy (κ = 0.83) reveals that error type shifts categorically with severity: low-severity errors are retrievals (71%), whereas high-severity errors are fabrications (39%), and this composition differs by model size (p < 0.0001). Severity distribution should be reported alongside accuracy; it carries discriminative information that the error rate cannot.
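The κ value is an agreement statistic for the taxonomy labels. The abstract does not say which variant was used, so the sketch below assumes two-annotator Cohen's kappa; the function name and the toy labels are hypothetical.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labelling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)
    # Observed agreement: share of items where both annotators chose the same label.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement from each annotator's marginal label frequencies.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    if expected == 1.0:
        raise ValueError("kappa is undefined when both annotators use one identical label")
    return (observed - expected) / (1 - expected)


# Hypothetical mechanism labels from two annotators on six errors.
annotator_1 = ["retrieval", "retrieval", "fabrication", "retrieval", "fabrication", "fabrication"]
annotator_2 = ["retrieval", "retrieval", "fabrication", "fabrication", "fabrication", "fabrication"]
print(round(cohen_kappa(annotator_1, annotator_2), 3))
```

Kappa discounts the agreement two annotators would reach by chance, which is why it is preferred over raw percent agreement for categorical taxonomies like this one.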

Why this matters

Do we really know how bad an LLM can get? Errorquake-10k forces us to look beyond a single error rate by scoring each of 10,000 responses on a 0-4 severity scale across eight domains and five difficulty tiers. Even at matched accuracy, the open-weight models it tests show distinct heavy-tailed severity distributions, a difference that traditional hallucination tests miss because they count every error as equal.

A misplaced date and a fabricated court ruling, for example, land on opposite ends of that scale, so risk can vary sharply even when overall correctness looks similar. For 21 models the authors fit per-model severity curves and compute a distribution index b, the Gutenberg-Richter upper-tail slope, which summarizes how quickly the frequency of errors falls off as severity rises. This could help developers prioritize safety checks, and founders might weigh model choice against potential high-impact failures.

Yet the study covers only open‑weight systems, eight domains, and a fixed query set; it remains unclear whether the same heavy‑tailed behavior holds for closed‑source offerings or in production workloads. We should treat these findings as a prompt to incorporate severity‑aware evaluation, while awaiting broader validation.
