Metric-Dependent Annotation Saturation for Learning from Label Distributions

By AI Daily Post. Edited by Brian Petersen, Editor-in-Chief

June 23, 2026 • 2 min read

Why does the way we label data matter? Across five label-smoothing intensities, entropy correlation (whether the model identifies which items elicit disagreement) stays between r ≈ 0.45 and 0.49, while training on soft labels raises it to r = 0.643 (p < 0.001). A per-item analysis explains the gap: label smoothing applies the same adjustment to every item, so it cannot tell ambiguous items from clear ones.
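To make the per-item point concrete, here is a minimal sketch in Python with NumPy. It is not the paper's code: the vote counts, the smoothing value `eps=0.1`, and the helper names are all hypothetical, chosen only to show that uniform label smoothing gives every item the same target entropy while annotator-derived soft labels do not.

```python
import numpy as np

def entropy(p):
    """Shannon entropy (in nats) of each row of a probability matrix."""
    p = np.clip(p, 1e-12, 1.0)
    return -(p * np.log(p)).sum(axis=-1)

def smooth_one_hot(labels, num_classes, eps):
    """Uniform label smoothing: scale the one-hot target by (1 - eps)
    and spread eps evenly over all classes."""
    one_hot = np.eye(num_classes)[labels]
    return (1.0 - eps) * one_hot + eps / num_classes

# Hypothetical annotator vote counts for three items in a 3-class task.
votes = np.array([
    [98, 1, 1],    # near-unanimous item
    [10, 60, 30],  # moderate disagreement
    [32, 33, 35],  # highly ambiguous item
])

# Soft labels: normalize the votes of each item into a distribution.
soft = votes / votes.sum(axis=1, keepdims=True)

# Label smoothing only sees the majority class of each item.
majority = votes.argmax(axis=1)
smoothed = smooth_one_hot(majority, num_classes=3, eps=0.1)

soft_entropy = entropy(soft)          # differs from item to item
smoothed_entropy = entropy(smoothed)  # the same value for every item

print("soft-label entropy:    ", np.round(soft_entropy, 3))
print("smoothed-label entropy:", np.round(smoothed_entropy, 3))
```

Because every smoothed target is a permutation of the same vector, changing `eps` shifts all items together. That is consistent with the narrow band of correlations reported across smoothing intensities, though this toy example does not reproduce those numbers.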

The soft-label advantage holds across two architectures (DeBERTa and RoBERTa), on a baseline without NLI pretraining, and in a cross-domain test on content-safety data. The findings imply that annotation budgets should not be set uniformly; they should reflect the evaluation metric you care about. In a related vein, work from the November 15, 2022 NeurIPS workshop “I Can’t Believe It’s Not Better” introduced continuous soft pseudo-labeling for speech recognition, noting that one-hot labels force hard decision boundaries and invite overfitting.

Algorithms like slimIPL generate pseudo‑labels end‑to‑end, sidestepping the pitfalls of fixed, noisy labels. The proposed framework treats labels as mutable targets, aiming to regularize training without the brittleness of static annotations.
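The contrast between one-hot and soft targets can be sketched with the standard cross-entropy loss. This is a generic illustration, not slimIPL or the workshop paper's method. The target vectors and function names are hypothetical. It relies on the standard result that the gradient of cross-entropy with respect to the logits is softmax(logits) minus the target.

```python
import numpy as np

def log_softmax(logits):
    """Numerically stable log-softmax over the last axis."""
    shifted = logits - logits.max(axis=-1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))

def cross_entropy(logits, target):
    """Cross-entropy between a target distribution and softmax(logits)."""
    return -(target * log_softmax(logits)).sum(axis=-1)

def grad_wrt_logits(logits, target):
    """Gradient of the cross-entropy with respect to the logits:
    softmax(logits) - target (valid when target sums to 1)."""
    return np.exp(log_softmax(logits)) - target

hard = np.array([1.0, 0.0, 0.0])  # one-hot target
soft = np.array([0.6, 0.3, 0.1])  # hypothetical soft target

# Logits whose softmax reproduces the soft target exactly.
matched_logits = np.log(soft)

soft_grad = grad_wrt_logits(matched_logits, soft)  # zero: nothing left to fit
hard_grad = grad_wrt_logits(matched_logits, hard)  # non-zero: asks for more confidence

print("gradient against soft target:", np.round(soft_grad, 3))
print("gradient against hard target:", np.round(hard_grad, 3))
```

With a soft target the loss reaches its minimum at finite logits, whereas a one-hot target keeps pushing the predicted probability toward 1. This is one way to read the claim that one-hot labels force hard decision boundaries.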

Authors: Guneet Kohli

When annotators disagree on a label, the disagreement itself carries signal, and the number of annotators needed to capture it depends on the evaluation metric. We fine-tune NLI models on label distributions subsampled from ChaosNLI, a dataset providing 100 independent annotator judgments per item, and identify metric-dependent saturation. In our 3-class NLI setting, entropy correlation (whether the model identifies which items elicit disagreement) requires N ≈ 20–50 annotators to converge, while distributional match (KL divergence) saturates by N ≈ 10 (87–95% of improvement across five model seeds). This finding rests on a prior observation: soft labels carry item-specific signal that label smoothing cannot replicate.
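The subsampling idea in the abstract can be illustrated with a small simulation. This is only a sketch under loud assumptions. The data are synthetic Dirichlet draws, not ChaosNLI. No model is fine-tuned, so the metrics compare subsampled label distributions directly against the full 100-judgment distribution rather than model predictions. The pseudo-count, item count, and values of N are arbitrary choices. It does not reproduce the paper's saturation points.

```python
import numpy as np

rng = np.random.default_rng(0)
NUM_ITEMS, NUM_CLASSES, FULL_N = 500, 3, 100

# Synthetic stand-in for a dataset with 100 judgments per item (NOT ChaosNLI).
true_dist = rng.dirichlet(alpha=[0.5, 0.5, 0.5], size=NUM_ITEMS)
judgments = np.stack(
    [rng.choice(NUM_CLASSES, size=FULL_N, p=p) for p in true_dist]
)  # shape: (items, annotators), integer class labels

def to_distribution(labels, num_classes, pseudo_count=0.5):
    """Convert integer labels (items x annotators) into per-item label
    distributions, with a small pseudo-count so no class has probability 0."""
    counts = np.stack(
        [(labels == c).sum(axis=1) for c in range(num_classes)], axis=1
    ) + pseudo_count
    return counts / counts.sum(axis=1, keepdims=True)

def entropy(p):
    """Shannon entropy (nats) per item."""
    return -(p * np.log(p)).sum(axis=1)

def kl_divergence(p, q):
    """KL(p || q) per item."""
    return (p * np.log(p / q)).sum(axis=1)

reference = to_distribution(judgments, NUM_CLASSES)  # all 100 judgments

results = {}
for n in (3, 5, 10, 20, 50):
    # Pick n of the 100 annotators per item, without replacement.
    idx = np.stack(
        [rng.choice(FULL_N, size=n, replace=False) for _ in range(NUM_ITEMS)]
    )
    subsample = np.take_along_axis(judgments, idx, axis=1)
    estimate = to_distribution(subsample, NUM_CLASSES)

    mean_kl = kl_divergence(reference, estimate).mean()
    entropy_r = np.corrcoef(entropy(reference), entropy(estimate))[0, 1]
    results[n] = (mean_kl, entropy_r)
    print(f"N={n:>2}  mean KL={mean_kl:.4f}  entropy r={entropy_r:.3f}")
```

Plotting the two columns of `results` against N is one way to check whether the metrics level off at different annotator counts on your own data.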

Metric-Dependent Annotation Saturation for Learning from Label Distributions - Apple Machine Learning Research

Why this matters

We see a concrete step toward treating annotator disagreement as data, not noise. By showing that the number of judgments required to capture that signal varies with the chosen evaluation metric, the authors challenge the one‑size‑fits‑all approach to crowdsourcing. Their experiments fine‑tuning NLI models on label distributions drawn from ChaosNLI—an archive of 100 independent judgments per item—demonstrate that subsampling can preserve useful disagreement information.

Yet the study is limited to NLI and a single dataset; it is unclear whether metric‑dependent saturation will hold for other tasks or larger, more diverse corpora. For developers, this suggests a potential to reduce annotation costs by tailoring collection depth to specific metrics, but the trade‑off between cost and fidelity remains uncertain. Researchers may need to revisit benchmark designs that assume a fixed annotator count.

Founders should watch for tools that embed this insight, while keeping an eye on whether the approach scales beyond the controlled conditions reported here.
