DeFAb Benchmark Enforces Polynomial-Time Checks for Logical Rigor

By AI Daily Post. Edited by Brian Petersen, Editor-in-Chief

June 18, 2026 • Updated: July 4, 2026 • 4 min read

Reasoning is the bottleneck. Not prose, not parameters, but the quiet, unforgiving machinery of logical structure. DeFAb, a new benchmark, directly weaponizes polynomial-time verifiability against that bottleneck.

Every hypothesis here must pass checks for valid derivation, conservativity, and minimality; logical rigor becomes the scoring metric for creativity itself. The dataset doesn’t just test; it enforces. It holds 372,648+ instances across 33.75 million materialized rules, drawn from 18 sources and layered in three levels, each with gold standards a computer can confirm in polynomial time. The results are damning.
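To make the three checks concrete, here is a minimal sketch in Python. It is not DeFAb's verifier: it assumes a toy setting of monotone Horn rules rather than a defeasible logic, and the function names, the rule encoding, and the reading of "conservativity" as "adds no new conclusion other than the target" are all hypothetical choices made for illustration. What it does show is why such checks can run in polynomial time: each one is a bounded number of forward-chaining passes.

```python
from typing import Dict, FrozenSet, Iterable, List, Set, Tuple

# A rule is (body atoms, head atom). Hypothetical encoding for this sketch only.
Rule = Tuple[FrozenSet[str], str]


def closure(facts: Iterable[str], rules: List[Rule]) -> Set[str]:
    """Forward chaining to a fixed point; polynomial in the number of atoms and rules."""
    known = set(facts)
    changed = True
    while changed:
        changed = False
        for body, head in rules:
            if head not in known and body <= known:
                known.add(head)
                changed = True
    return known


def check_hypothesis(facts: Set[str], rules: List[Rule],
                     hypothesis: List[Rule], target: str) -> Dict[str, bool]:
    """Toy versions of the three checks (assumed definitions, not DeFAb's)."""
    baseline = closure(facts, rules)
    derived = closure(facts, rules + hypothesis)

    # Valid derivation: the target follows once the hypothesis is added.
    valid = target in derived

    # Conservativity (assumed reading): the only new conclusion is the target.
    conservative = (derived - baseline) <= {target}

    # Minimality: dropping any single hypothesis rule breaks the derivation.
    # Forward chaining is monotone, so this implies no proper subset works either.
    minimal = valid and all(
        target not in closure(facts, rules + hypothesis[:i] + hypothesis[i + 1:])
        for i in range(len(hypothesis))
    )
    return {"valid": valid, "conservative": conservative, "minimal": minimal}


facts = {"bird(tweety)"}
rules = [(frozenset({"bird(tweety)"}), "has_wings(tweety)")]
target = "flies(tweety)"

good = [(frozenset({"has_wings(tweety)"}), "flies(tweety)")]
bloated = good + [(frozenset({"bird(tweety)"}), "swims(tweety)")]

print(check_hypothesis(facts, rules, good, target))
print(check_hypothesis(facts, rules, bloated, target))
```

In this toy, the one-rule hypothesis passes all three checks, while the bloated one still derives the target but fails conservativity (it also concludes that Tweety swims) and minimality (the extra rule can be dropped).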

Four frontier models, asked to reason defeasibly, crumble. Rendering-robust Level 2 accuracy ranges from 7.8% to 23.5%. Chain-of-thought variance, at roughly 36 percentage points, exceeds any gap between models.
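The comparison between prompting variance and model differences is simple arithmetic on the quoted figures. The sketch below assumes the 7.8% to 23.5% range is the spread across the four models, which the abstract implies but does not state outright; the variable names are illustrative.

```python
# Figures quoted from the abstract, in percentage points.
level2_low, level2_high = 7.8, 23.5
cot_variance_pp = 36.0  # reported as "~36 pp", so treat as approximate

# Assumed: the Level 2 range is the widest gap between any two models.
inter_model_spread = round(level2_high - level2_low, 1)
ratio = round(cot_variance_pp / inter_model_spread, 2)

print(f"inter-model spread: {inter_model_spread} pp")
print(f"chain-of-thought variance is about {ratio}x that spread")
```

Under that assumption the widest Level 2 gap between models is 15.7 points, so the chain-of-thought variance is more than twice as large.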

A matched contamination control isolates a +19.4 percentage-point gap at Level 3. DeFAb-Hard ups the ante: a 235-instance Level 3 difficulty variant on which the best model achieves 53.3% against a symbolic system’s perfect score. CONJURE, a kernel-verified variant of 560 Lean 4/Mathlib instances, needs no judge at all: Lean 4’s proof kernel does the verifying, and the gold answers are definitions the kernel did not previously contain.
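For readers unfamiliar with kernel verification, the toy Lean 4 snippet below shows the general idea. It is not a CONJURE instance: the definition `IsDoubled` and the theorem name are hypothetical, and it uses only core Lean, no Mathlib. The point is that a new definition and a claim about it are accepted only if the kernel type-checks them, so no human or model judge is involved.

```lean
-- Hypothetical toy definition, not taken from CONJURE or Mathlib.
def IsDoubled (n : Nat) : Prop := ∃ k, n = k + k

-- The kernel accepts this only if the proof term really has the stated type:
-- the witness 3 must satisfy 6 = 3 + 3 by computation.
theorem six_isDoubled : IsDoubled 6 := ⟨3, rfl⟩
```

Replacing the witness `3` with `2` would make the file fail to compile, which is the kind of pass-or-fail signal a judge-free verifier depends on.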

A pilot found zero novel concepts. Zero. This is not a test of fluency.

It is a scalpel for measuring whether a model can build a theory without destroying it.

Because every hypothesis must pass polynomial-time checks for valid derivation, conservativity, and minimality, DeFAb makes logical rigor the instrument for measuring creativity and theoretical reasoning, scoring the disciplined construction of theory revisions rather than fluent but theory-destroying prose. The pipeline pairs taxonomic hierarchies (OpenCyc, YAGO, Wikidata) with behavioral property graphs (ConceptNet, UMLS) to produce 372,648+ instances across 33.75M materialized rules from 18 sources, in three levels with polynomial-time verifiable gold standards. Four frontier models do not reliably internalize defeasible reasoning: rendering-robust Level 2 accuracy is 7.8-23.5%; chain-of-thought variance (~36 pp) exceeds any inter-model gap; and a matched contamination control isolates a +19.4 pp Level 3 gap. We further release DeFAb-Hard (a 235-instance Level 3 difficulty variant; best model 53.3% vs 100% symbolic) and CONJURE (a kernel-verified transformative-creativity variant of 560 Lean 4/Mathlib instances whose gold answers are definitions the proof kernel did not previously contain, judge-free verifier; a pilot finds zero novel concepts).

DeFAb: A Verifiable Benchmark for Defeasible Abduction in Foundation Models - ArXiv AI (cs.AI)

The numbers speak for themselves: 7.8% to 23.5% accuracy on rendering-robust defeasible reasoning. Chain-of-thought variance that swamps any model-to-model gap. A contamination control that isolates a nineteen-point deficit.

These are not marginal failures. They are structural cracks in how frontier systems handle the logic of revision, the core of abduction. DeFAb does not merely measure those cracks.

It exposes them through a framework where every hypothesis must pass polynomial-time checks for validity, conservativity, and minimality. The benchmark turns logical rigor into a scoring instrument, not a philosophical aspiration. The three-tiered pipeline, the 33.75 million materialized rules, the kernel-verified CONJURE variant, all converge on a single demand: that creative reasoning earn its keep through verifiable derivation.

And yet the CONJURE pilot finds zero novel concepts. The best model on DeFAb-Hard scrapes 53.3% against a symbolic 100%.

The gap is not noise; it is a challenge. The benchmark is now published. The test is set.

The question is not whether models can talk about logic; it is whether they can be logical.

Common Questions Answered

What are the three main polynomial-time checks that DeFAb enforces on logical hypotheses?

DeFAb enforces three critical checks: valid derivation, conservativity, and minimality. These checks ensure that every hypothesis in the benchmark meets rigorous logical standards, making logical rigor the primary scoring metric rather than other factors like prose quality or model parameters.

How large is the DeFAb dataset and what sources does it draw from?

The DeFAb benchmark contains 372,648+ instances across 33.75 million materialized rules, drawn from 18 sources. The pipeline pairs taxonomic hierarchies (OpenCyc, YAGO, Wikidata) with behavioral property graphs (ConceptNet, UMLS) and organizes the instances into three levels, each with gold standards verifiable in polynomial time.

What do the accuracy results reveal about current frontier systems' defeasible reasoning capabilities?

The four frontier models tested achieve only 7.8% to 23.5% rendering-robust accuracy at Level 2 of DeFAb. The paper’s abstract concludes that these models do not reliably internalize defeasible reasoning, which points to a gap in how they handle the logic of revision rather than a marginal performance issue.

Why is chain-of-thought variance problematic according to the DeFAb benchmark findings?

At roughly 36 percentage points, the chain-of-thought variance in DeFAb results exceeds any gap between individual models. That suggests inconsistent reasoning is a problem shared across systems rather than a model-specific weakness.

What does the nineteen-point contamination control deficit reveal about model performance?

The paper’s abstract reports that a matched contamination control isolates a +19.4 percentage-point gap at Level 3. The abstract does not describe how the control is constructed or what drives the gap, so the figure is best read as a measured difference between matched conditions, not as proof of either memorization or a pure reasoning failure.

