Benchmark shows Claude Fable 5 fully passes only 3% of tasks; on 31 of 91, no model reaches 50%

Artificial Analysis has rolled out a new benchmark called AA-Briefcase, designed to test AI on multi-week knowledge-work projects. The suite strings together thousands of fragmented files (Slack threads, emails, meeting transcripts, massive data exports) so that the tasks resemble real-world information work rather than tidy prompts. Even the top model, Anthropic’s Claude Fable 5, satisfies every rubric criterion on just 3 percent of the 91 tasks.

On 31 of those tasks, no model reaches even half of the required score. The study notes a shift in error patterns as models improve: weaker systems miss obvious files or produce unusable output, while stronger ones meet headline requirements but slip on details that only emerge when pieces are cross‑referenced. Cost also varies dramatically.

Per-task pricing ranges from roughly $0.04 for DeepSeek V4 Flash to more than $31 for Claude Fable 5, a gap of roughly 800-fold. The results raise the question of how ready current AI is for the kind of nuanced, source-heavy work that businesses actually need.

The top performer, Claude Fable 5, hits the highest rubric pass rate but still nails all criteria on just 3 percent of tasks. On 31 out of 91 tasks, no model even clears 50 percent. The types of errors shift as models get better.

Weaker models choke on basic execution: they miss relevant files or spit out unusable results. Stronger models fail more quietly: they hit the obvious requirements but miss details you'd only catch by piecing together information from multiple sources. There is also a significant price gap. Per-task costs span more than 800x, from about $0.04 for DeepSeek V4 Flash to over $31 for Claude Fable 5.
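For readers who want to check the arithmetic behind these figures, here is a minimal sketch in Python. It uses only the rounded numbers quoted in this article; the variable and function names are illustrative assumptions, and the $31 figure is treated as a lower bound because the report says "over $31".

```python
# Figures as quoted in the article; both prices are approximate.
cheapest_cost_usd = 0.04   # "about $0.04" per task (DeepSeek V4 Flash)
priciest_cost_usd = 31.0   # "over $31" per task (Claude Fable 5), a lower bound
total_tasks = 91
tasks_below_half = 31      # tasks where no model reaches 50 percent


def cost_ratio(high: float, low: float) -> float:
    """Return how many times more expensive `high` is than `low`."""
    return high / low


def share(part: int, whole: int) -> float:
    """Return `part` as a fraction of `whole`."""
    return part / whole


ratio = cost_ratio(priciest_cost_usd, cheapest_cost_usd)
hard_share = share(tasks_below_half, total_tasks)

print(f"cost ratio from rounded figures: about {ratio:.0f}x")
print(f"tasks where no model reaches 50%: {hard_share:.0%}")
```

With the rounded inputs the ratio works out to about 775x, so the "more than 800x" figure presumably reflects the unrounded prices. The 31 tasks on which no model reaches 50 percent amount to roughly a third (34 percent) of the benchmark.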

Why this matters

The AA-Briefcase results confirm what many of us suspected: even the most advanced models still stumble on day-to-day knowledge work. Claude Fable 5, the current leader, meets all rubric criteria on just 3 percent of the tasks, and on 31 of the 91 tasks no model reaches a 50 percent score. That gap isn’t a one-off glitch; the benchmark shows error patterns shifting as models improve: weaker models miss obvious files, while stronger ones falter on details that only emerge when sources are cross-referenced.

For developers, this signals that building reliable pipelines will require more than scaling model size. Founders should temper expectations about AI‑driven automation of complex projects until these gaps shrink. Researchers are left with a concrete yardstick to target, but the path to consistent, real‑world performance remains uncertain.

We can appreciate the incremental progress, yet we must stay skeptical about claims that current systems are ready for unassisted knowledge‑intensive roles. The data urges a cautious, evidence‑first approach as we design the next generation of AI tools.

Further Reading