Anthropic benchmark says Claude matches experts, 23 tasks remain ambiguous


Anthropic has rolled out a fresh benchmark, BioMysteryBench, aimed at gauging Claude’s performance against seasoned bioinformatics specialists. The suite comprises 84 distinct tasks, spanning everything from protein‑fold prediction to pathway analysis. In the first run, Claude cleared 61 of those items, matching the accuracy levels that human experts reported in the original studies.

That result has sparked conversation among researchers who have long debated whether large language models can truly replicate domain‑specific reasoning. Yet the picture isn’t uniformly clear: a subset of the tasks proved stubborn, prompting the company to flag them for further scrutiny.

The lingering uncertainty raises questions about the nature of those problems and the design of the evaluation itself. Do these tasks represent a hard ceiling for the model, or do they merely expose gaps in the current panel of evaluators? The answers could shape how future benchmarks are built and how we interpret claims of AI‑human parity.

Claude's performance on BioMysteryBench is noteworthy. Yet 23 of the 84 tasks remain ambiguous, and Anthropic concedes it is unclear whether those tasks are fundamentally unsolvable or merely exceptionally hard. Because the panel of human experts used for comparison may not represent the full breadth of expertise, whether a larger or differently composed panel could have solved the same problems remains an open question.

The authors argue that on the solvable problems Claude now matches human experts, suggesting a step toward practical utility in bioinformatics research. However, measuring AI competence in this domain remains fraught with blind spots, as existing tests like MMLU‑Pro or GPQA focus on factual recall rather than hands‑on research skill. Consequently, the claim of expert‑level capability should be weighed against the unresolved portion of the benchmark and the methodological limits acknowledged by Anthropic.

In short, the results are promising but tempered by significant uncertainties about the remaining tasks and the broader applicability of the findings.
