

Open AI Models Breakthrough: Category-Specific Performance

Open models cross threshold; frontier models show per‑category correctness


The latest benchmark run shows a clear split in how open-source and commercial systems handle category-specific tasks. With frontier models (those pushing the limits of scale) evaluated side by side, the results already tell a story.

Gemini 3+ tops the chart with a "high" rating, OpenAI lands in the "medium" bracket, and Claude, which was evaluated without extended thinking, trails behind. The CI run lets you click each model name for a deeper dive, and a DIY section shows how to reproduce the tests. The per-category correctness scores are the first public signal that open models have moved beyond the experimental phase and are competing on the same footing as proprietary systems.

The results signal more than a data point; they hint at a shift in how researchers and developers will measure progress.

Open models

View CI run (click model names to view individual evals)

Per-category correctness:

Frontier models

View CI run (click model names to view individual evals)

Per-category correctness:
- For Gemini 3+, this is high
- For OpenAI, this is medium
- For Claude, this is without extended thinking

DIY: Run Deep Agent evals locally

Our CI runs the same evaluation suite across 52 models organized into groups -- including an open group (baseten:zai-org/GLM-5, ollama:minimax-m2.7:cloud, ollama:nemotron-3-super) that runs on every eval workflow. You can target any model group.

Run evals against all open models:

    pytest tests/evals --model-group open

Run against a specific model:

    pytest tests/evals --model baseten:zai-org/GLM-5

This makes it straightforward to compare open models against each other and against closed frontier models on the same tasks, using the same grading criteria.
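To make the "per-category correctness" numbers concrete, here is a minimal sketch of how such scores can be aggregated from pass/fail eval results. The `(category, passed)` tuple shape and the function name are assumptions for illustration; the actual Deep Agents harness may record and aggregate results differently.

```python
from collections import defaultdict

def per_category_correctness(results):
    """Aggregate pass/fail eval results into per-category pass rates.

    `results` is a list of (category, passed) tuples -- a hypothetical
    shape chosen for this sketch, not the harness's real data model.
    """
    totals = defaultdict(int)
    passes = defaultdict(int)
    for category, passed in results:
        totals[category] += 1
        if passed:
            passes[category] += 1
    # Correctness per category = fraction of evals passed in that category.
    return {cat: passes[cat] / totals[cat] for cat in totals}

results = [
    ("file manipulation", True),
    ("file manipulation", True),
    ("tool use", True),
    ("tool use", False),
    ("instruction following", True),
]
print(per_category_correctness(results))
# → {'file manipulation': 1.0, 'tool use': 0.5, 'instruction following': 1.0}
```

A real CI run would feed one such result list per model, producing the per-model rows shown in the correctness tables.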

Did the recent evaluations finally prove that open‑weight LLMs can stand shoulder‑to‑shoulder with closed frontier models? The Deep Agents harness runs over the past weeks suggest they have. GLM‑5 from z.ai and MiniMax M2.7 each posted scores comparable to the leading proprietary systems on core agent tasks—file manipulation, tool use, and instruction following.
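The "core agent tasks" above are graded by checking concrete side effects rather than free-form text. A hedged sketch of what one file-manipulation eval might look like follows; `run_agent` is a stand-in stub, not the harness's real API, and the actual tests in `tests/evals` will differ.

```python
import pathlib
import tempfile

def run_agent(task: str, workdir: pathlib.Path) -> None:
    """Stand-in for a model-backed agent. A real harness would send
    `task` to the model under test; this stub just performs the
    expected action so the grading logic below is runnable."""
    (workdir / "notes.txt").write_text("hello\n")

def test_file_manipulation():
    # Grade a file-manipulation task the way an eval harness might:
    # give the agent a sandboxed directory, then assert on the
    # resulting file system state.
    with tempfile.TemporaryDirectory() as tmp:
        workdir = pathlib.Path(tmp)
        run_agent("Create notes.txt containing 'hello'", workdir)
        assert (workdir / "notes.txt").read_text() == "hello\n"

test_file_manipulation()
print("file manipulation: pass")
```

Because the same assertion runs for every model, open and closed systems are scored against identical success criteria.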

That similarity is notable, yet the data cover only the initial set of evaluations, so broader generalization remains uncertain. Per-category correctness tables show a mixed picture among frontier offerings: Gemini 3+ registers high performance, OpenAI lands in the medium range, and Claude was evaluated without extended thinking. The open-model results line up with those frontier scores, but the per-category breakdown for the open models is not detailed in the summary.

Consequently, while the threshold claim appears supported, it is unclear whether open models will maintain parity across all categories or under more demanding scenarios. Further testing will be needed to confirm the durability of these gains.


Common Questions Answered

How do frontier models like Gemini 3+ compare in per-category correctness evaluations?

Gemini 3+ tops the benchmark chart with a 'high' rating, while OpenAI lands in the 'medium' bracket and Claude, evaluated without extended thinking, trails behind. The evaluations cover 52 models across different groups, providing a broad comparison of model performance.

What evidence suggests open-weight LLMs can compete with closed frontier models?

Recent Deep Agents evaluations show that open models like GLM-5 from z.ai and MiniMax M2.7 have posted scores comparable to leading proprietary systems in core agent tasks such as file manipulation, tool use, and instruction following. However, the data covers only an initial set of evaluations, so broader generalization remains uncertain.

What makes the current benchmark run significant for AI model comparisons?

The benchmark run reveals a clear split in how open-source and commercial systems handle category-specific tasks, allowing for detailed side-by-side evaluations of frontier models. The CI run enables users to click on each model name for a deeper dive into individual performance metrics across different evaluation categories.