Anthropic's Sonnet 4.6 hits 79.6% on SWE-bench, costs one‑fifth of Opus
Why does this matter? Because Anthropic just put a price tag on flagship‑level coding ability. Sonnet 4.6, the company's latest model, claims to deliver performance that rivals its higher‑priced sibling, Opus 4.6, at roughly one fifth of the cost.
That cost gap could tip the scales for businesses weighing AI‑driven development tools against traditional engineering spend. While the numbers sound impressive on paper, the real test is whether the model holds up on benchmarks that matter to developers today. The SWE‑bench Verified suite, widely used to gauge real‑world software coding skill, and the OSWorld‑Verified agentic computer‑use test are two such yardsticks.
If Sonnet 4.6 can stay competitive on those fronts, the economics of AI‑assisted coding may shift dramatically. Below, Anthropic’s own benchmark table lays out the details.
The benchmark table Anthropic released paints a striking picture. On SWE-bench Verified, the industry-standard test for real-world software coding, Sonnet 4.6 scored 79.6% -- nearly matching Opus 4.6's 80.8%. On agentic computer use (OSWorld-Verified), Sonnet 4.6 scored 72.5%, essentially tied with Opus 4.6's 72.7%.
On office tasks (GDPval-AA Elo), Sonnet 4.6 actually scored 1633, surpassing Opus 4.6's 1606. On agentic financial analysis, Sonnet 4.6 hit 63.3%, beating every model in the comparison, including Opus 4.6 at 60.1%. In many of the categories enterprises care about most, Sonnet 4.6 matches or beats models that cost five times as much to run.
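For context on what that Elo gap means in practice, here is a minimal sketch that converts the 27-point margin into an expected head-to-head win rate. It assumes the GDPval-AA leaderboard follows the classic Elo expected-score formula, which is our assumption rather than something stated in the benchmark documentation.

```python
# Convert the reported GDPval-AA Elo gap (1633 vs. 1606) into an expected
# head-to-head win rate, assuming the classic Elo model applies.
def elo_expected_score(rating_a: float, rating_b: float) -> float:
    """Probability that A beats B under the standard Elo formula."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))

p = elo_expected_score(1633, 1606)  # Sonnet 4.6 vs. Opus 4.6
print(f"Expected win rate for Sonnet 4.6: {p:.1%}")  # roughly 54%
```

In other words, a 27-point Elo edge translates into only a modest expected advantage, consistent with the framing of the two models as roughly comparable on office tasks.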
An enterprise running an AI agent that processes 10 million tokens per day previously had to choose between inferior results at lower cost and superior results at rapidly scaling expense; Sonnet 4.6 is pitched as collapsing that trade-off (a back-of-the-envelope cost comparison follows below). In Claude Code, early testing found that users preferred Sonnet 4.6 over Sonnet 4.5 roughly 70% of the time. Users even preferred Sonnet 4.6 to Opus 4.5, Anthropic's frontier model from November, 59% of the time.
They rated Sonnet 4.6 as significantly less prone to over-engineering and "laziness," and meaningfully better at instruction following.
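To make the "rapidly scaling expense" point concrete, the sketch below prices a 10-million-token-per-day workload under two hypothetical per-million-token rates. The dollar figures are placeholders, not Anthropic's published prices; the only relationship carried over from the article is the roughly five-to-one cost ratio between Opus 4.6 and Sonnet 4.6.

```python
# Back-of-the-envelope cost comparison for an agent processing
# 10 million tokens per day. The per-million-token prices are
# HYPOTHETICAL placeholders; only the ~5x ratio comes from the article.
DAILY_TOKENS = 10_000_000

ASSUMED_PRICE_PER_MTOK = {
    "Opus 4.6 (assumed rate)": 25.00,
    "Sonnet 4.6 (assumed rate)": 5.00,  # one fifth of the flagship rate
}

for model, price in ASSUMED_PRICE_PER_MTOK.items():
    daily = DAILY_TOKENS / 1_000_000 * price
    print(f"{model:<26} ${daily:8,.2f}/day   ${daily * 30:10,.2f}/month")
```

At that volume the gap compounds quickly: whatever the real rates turn out to be, a five-fold difference per token is the difference between a rounding error and a budget line item.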
Will enterprises shift to Sonnet 4.6? The numbers suggest a compelling case. Scoring 79.6% on SWE‑bench, the model trails Opus 4.6 by just 1.2 points while costing only a fifth of the price, a gap that could influence budgeting decisions.
Its 72.5% result on OSWorld‑Verified shows parity in agentic computer use, reinforcing the claim of near‑flagship capability across coding, long‑context reasoning, and design tasks. Yet the benchmark table alone can’t confirm real‑world performance under diverse workloads, and the beta status of the 1 million‑token context window leaves its stability unproven. Anthropic’s positioning of Sonnet 4.6 as the default model signals confidence, but adoption will depend on how firms evaluate cost savings against any potential trade‑offs in reliability or support.
The upgrade across multiple domains is notable, but whether it will translate into broader corporate uptake remains uncertain. For now, the data present a clear, if cautious, indication that a lower‑cost alternative can approach flagship metrics without overtly compromising key benchmarks.
Further Reading
- Claude Sonnet 5 vs Opus 4.6 - Byte Bot
- Claude Opus 4.6 vs 4.5 Benchmarks (Explained) - Vellum
- Anthropic Claude Opus 4.6: Is the Upgrade Worth It? - Codecademy
- Introducing Claude Opus 4.6 - Anthropic
Common Questions Answered
How does Claude Opus 4.5 perform on the SWE-bench Verified benchmark?
Claude Opus 4.5 achieved an unprecedented 80.9% on the SWE-bench Verified benchmark, making it the first AI model to exceed 80% and to surpass all human engineering candidates. This milestone represents a significant breakthrough in AI coding capabilities, outperforming competitors like GPT-5.1 (74.2%) and Gemini 3 Pro (71.8%).
What makes Claude Opus 4.5's pricing unique in the AI coding assistant market?
Claude Opus 4.5 is priced at $5 per million input tokens and $25 per million output tokens, a 66% reduction from Anthropic's previous pricing. Additional cost savings are available through prompt caching (up to 90%) and batch processing (50%), making advanced AI coding capabilities more accessible to a broader range of developers and enterprises.
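As a rough illustration of how those discounts stack, the sketch below applies the 90% prompt-caching saving to the cached share of input tokens and the 50% batch discount to the overall bill. The $5/$25 per-million rates come from the answer above; the token volumes, cache-hit ratio, and stacking rules are illustrative assumptions, and real billing (for example, cache-write surcharges) may differ.

```python
# Rough cost estimator for the discounts described above.
# Rates are the $5 input / $25 output per-million figures quoted above;
# the cached_fraction, token counts, and stacking rules are assumptions.
INPUT_RATE = 5.0    # $ per million input tokens
OUTPUT_RATE = 25.0  # $ per million output tokens

def estimate_cost(input_mtok: float, output_mtok: float,
                  cached_fraction: float = 0.0, batch: bool = False) -> float:
    cached = input_mtok * cached_fraction          # input served from cache
    fresh = input_mtok - cached                    # input billed at full rate
    cost = (fresh * INPUT_RATE
            + cached * INPUT_RATE * 0.10           # 90% saving on cached input
            + output_mtok * OUTPUT_RATE)
    return cost * (0.5 if batch else 1.0)          # 50% batch discount

# Example: 8M input tokens (70% cache hits) plus 2M output tokens.
print(f"${estimate_cost(8, 2, cached_fraction=0.7, batch=True):,.2f}")
```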
What are the key technical innovations in Claude Opus 4.5?
The model introduces several technical innovations, including new compression algorithms that reduce input requirements by 30% while maintaining quality, and an innovative 'effort' parameter that allows developers to adjust reasoning intensity. Additionally, the model provides native-level support for multiple programming languages including Python, JavaScript, TypeScript, Java, C++, Go, and Rust.
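For developers curious how the 'effort' knob might look in code, here is a minimal sketch using the official anthropic Python SDK. The messages.create call and the extra_body escape hatch are standard SDK features, but the parameter name, its accepted values, and the model ID shown here are assumptions drawn from the description above rather than confirmed API details; check Anthropic's current documentation before relying on them.

```python
# Minimal sketch: requesting lower reasoning intensity via an 'effort'
# setting. The field name, allowed values, and model ID are ASSUMPTIONS
# based on the description above, passed through extra_body so the SDK
# forwards them as-is; consult Anthropic's docs for the real parameter.
from anthropic import Anthropic

client = Anthropic()  # reads ANTHROPIC_API_KEY from the environment

response = client.messages.create(
    model="claude-opus-4-5",  # illustrative model ID
    max_tokens=1024,
    messages=[{"role": "user", "content": "Summarize this diff in two sentences."}],
    extra_body={"effort": "low"},  # assumed knob for reasoning intensity
)
print(response.content[0].text)
```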