Artificial Analysis overhauls AI Intelligence Index; GPT-5.2 beats or ties professionals on 70.9% of tasks
Artificial Analysis has just reshaped its AI Intelligence Index, swapping out the old benchmark suite for a set of “real‑world” tests. The move follows a broader push to gauge language models on tasks that mirror everyday professional work rather than abstract puzzles. While the new framework promises a clearer picture of how models perform on concrete jobs, the numbers it produces are already sparking conversation.
OpenAI’s latest release, GPT‑5.2, was run through the original GDPval evaluation—a benchmark that pits the model against seasoned practitioners across a range of occupations. The results, released alongside the index overhaul, suggest the system is not merely competitive but often ahead of human experts. This claim, backed by a detailed breakdown of performance across 44 distinct roles, is meant to demonstrate that the model can handle well‑specified knowledge work at scale.
The stakes are high: if the figures hold up, they could reshape expectations about what AI can reliably do in professional settings.
On the original GDPval evaluation, GPT-5.2 beat or tied top industry professionals on 70.9% of well-specified tasks, according to OpenAI. The company claims GPT-5.2 "outperforms industry professionals at well-specified knowledge work tasks spanning 44 occupations," with companies including Notion, Box, Shopify, Harvey, and Zoom observing "state-of-the-art long-horizon reasoning and tool-calling performance."
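The headline metric itself is simple arithmetic: expert graders compare a model's deliverable against a professional's deliverable for the same task, and the score is the share of tasks where the model wins or ties. The toy sketch below (invented sample data, not OpenAI's grading code) shows how such a beat-or-tie rate is computed.

```python
# Toy illustration of a GDPval-style "beat or tie" rate computed
# from blind pairwise grader verdicts. The sample data is invented.
from collections import Counter

# Each verdict records which deliverable an expert grader preferred:
# "model", "professional", or "tie".
verdicts = ["model", "tie", "professional", "model", "tie"]

counts = Counter(verdicts)
beat_or_tie = (counts["model"] + counts["tie"]) / len(verdicts)
print(f"Beat-or-tie rate: {beat_or_tie:.1%}")  # 80.0% on this toy sample
```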
The emphasis on economically measurable output is a philosophical shift in how the industry thinks about AI capability. Rather than asking whether a model can pass a bar exam or solve competition math problems (achievements that generate headlines but don't necessarily translate to workplace productivity), the new benchmarks ask whether AI can actually do jobs.

Graduate-level physics problems expose the limits of today's most advanced AI models

While GDPval-AA measures practical productivity, another new evaluation called CritPT reveals just how far AI systems remain from true scientific reasoning.
Does the new Intelligence Index finally solve the measurement problem? Artificial Analysis says its v4.0 framework, built on ten real-world evaluations, moves beyond the fast-saturating benchmarks that have long plagued the field. Yet the shift raises questions about consistency: how will the new tests compare to legacy scores across diverse models?
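Artificial Analysis has not published the exact aggregation formula in the material covered here, so the sketch below assumes the simplest possible scheme, an equal-weight mean over normalized 0-100 scores, with placeholder evaluation names rather than the actual v4.0 suite.

```python
# A minimal sketch of composite-index aggregation, assuming (not confirmed
# by Artificial Analysis) an equal-weight mean over normalized 0-100 scores.
# Evaluation names and values here are placeholders, not the real v4.0 suite.
scores = {
    "agentic_coding": 62.0,
    "long_horizon_tool_use": 58.5,
    "document_analysis": 71.2,
    # ... seven further real-world evaluations would complete the set of ten
}

intelligence_index = sum(scores.values()) / len(scores)
print(f"Composite index: {intelligence_index:.1f}")
```

Whatever the real weighting, any such composite inherits the question raised above: scores on the new evaluations are not directly interchangeable with scores on the suite they replace.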
Still, the headline numbers deserve scrutiny. The 70.9% beat-or-tie figure and the 44-occupation claim come from a single vendor's reporting, and the broader community has not yet independently validated the results.
Moreover, the definition of “well‑specified” remains vague, leaving it unclear whether the tasks capture the full complexity of real‑world work. Artificial Analysis's overhaul may provide a more nuanced picture, but whether it will become the de facto standard for measuring AI progress is uncertain. For now, the numbers invite cautious optimism tempered by the need for transparent verification.
Further Reading
- GPT-5.2 Review: Benchmarks (AIME 100%), Visual AI ... - Vertu
- GPT-5.2 Benchmarks - Vellum AI
- How GPT-5.2 stacks up against Gemini 3.0 and Claude Opus 4.5 - RDWorld Online
- GPT-5.2: Pricing, Context Window, Benchmarks, and More - LLM Stats
Common Questions Answered
How does the new AI Intelligence Index differ from the previous benchmark suite?
The new AI Intelligence Index, introduced by Artificial Analysis, replaces the old benchmark suite with ten real‑world evaluations that simulate everyday professional tasks. This shift aims to assess language models on concrete job‑related performance rather than abstract puzzles, providing a clearer picture of practical capabilities.
What percentage of well‑specified tasks did GPT‑5.2 beat or tie with top industry professionals in the original GDPval evaluation?
According to OpenAI, GPT‑5.2 beat or tied top industry professionals on 70.9% of well‑specified tasks in the original GDPval evaluation. This result covers a range of knowledge‑work tasks across 44 occupations, indicating strong performance relative to human experts.
Which companies reported state‑of‑the‑art long‑horizon reasoning and tool‑calling performance from GPT‑5.2?
Companies such as Notion, Box, Shopify, Harvey, and Zoom observed GPT‑5.2 delivering state‑of‑the‑art long‑horizon reasoning and tool‑calling performance. Their feedback highlights the model’s ability to handle complex, multi‑step tasks in real‑world settings.
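None of these companies has published their integration code, so the following is only a schematic of the tool-calling loop such deployments rely on: the model either requests a tool or answers, and a harness dispatches the call and feeds the result back. `model_step`, `get_weather`, and the message format are all hypothetical stand-ins, not any vendor's actual API.

```python
# Schematic tool-calling loop. `model_step` is a hypothetical stand-in for
# a real model API call; only the dispatch pattern is the point here.
import json

def get_weather(city: str) -> str:
    # Stubbed tool: a real deployment would call an external service.
    return json.dumps({"city": city, "temp_c": 18})

TOOLS = {"get_weather": get_weather}

def model_step(history: list) -> dict:
    # Stand-in logic: request a tool once, then answer using its result.
    if not any(m["role"] == "tool" for m in history):
        return {"type": "tool_call", "name": "get_weather",
                "arguments": {"city": "Berlin"}}
    return {"type": "answer", "content": "It is 18 °C in Berlin."}

history = [{"role": "user", "content": "Weather in Berlin?"}]
while True:
    step = model_step(history)
    if step["type"] == "tool_call":
        # Dispatch the requested tool and feed the result back to the model.
        result = TOOLS[step["name"]](**step["arguments"])
        history.append({"role": "tool", "content": result})
    else:
        print(step["content"])
        break
```

Long-horizon performance is essentially a question of how many such iterations a model can chain together without losing the thread.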
What concerns remain about the consistency of the new Intelligence Index compared to legacy benchmarks?
While the v4.0 framework of the Intelligence Index promises more relevant measurements, analysts question how its scores will align with legacy benchmarks across diverse models. The concern centers on whether the new real‑world tests can be directly compared to older evaluation metrics without losing continuity.
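One concrete way to check that continuity (not something Artificial Analysis has said it does) would be to rank-correlate models' legacy index scores against their v4.0 scores; the model names and scores below are invented for illustration.

```python
# Rank-correlating legacy vs. v4.0 scores as a continuity check.
# All data below is invented; requires scipy.
from scipy.stats import spearmanr

legacy = {"model_a": 71, "model_b": 65, "model_c": 58, "model_d": 52}
v4 = {"model_a": 66, "model_b": 61, "model_c": 60, "model_d": 48}

names = sorted(legacy)
rho, p = spearmanr([legacy[n] for n in names], [v4[n] for n in names])
print(f"Spearman rho = {rho:.2f} (p = {p:.2f})")
# A high rho would suggest the new tests preserve the old ranking;
# a low rho would signal a genuine reshuffle of the leaderboard.
```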