
GPT-5.5 Crushes Benchmarks with 82.7% Agentic AI Score

OpenAI launches GPT-5.5, hits 82.7% on Terminal-Bench 2.0, 84.9% on GDPval


OpenAI just rolled out GPT-5.5, a fully retrained agentic model that clocks 82.7% on Terminal-Bench 2.0 and 84.9% on GDPval. Those numbers look tidy on paper, but they tell only part of the story. The real test for any large language model isn't a single-prompt quiz; it's whether the system can stay useful across the kind of prolonged, iterative work developers actually do.

Think of a codebase that’s been around for years, a set of interdependent modules, or a refactor that stretches over several days. That’s the terrain where productivity gains become measurable. OpenAI’s latest release includes data on an internal benchmark called Expert‑SWE, which focuses on exactly that sort of extended engineering effort.

The benchmark measures tasks with a median estimated human completion time of 20 hours, offering a glimpse into how the model handles the grind of multi‑session development work.

That focus matters because Expert-SWE reflects the kind of extended, multi-session engineering work (large refactors, feature builds, debugging deep in a codebase) that agentic tools are increasingly being asked to handle autonomously. Developers who tested the system early said GPT-5.5 has a better sense of the "shape" of a software system: it can better understand why something is failing, where the fix is needed, and what else in the codebase would be affected.

Will GPT-5.5 live up to its promise? OpenAI positions the new model as the first fully retrained base since GPT-4.5, and the headline scores suggest a measurable jump over previous releases. Yet some of the benchmarks are internal, and all of them cover specific task families.

Designed to tackle complex, multi-step computer tasks with minimal human direction, the model is framed as an assistant that grasps the underlying goal rather than following a checklist. It rolls out today to Plus, Pro, Business, and Enterprise users via ChatGPT and Codex.
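
To make the agentic framing concrete, here is a minimal sketch of the plan-act-observe loop that benchmarks like Terminal-Bench exercise: the model proposes a shell command, a harness runs it, and the output is fed back for the next step. Everything in it is an illustrative assumption rather than OpenAI's published tooling; in particular, the model identifier gpt-5.5 and the task prompt are hypothetical.

```python
# Hypothetical plan-act-observe agent loop, for illustration only.
# The model id "gpt-5.5" is an assumption (the article describes rollout
# via ChatGPT and Codex, not an API identifier), and this bare-bones
# harness is a stand-in, not OpenAI's actual tooling.
import subprocess

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

SYSTEM_PROMPT = (
    "You are a coding agent working in a shell. Reply with exactly one "
    "shell command per turn, with no commentary, or the single word DONE "
    "when the task is complete."
)

history = [
    {"role": "system", "content": SYSTEM_PROMPT},
    {"role": "user", "content": "Make the failing test in tests/test_parser.py pass."},
]

for _ in range(20):  # hard cap so the loop always terminates
    # Plan: ask the model for its next action given the transcript so far.
    reply = client.chat.completions.create(model="gpt-5.5", messages=history)
    command = reply.choices[0].message.content.strip()
    if command == "DONE":
        break
    # Act: run the proposed command in the working directory.
    result = subprocess.run(
        command, shell=True, capture_output=True, text=True, timeout=120
    )
    # Observe: feed stdout/stderr back so the next step can react to failures.
    history.append({"role": "assistant", "content": command})
    history.append({"role": "user", "content": result.stdout + result.stderr})
```

Real harnesses layer sandboxing, diff review, and richer stopping criteria on top; the sketch only shows the feedback loop that separates agentic evaluation from single-prompt quizzes.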

No Expert-SWE score has been disclosed, however, leaving it unclear how the model actually performs on those extended engineering projects. The rollout is also limited to paid tiers, so broader accessibility remains uncertain. Ultimately, the data presented is promising, but real-world impact has yet to be demonstrated.


Common Questions Answered

How does GPT-5.5 perform on the Expert-SWE benchmark for long-horizon coding tasks?

OpenAI says GPT-5.5 was evaluated on Expert-SWE, an internal benchmark that measures tasks with a median estimated human completion time of 20 hours, although the model's score has not been published. The benchmark matters because it assesses the ability to handle complex, extended engineering work like large refactors, feature builds, and deep codebase debugging.

What makes GPT-5.5 different from previous OpenAI language models?

GPT-5.5 is the first fully retrained base model since GPT-4.5, achieving impressive scores of 82.7% on Terminal-Bench 2.0 and 84.9% on GDPval. The model is designed to be an agentic assistant that understands underlying goals, rather than simply following a rigid checklist of instructions.

What are the key challenges in evaluating GPT-5.5's capabilities?

While GPT-5.5 shows promising benchmark results, the evaluation leans in part on internal benchmarks and specific task families. The true test of the model lies in its ability to perform prolonged, iterative work across complex coding environments and interdependent modules.