
GPT-5.5 Crushes Benchmarks with 82.7% Agentic AI Score

OpenAI launches GPT-5.5, hits 82.7% on Terminal-Bench 2.0, 84.9% on GDPval


OpenAI just rolled out GPT-5.5, a fully retrained agentic model that clocks 82.7% on Terminal-Bench 2.0 and 84.9% on GDPval. Those numbers look tidy on paper, but they tell only part of the story. The real test for any large language model isn't a single-prompt quiz; it's whether the system can stay useful across the kind of prolonged, iterative work developers actually do.

Think of a codebase that’s been around for years, a set of interdependent modules, or a refactor that stretches over several days. That’s the terrain where productivity gains become measurable. OpenAI’s latest release includes data on an internal benchmark called Expert‑SWE, which focuses on exactly that sort of extended engineering effort.

The benchmark measures tasks with a median estimated human completion time of 20 hours, offering a glimpse into how the model handles the grind of multi‑session development work.

That focus matters because Expert-SWE reflects the kind of extended, multi-session engineering work (large refactors, feature builds, debugging deep in a codebase) that agentic tools are increasingly being asked to handle autonomously. Developers who tested the system early said GPT-5.5 has a better sense of the "shape" of a software system: it can better understand why something is failing, where the fix is needed, and what else in the codebase would be affected.

Will GPT-5.5 live up to its promise? OpenAI positions the new model as the first fully retrained base since GPT-4.5, and the headline scores suggest a measurable jump over previous releases. Yet some of the benchmarks are internal, and all of them cover specific task families.

Designed to tackle complex, multi-step computer tasks with minimal human direction, the model is framed as an assistant that grasps the underlying goal rather than following a checklist. It rolls out today to Plus, Pro, Business, and Enterprise users via ChatGPT and Codex.
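
To make the agentic framing concrete, here is a minimal sketch of the plan-act-observe loop that benchmarks like Terminal-Bench exercise: the model proposes a shell command, a harness runs it, and the output is fed back for the next step. Everything in it is an illustrative assumption rather than OpenAI's published tooling; in particular, the model identifier gpt-5.5 and the task prompt are hypothetical.

```python
# Hypothetical plan-act-observe agent loop, for illustration only.
# The model id "gpt-5.5" is an assumption (the article describes rollout
# via ChatGPT and Codex, not an API identifier), and this bare-bones
# harness is a stand-in, not OpenAI's actual tooling.
import subprocess

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

SYSTEM_PROMPT = (
    "You are a coding agent working in a shell. Reply with exactly one "
    "shell command per turn, with no commentary, or the single word DONE "
    "when the task is complete."
)

history = [
    {"role": "system", "content": SYSTEM_PROMPT},
    {"role": "user", "content": "Make the failing test in tests/test_parser.py pass."},
]

for _ in range(20):  # hard cap so the loop always terminates
    # Plan: ask the model for its next action given the transcript so far.
    reply = client.chat.completions.create(model="gpt-5.5", messages=history)
    command = reply.choices[0].message.content.strip()
    if command == "DONE":
        break
    # Act: run the proposed command in the working directory.
    result = subprocess.run(
        command, shell=True, capture_output=True, text=True, timeout=120
    )
    # Observe: feed stdout/stderr back so the next step can react to failures.
    history.append({"role": "assistant", "content": command})
    history.append({"role": "user", "content": result.stdout + result.stderr})
```

Real harnesses layer sandboxing, diff review, and richer stopping criteria on top; the sketch only shows the feedback loop that separates agentic evaluation from single-prompt quizzes.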

No Expert-SWE score has been disclosed, however, leaving it unclear how the model actually performs on those extended engineering projects. The rollout is also limited to paid tiers, so broader accessibility remains uncertain. Ultimately, the data presented is promising, but real-world impact has yet to be demonstrated.


Common Questions Answered

How does GPT-5.5 perform on the Expert-SWE benchmark for long-horizon coding tasks?

OpenAI says GPT-5.5 was evaluated on Expert-SWE, an internal benchmark that measures tasks with a median estimated human completion time of 20 hours, although the model's score has not been published. The benchmark matters because it assesses the ability to handle complex, extended engineering work like large refactors, feature builds, and deep codebase debugging.

What makes GPT-5.5 different from previous OpenAI language models?

GPT-5.5 is the first fully retrained base model since GPT-4.5, achieving impressive scores of 82.7% on Terminal-Bench 2.0 and 84.9% on GDPval. The model is designed to be an agentic assistant that understands underlying goals, rather than simply following a rigid checklist of instructions.

What are the key challenges in evaluating GPT-5.5's capabilities?

While GPT-5.5 shows promising benchmark results, the evaluation leans in part on internal benchmarks and specific task families. The true test of the model lies in its ability to perform prolonged, iterative work across complex coding environments and interdependent modules.