GPT-5.5 scores 71.4% on expert cybersecurity tasks, edging Mythos Preview's 68.6%
Why does this matter? The latest round of AI‑driven security tests pits OpenAI's GPT‑5.5 against the much‑talked‑about Mythos Preview in a head‑to‑head evaluation of "Expert"‑level tasks. Both models are measured on the same benchmark, and the numbers reveal a narrow gap that could influence how enterprises choose automated defenses.
While the test suite includes everything from threat‑intel summarisation to code‑level reverse engineering, the most demanding challenges require the model to generate functional tools—like a disassembler capable of parsing a Rust binary. The assessment, conducted by the AI Security Institute (AISI), reports average pass rates that sit just above the statistical noise. But here’s the thing: even a few percentage points can shift confidence levels among security teams weighing AI assistance against traditional methods.
The following excerpt captures the exact figures and a concrete example that illustrates where GPT‑5.5 nudges ahead, albeit within the margin of error.
In one particularly difficult task that involved building a disassembler to decode a Rust binary, AISI notes that "GPT-5.5 solved the challenge in 10 minutes and 22 seconds with no human assistance at a cost of $1.73" in API calls.
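The benchmark task involved a full disassembler for a Rust binary; as a much smaller illustration of the core idea, a disassembler is essentially a table-driven decoder that maps opcode bytes back to mnemonics. The opcode table and byte sequence below are illustrative only, not drawn from the AISI report:

```python
# Toy table-driven disassembler sketch: maps single x86-64 opcode bytes
# to mnemonics. A real disassembler (like the one in the AISI task) must
# also handle prefixes, multi-byte opcodes, and ModR/M operand encoding.
OPCODES = {
    0x55: "push rbp",
    0x5D: "pop rbp",
    0x90: "nop",
    0xC3: "ret",
}

def disassemble(code: bytes) -> list[str]:
    """Decode a byte string into mnemonics, one per single-byte opcode."""
    out = []
    for byte in code:
        # Unknown bytes are emitted as raw data, as real tools do.
        out.append(OPCODES.get(byte, f"db 0x{byte:02x}"))
    return out

print(disassemble(b"\x55\x90\xc3"))  # → ['push rbp', 'nop', 'ret']
```

The distance between this toy and a tool that parses a stripped Rust binary is exactly what makes the benchmark task "Expert"-level.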
What does the data actually tell us? AISI's evaluation shows GPT‑5.5 achieving 71.4% on the highest‑level "Expert" cybersecurity tasks, a shade above Mythos Preview's 68.6%. The margins of error overlap, however, leaving it unclear whether the gap reflects a genuine advantage or statistical noise.
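To see why a 2.8‑point gap can disappear into noise, consider a normal-approximation confidence interval around each pass rate. The task count `n` below is a hypothetical assumption for illustration; the AISI report's actual sample size is not given here:

```python
import math

def pass_rate_ci(p: float, n: int, z: float = 1.96) -> tuple[float, float]:
    """Approximate 95% (Wald) confidence interval for a pass rate p over n tasks."""
    se = math.sqrt(p * (1 - p) / n)
    return (p - z * se, p + z * se)

n = 35  # hypothetical number of Expert-level tasks, not from the report
gpt_lo, gpt_hi = pass_rate_ci(0.714, n)
mythos_lo, mythos_hi = pass_rate_ci(0.686, n)

# Intervals overlap whenever each lower bound sits below the other's upper bound.
overlap = gpt_lo < mythos_hi and mythos_lo < gpt_hi
print(f"GPT-5.5:        [{gpt_lo:.3f}, {gpt_hi:.3f}]")
print(f"Mythos Preview: [{mythos_lo:.3f}, {mythos_hi:.3f}]")
print("overlap:", overlap)
```

At any plausible sample size in this range the two intervals overlap heavily, which is the sense in which the reported gap sits "just above the statistical noise."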
In the most demanding scenario—a task that required building a disassembler to decode a Rust binary—AISI notes that “GPT …” performed sufficiently to complete the assignment, yet the report stops short of declaring mastery. Anthropic’s decision to limit Mythos Preview to “critical industry partners” underscores the perceived risk, but the new figures suggest OpenAI’s model can hold its own in comparable evaluations. Whether this parity will translate into practical security tools remains uncertain; the tests are controlled, and real‑world deployment brings variables the study does not capture.
For now, the numbers point to a modest edge for GPT‑5.5, tempered by the inherent uncertainty of early‑stage benchmarking.
Further Reading
- Unpacking the GPT-5.5 System Card - Ken Huang Substack
- GPT-5.5 System Card - OpenAI Deployment Safety Hub
- OpenAI announces GPT-5.5 - Daily.dev