
GPT-5.2 leads FrontierScience test, but falters on real research tasks


OpenAI’s latest benchmark, the FrontierScience test, pits its newest model, GPT‑5.2, against a suite of research‑oriented problems that demand more than quick fact‑checking. Unlike typical language‑model evaluations that reward a single correct response, this assessment asks systems to navigate multi‑step investigations, mirroring the kind of work human scholars spend hours on. The test’s design forces models to produce nuanced arguments, which are graded on a ten‑point rubric, with the earlier GPT‑5 performing the automated scoring at high reasoning intensity.

While the headline‑grabbing result shows GPT‑5.2 edging out its predecessor, the broader picture remains mixed: the tasks still prove challenging for even the most advanced iteration. The implications are clear—raw language ability alone isn’t enough when the bar is set at sustained, deep‑thinking research performance.

OpenAI says each task should take at least three to five hours to solve. Instead of a single correct answer, research tasks are scored on a ten-point rubric, with GPT-5 handling the automated grading at high reasoning intensity.

GPT-5.2 leads, but research tasks remain tough

All reasoning models were tested at "high" reasoning intensity, with GPT-5.2 also tested at "xhigh", and without browsing enabled.
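To make the rubric-based setup concrete, the sketch below shows what automated grading with an LLM judge can look like. It uses the OpenAI Python SDK's chat completions interface; the grader model name, rubric text, and prompt wording are illustrative assumptions, not OpenAI's actual FrontierScience pipeline.

```python
# Minimal sketch of rubric-based automated grading with an LLM judge.
# Model name, rubric wording, and prompts are illustrative assumptions;
# this is not OpenAI's actual FrontierScience grading pipeline.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

RUBRIC = """Score the candidate answer from 0 to 10:
- correctness of the core scientific claims (0-4)
- depth and soundness of the multi-step reasoning (0-4)
- clarity and completeness of the final argument (0-2)
Reply with a single integer."""

def grade_answer(task: str, answer: str, grader_model: str = "gpt-5") -> int:
    """Ask a grader model to score one research answer on a ten-point rubric."""
    response = client.chat.completions.create(
        model=grader_model,
        reasoning_effort="high",  # grading at high reasoning intensity, per the article
        messages=[
            {"role": "system", "content": RUBRIC},
            {"role": "user", "content": f"Task:\n{task}\n\nCandidate answer:\n{answer}"},
        ],
    )
    # Assumes the grader follows the instruction to reply with a bare integer.
    return int(response.choices[0].message.content.strip())

# Usage: averaging rubric scores over all tasks gives a benchmark percentage.
# scores = [grade_answer(t, a) for t, a in zip(tasks, answers)]
# print(sum(scores) / (10 * len(scores)) * 100, "percent")
```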

GPT-5.2 scores 77 percent on the Olympiad set and 25 percent on Research. Gemini 3 Pro trails closely behind on Olympiad at 76 percent. On Research, GPT-5.2 and GPT-5 tie for first place, and in a twist OpenAI calls "surprising," GPT-5 significantly outperforms the newer GPT-5.1, which manages only about 19 percent.

Claude Opus 4.5 hits 71 percent on Olympiad and 18 percent on Research. Grok 4 scores 66.2 percent and 16 percent respectively. The older GPT-4o lags far behind at 12 percent on Olympiad and under one percent on Research.

OpenAI's first reasoning model, o1, released last September, marked a major leap forward.

More compute means better results

Performance scales with compute time. GPT-5.2 jumps from 67.5 percent at low reasoning intensity to 77 percent at the highest setting on the Olympiad set.

On Research, scores climb from 18 to 25 percent. OpenAI's o3 model bucks the trend on Research: it actually does slightly worse at high reasoning intensity than at medium. The company calls this "surprising" but doesn't explain why.
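The scaling pattern amounts to a sweep over the reasoning-effort setting. Here is a rough sketch of how such a sweep could be run against the Chat Completions API; the model names and the "xhigh" effort level are taken from the article and treated as assumptions rather than confirmed API options.

```python
# Sketch of sweeping reasoning intensity for one benchmark prompt.
# Model names and the "xhigh" level come from the article and are assumptions
# here, not confirmed values accepted by the public API.
from openai import OpenAI

client = OpenAI()

EFFORT_LEVELS = ["low", "medium", "high", "xhigh"]

def answer_at_effort(prompt: str, model: str, effort: str) -> str:
    """Request one answer at a given reasoning-effort setting (no browsing)."""
    response = client.chat.completions.create(
        model=model,
        reasoning_effort=effort,  # higher effort -> more "thinking" compute per task
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content

# Scoring each answer with a rubric grader like the one sketched above, then
# averaging per effort level, would reproduce the kind of curve OpenAI reports
# (e.g. 67.5 percent at low effort rising to 77 percent at the highest setting).
```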

OpenAI says the results show real progress on expert-level questions but leave plenty of room for improvement, especially on open research tasks.


GPT‑5.2’s lead on the FrontierScience benchmark is undeniable, yet the picture is far from complete. A 92 percent score on the GPQA suite marks a striking jump from GPT‑4’s 39 percent two years earlier, and OpenAI argues that such progress forces a tightening of evaluation standards. However, the benchmark’s design—tasks that should take three to five hours and are judged on a ten‑point rubric—still abstracts away the messier realities of day‑to‑day research.

Can these headline numbers survive the scrutiny of actual scientific workflows? Research‑level tasks remain tough, and even at the highest reasoning intensity GPT‑5.2 still falters on genuine research problems.

It remains unclear whether the model’s reasoning depth will bridge that gap, or whether the current scoring framework captures the nuances of hypothesis generation, experimental design, and iterative analysis. The results highlight both rapid improvement and persistent limitations, a reminder that benchmark success does not automatically translate into reliable research assistance.


Common Questions Answered

How did GPT‑5.2 perform on the Olympiad and Research subsets of the FrontierScience test?

GPT‑5.2 achieved a 77 percent score on the Olympiad set and only 25 percent on the Research subset. These results show strong performance on structured problems but difficulty with longer, multi‑step investigations.

What scoring system does the FrontierScience benchmark use for research‑oriented tasks?

The benchmark grades each research task on a ten‑point rubric rather than a single correct answer. This approach requires models to generate nuanced arguments that are evaluated for depth and accuracy.

Did GPT‑5.2 have browsing enabled during the FrontierScience evaluation, and why is that relevant?

No, GPT‑5.2 was tested without browsing enabled, meaning it could not retrieve up‑to‑date external information. This restriction emphasizes the model’s internal reasoning abilities rather than reliance on live web searches.

How does GPT‑5.2’s 92 percent score on the GPQA suite compare to GPT‑4’s performance two years earlier?

GPT‑5.2’s 92 percent result represents a dramatic improvement over GPT‑4’s 39 percent score from two years prior. OpenAI cites this jump as evidence that evaluation standards need to become stricter.