Journalist points to a monitor showing Gemini 3 Pro and GPT‑5 logos next to a graduate‑level physics equation.

Gemini 3 Pro and GPT-5 stumble on graduate‑level physics benchmark


When a model brands itself "Pro," you might wonder whether that matters for real science. Researchers have lately been pushing large language models past textbook drills, handing them problems that have never been published, the kind a graduate student would normally have to design from scratch. To test that, a team of physicists assembled a benchmark, CritPt, that tries to capture that level of novelty, and Artificial Analysis independently ran Google's latest Gemini 3 Pro Preview and OpenAI's GPT-5 against it.

The premise is straightforward: if a model can wrestle with truly open-ended questions, maybe it could act as a research assistant instead of just a chatbot. The numbers, though, are sobering. In the independent run, Gemini 3 Pro Preview managed only a single-digit success rate, about 9.1 percent accuracy, while using roughly 10 percent fewer tokens than the second-place model.

It seems we’re still a ways off from handing a thesis to a bot, and the result leaves me wondering how close we really are to AI that can shoulder graduate-level scientific work.

The benchmark asks models to solve original, unpublished research problems that resemble the work of a capable graduate student starting an independent project. In an independent evaluation by Artificial Analysis, Google's "Gemini 3 Pro Preview" reached just 9.1 percent accuracy while using 10 percent fewer tokens than OpenAI's "GPT-5.1 (high)," which placed second at 4.9 percent. Even at the top of the leaderboard, the systems miss the vast majority of tasks.

Doctoral-level reasoning remains a major hurdle

CritPt includes 71 full research challenges from eleven physics fields, such as quantum physics, astrophysics, high-energy physics, and biophysics. To prevent guessing or retrieval, all problems are based on unpublished material. The team also broke each challenge into 190 smaller "checkpoints" to measure partial progress.
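
To make the checkpoint idea concrete, here is a minimal sketch of how partial credit over checkpoints might be aggregated. It is my own illustration, not the official CritPt scoring harness, and the challenge names, function names, and weights are hypothetical.

```python
# Hypothetical sketch of checkpoint-based partial scoring (not CritPt's code):
# each full challenge is split into smaller checkpoints, and a model earns
# partial credit for the fraction of checkpoints it clears.

def challenge_score(checkpoint_results: list[bool]) -> float:
    """Fraction of a challenge's checkpoints the model solved (0.0 to 1.0)."""
    if not checkpoint_results:
        return 0.0
    return sum(checkpoint_results) / len(checkpoint_results)

def benchmark_accuracy(all_challenges: dict[str, list[bool]]) -> float:
    """Mean per-challenge score across the whole suite."""
    scores = [challenge_score(results) for results in all_challenges.values()]
    return sum(scores) / len(scores)

# Example with made-up results for three fictional challenges:
results = {
    "quantum_dot_relaxation": [True, False, False],  # 1 of 3 checkpoints
    "jet_substructure_fit":   [True, True, False],   # 2 of 3 checkpoints
    "membrane_fluctuations":  [False, False],         # 0 of 2 checkpoints
}
print(f"Suite score: {benchmark_accuracy(results):.1%}")
```

Reporting a per-challenge fraction like this is what lets the team credit "measurable improvement on simpler, well-defined subtasks" even when the full challenge fails.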

The findings offer a reality check: current large language models lack the rigor, creativity, and precision needed to solve open-ended physics problems on their own. Still, the models show measurable improvement on simpler, well-defined subtasks, which suggests that targeted support roles may be more realistic. The team also tested consistency using a stricter metric called the "consistently solved rate," which requires a model to give the correct answer four out of five times.

Under this requirement, performance collapses across the board, showing how fragile model reasoning remains even on tasks the models sometimes solve. This lack of robustness creates a serious challenge for research workflows. The models often produce answers that look convincing but contain subtle errors that are difficult to catch, which can easily mislead researchers and require time-consuming expert review.
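
The "consistently solved" criterion is easy to express in code. The sketch below is my own illustration of the rule as described above, assuming five attempts per problem and a four-out-of-five threshold; it is not the benchmark's implementation.

```python
# Illustrative "consistently solved rate": a problem counts only if the model
# answers it correctly in at least 4 of 5 attempts (per the article's
# description; this is not the benchmark's actual code).

ATTEMPTS_TOTAL = 5
ATTEMPTS_REQUIRED = 4

def consistently_solved(attempt_outcomes: list[bool]) -> bool:
    """True if the model was correct on at least 4 of its 5 attempts."""
    assert len(attempt_outcomes) == ATTEMPTS_TOTAL
    return sum(attempt_outcomes) >= ATTEMPTS_REQUIRED

def consistently_solved_rate(per_problem_outcomes: list[list[bool]]) -> float:
    """Share of problems the model solves consistently."""
    solved = sum(consistently_solved(o) for o in per_problem_outcomes)
    return solved / len(per_problem_outcomes)

# Made-up outcomes for four problems, five attempts each:
outcomes = [
    [True, True, True, True, False],   # 4/5 -> consistently solved
    [True, False, True, False, True],  # 3/5 -> not counted
    [False] * 5,                       # never solved
    [True] * 5,                        # 5/5 -> consistently solved
]
print(f"Consistently solved rate: {consistently_solved_rate(outcomes):.0%}")
```

Because a single wrong attempt out of five can disqualify a problem, this metric is far stricter than plain accuracy, which is why scores collapse under it.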

The researchers argue that, for the foreseeable future, the more realistic goal is not an "AI scientist" replacing human experts, but a "research assistant" automating specific workflow steps.


The CritPt benchmark was put together by over 50 physicists at 30 different institutions, and it tries to push AI past simple textbook recall into the sort of open-ended problems a graduate student might face. The test gave models brand-new, unpublished research questions that look a lot like early-stage PhD work. In the independent check by Artificial Analysis, Google's Gemini 3 Pro Preview managed only about 9.1 percent accuracy, albeit while using roughly 10 percent fewer tokens than the runner-up, a result far from what we'd call autonomous scientific performance.

GPT-5.1, at 4.9 percent, fared even worse, tripping over the same tasks and underscoring the gap between today's capabilities and real research demands. It appears that, even with rapid progress, the leading systems still struggle to reliably solve novel physics problems at a level that would support independent investigation. Whether future versions will close that gap is still unclear; the benchmark gives us a concrete yardstick for where improvement is needed.

For now, the data point to a sizable shortfall in AI’s ability to act as an autonomous scientist in graduate-level physics.

Common Questions Answered

What accuracy did Gemini 3 Pro Preview achieve on the CritPt benchmark, and how does that compare to GPT‑5.1 (high)?

In the independent evaluation by Artificial Analysis, Gemini 3 Pro Preview attained 9.1 percent accuracy on the CritPt graduate‑level physics benchmark. This was nearly double the 4.9 percent accuracy recorded by OpenAI's GPT‑5.1 (high) on the same set of tasks.

How many tokens did Gemini 3 Pro use relative to GPT‑5 during the benchmark, and why is this notable?

Gemini 3 Pro Preview ran the benchmark using roughly 10 percent fewer tokens than GPT‑5.1 (high). The reduced token consumption is notable because it suggests more efficient use of compute, even though both models still missed the majority of problems.

Who created the CritPt benchmark and what is its purpose in evaluating AI models?

The CritPt benchmark was assembled by more than 50 physicists from 30 institutions to push AI beyond textbook recall toward open‑ended, graduate‑level research problems. Its purpose is to assess whether large language models can independently tackle original, unpublished physics questions that a PhD student might encounter.

What does the benchmark reveal about the current state of autonomous scientific performance in LLMs like Gemini 3 Pro and GPT‑5?

The benchmark shows that even the top‑performing models, Gemini 3 Pro Preview and GPT‑5.1, correctly solved fewer than ten percent of the tasks, indicating they fall far short of autonomous scientific performance. This highlights that current LLMs still struggle with the creativity and problem‑solving required for early‑stage PhD research.