Gemini 3 Pro and GPT-5 stumble on graduate‑level physics benchmark
Why does a model that touts "pro" in its name matter to anyone doing real science? Because researchers have started testing large language models on tasks that go beyond textbook exercises: problems that have never been published and that would normally require a graduate student to design an independent experiment from scratch. The CritPt benchmark mirrors that level of novelty, and Artificial Analysis let the latest versions of Google's Gemini 3 Pro Preview and OpenAI's GPT‑5 take a crack at it.
The idea is simple: if a model can navigate truly open‑ended questions, it might serve as a research assistant rather than just a chatbot. What the numbers reveal, however, is sobering. In that independent evaluation, Google's Gemini 3 Pro Preview managed only a single‑digit success rate, hitting 9.1 percent accuracy while using roughly 10 percent fewer tokens than its closest competitor.
The result raises questions about how close we really are to AI that can shoulder graduate‑level scientific work.
The benchmark asks models to solve original, unpublished research problems that resemble the work of a capable graduate student starting an independent project. In an independent evaluation by Artificial Analysis, Google's "Gemini 3 Pro Preview" reached just 9.1 percent accuracy while using 10 percent fewer tokens than OpenAI's "GPT-5.1 (high)," which placed second at 4.9 percent. Even at the top of the leaderboard, the systems miss the vast majority of tasks.
Doctoral-level reasoning remains a major hurdle
CritPt includes 71 full research challenges from eleven physics fields, such as quantum physics, astrophysics, high-energy physics, and biophysics. To prevent guessing or retrieval, all problems are based on unpublished material. The team also broke the challenges down into 190 smaller "checkpoints" to measure partial progress.
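To make the checkpoint idea concrete, here is a minimal sketch of how partial progress could be tallied by scoring each intermediate step separately instead of grading a challenge pass/fail. The data layout and the fraction-of-checkpoints scoring rule are illustrative assumptions, not CritPt's published methodology.

```python
# Illustrative sketch: score a research challenge by its checkpoints instead of all-or-nothing.
# The data layout and scoring rule are assumptions for illustration, not CritPt's actual format.
from dataclasses import dataclass

@dataclass
class Checkpoint:
    name: str
    solved: bool  # did the model's answer match the reference for this step?

def challenge_score(checkpoints: list[Checkpoint]) -> float:
    """Fraction of checkpoints solved; 1.0 means the full challenge was completed."""
    if not checkpoints:
        return 0.0
    return sum(cp.solved for cp in checkpoints) / len(checkpoints)

# Example: a model that completes 2 of 3 intermediate steps still registers measurable progress.
steps = [Checkpoint("derive the effective Hamiltonian", True),
         Checkpoint("set up the perturbation series", True),
         Checkpoint("compute the final cross-section", False)]
print(f"partial credit: {challenge_score(steps):.2f}")  # 0.67
```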
The findings offer a reality check: current large language models lack the rigor, creativity, and precision needed to solve open-ended physics problems on their own. Still, the models show measurable improvement on simpler, well-defined subtasks, which suggests that targeted support roles may be more realistic. The team also tested consistency using a stricter metric called the "consistently solved rate," which requires a model to give the correct answer four out of five times.
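For intuition, the "consistently solved rate" can be read as "correct in at least four of five independent attempts." The sketch below tallies that stricter bar, assuming five boolean attempt results per task; it mirrors the description above, not the team's actual evaluation code.

```python
# Minimal sketch of a "consistently solved" tally: a task counts only if the model answers
# correctly in at least 4 of 5 attempts. Data shapes here are assumed for illustration.
def consistently_solved_rate(attempts_per_task: list[list[bool]], threshold: int = 4) -> float:
    """Fraction of tasks answered correctly in at least `threshold` of the attempts."""
    solved = sum(sum(attempts) >= threshold for attempts in attempts_per_task)
    return solved / len(attempts_per_task)

# Example: a task solved 3 of 5 times looks fine on average accuracy but fails this stricter bar.
tasks = [
    [True, True, True, True, False],    # consistently solved
    [True, False, True, True, False],   # solved sometimes, but not consistently
    [False, False, False, False, False],
]
print(f"{consistently_solved_rate(tasks):.2%}")  # 33.33%
```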
Under this requirement, performance collapses across the board, showing how fragile model reasoning remains even on tasks the models sometimes solve. This lack of robustness creates a serious challenge for research workflows. The models often produce answers that look convincing but contain subtle errors that are difficult to catch, which can easily mislead researchers and require time-consuming expert review.
The researchers argue that, for the foreseeable future, the more realistic goal is not an "AI scientist" replacing human experts, but a "research assistant" automating specific workflow steps.
The CritPt benchmark, assembled by more than 50 physicists across 30 institutions, pushes AI beyond textbook recall toward the kind of open‑ended problems a graduate student would tackle. Models were asked to address original, unpublished research questions that mirror early‑stage PhD work. In the independent evaluation by Artificial Analysis, Google's Gemini 3 Pro Preview achieved just 9.1 percent accuracy while using roughly 10 percent fewer tokens than the runner-up, an outcome that falls far short of autonomous scientific performance.
GPT‑5, likewise, stumbled on the same tasks, reinforcing the gap between current capabilities and the demands of genuine research. The results suggest that, despite rapid advances, leading systems still cannot reliably generate or solve novel physics problems at the level required for independent investigation. Whether future iterations will bridge this divide remains uncertain; the benchmark highlights a concrete metric where progress is needed.
For now, the evidence points to a substantial shortfall in AI’s ability to act as an autonomous scientist in graduate‑level physics.
Further Reading
- Gemini 3 Tops New Physics Research Benchmark, Nearly Doubles Score Over GPT-5.1 - OfficeChai
- Gemini 3 vs GPT-5 vs Claude 4.5 vs Grok 4.1 - The Ultimate Reasoning Performance Battle - Vertu
- Google launches Gemini 3 with new coding app and record benchmark scores - TechCrunch
- Gemini 3 Pro — new GDM frontier model 6 - Smol AI News
- Gemini 3: Google's Most Powerful LLM - DataCamp
Common Questions Answered
What accuracy did Gemini 3 Pro Preview achieve on the CritPt benchmark, and how does that compare to GPT‑5.1 (high)?
In the independent evaluation by Artificial Analysis, Gemini 3 Pro Preview attained 9.1 percent accuracy on the CritPt graduate‑level physics benchmark. That nearly doubles the 4.9 percent accuracy recorded by OpenAI's GPT‑5.1 (high) on the same set of tasks.
How many tokens did Gemini 3 Pro use relative to GPT‑5 during the benchmark, and why is this notable?
Gemini 3 Pro Preview completed the benchmark using roughly 10 percent fewer tokens than GPT‑5.1 (high). The reduced token consumption is notable because it suggests a more efficient use of compute, even though both models still missed the majority of problems.
Who created the CritPt benchmark and what is its purpose in evaluating AI models?
The CritPt benchmark was assembled by more than 50 physicists from 30 institutions to push AI beyond textbook recall toward open‑ended, graduate‑level research problems. Its purpose is to assess whether large language models can independently tackle original, unpublished physics questions that a PhD student might encounter.
What does the benchmark reveal about the current state of autonomous scientific performance in LLMs like Gemini 3 Pro and GPT‑5?
The benchmark shows that even the top‑performing models, Gemini 3 Pro Preview and GPT‑5.1, correctly solved fewer than ten percent of the tasks, indicating they fall far short of autonomous scientific performance. This highlights that current LLMs still struggle with the creativity and problem‑solving required for early‑stage PhD research.