Google DeepMind AI co‑clinician beats GPT‑5.4 in blind tests, lags docs
DeepMind’s latest “AI co‑clinician” has just outscored GPT‑5.4 in a series of blind assessments, yet it still falls short of seasoned doctors. Why does that gap matter? While the headline touts a win over a well‑known language model, the real test was whether the system could function as a genuine team member in a hospital ward.
To find out, researchers paired the AI with academic physicians and evaluated it with an adapted version of the NOHARM framework, a checklist that flags two distinct categories of mistakes. The setup mimics a clinician’s daily workflow, forcing the algorithm to juggle diagnosis, treatment suggestions, and safety checks under human supervision. Results show the AI can beat a general-purpose language model, but the error profile remains a concern.
The next step, then, is to see how the system performs when its role is defined not just as a tool, but as a collaborative partner in patient care.
In a blind comparison using 98 realistic primary care queries, doctors consistently picked the AI co-clinician's answers over leading evidence synthesis tools. It won 67 to 26 against an existing clinical AI system and 63 to 30 against GPT-5.4-thinking-with-search. In the objective analysis, the system logged a critical error in one of the 98 cases.
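The head-to-head tallies are easier to compare as shares of the 98 queries. The short Python sketch below is illustrative only: it uses the counts reported above, and `preference_shares` is a hypothetical helper, not part of the study. The article does not say how the queries outside the two win counts were scored, so the code labels them only as unaccounted.

```python
TOTAL_QUERIES = 98  # primary care queries in the blind comparison

# (answers preferred for the AI co-clinician, answers preferred for the comparator),
# as reported in the article
RESULTS = {
    "existing clinical AI system": (67, 26),
    "GPT-5.4-thinking-with-search": (63, 30),
}

def preference_shares(wins, losses, total=TOTAL_QUERIES):
    """Return percent shares of all queries, plus the count in neither tally.

    Hypothetical helper: the article does not explain the leftover queries,
    so they are reported as 'unaccounted' rather than as ties.
    """
    return {
        "co_clinician_pct": round(100 * wins / total, 1),
        "comparator_pct": round(100 * losses / total, 1),
        "unaccounted": total - wins - losses,
    }

for name, (wins, losses) in RESULTS.items():
    print(name, preference_shares(wins, losses))
```

In both comparisons the two tallies sum to 93, so five of the 98 queries fall outside the reported win counts.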
The system also led on medication questions, by a wide margin over doctors and a narrow one over GPT-5.4. The RxQA benchmark covers 600 questions on active ingredients, interactions, and dosages, drawn from national drug directories in two countries and vetted by licensed pharmacists. These questions are tough for primary care doctors: with reference books, they got 61.3 percent right, and just 48.3 percent without.
The AI co-clinician scored 73.3 percent, just ahead of GPT-5.4-thinking-with-search at 72.7 percent. The gap widened when questions were asked open-ended rather than as multiple choice, the way doctors actually look things up on the job. Here the AI co-clinician hit a quality score of 95.0 percent, compared to 90.9 percent for OpenAI's model.
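To put the RxQA figures side by side, the sketch below computes each gap to the AI co-clinician in percentage points. It is a rough illustration using only the scores reported above. The helper name `gaps_vs` is an assumption, and the open-ended "quality score" is treated as a separate metric that is not compared with multiple-choice accuracy.

```python
# Scores in percent, as reported in the article
RXQA_MULTIPLE_CHOICE = {
    "AI co-clinician": 73.3,
    "GPT-5.4-thinking-with-search": 72.7,
    "doctors with reference books": 61.3,
    "doctors without reference books": 48.3,
}

# Open-ended quality scores (a different metric from the accuracy above)
RXQA_OPEN_ENDED = {
    "AI co-clinician": 95.0,
    "GPT-5.4-thinking-with-search": 90.9,
}

def gaps_vs(scores, baseline="AI co-clinician"):
    """Hypothetical helper: percentage-point lead of `baseline` over each other entry."""
    base = scores[baseline]
    return {
        name: round(base - score, 1)
        for name, score in scores.items()
        if name != baseline
    }

print("multiple choice:", gaps_vs(RXQA_MULTIPLE_CHOICE))
print("open-ended:", gaps_vs(RXQA_OPEN_ENDED))
```

On these numbers the lead over GPT-5.4 grows from 0.6 points in multiple choice to 4.1 points in the open-ended format, while the multiple-choice lead over doctors is 12.0 points with reference books and 25.0 points without.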
The trial offers a measured glimpse of what an AI co‑clinician might achieve today. Physicians gave the DeepMind system higher marks than GPT‑5.4 in blind scenarios that mimicked routine general‑practice work, suggesting the model can produce clinically relevant answers under realistic constraints. Yet seasoned doctors still outperformed the AI in the evaluation, even though the system scored above primary care doctors on the RxQA medication questions.
How the system will behave when integrated into real‑world workflows remains uncertain, especially since the study focused on response quality rather than outcomes such as patient safety or diagnostic speed. The developers framed the tool as a team member, operating under a clinician’s supervision and evaluated with an adapted NOHARM framework that flags two categories of mistakes. Without data on how often those errors occur in practice, it's unclear whether the current performance level justifies broader adoption.
In short, the AI co‑clinician shows promise relative to other models, but its superiority over human expertise is not established, and further testing will be needed to determine its practical role.