Google DeepMind AI co‑clinician beats GPT‑5.4 in blind tests, lags docs
DeepMind’s latest “AI co‑clinician” has just outscored GPT‑5.4 in a series of blind assessments, yet it still falls short of seasoned doctors. Why does that gap matter? While the headline touts a win over a well‑known language model, the real test was whether the system could function as a genuine team member in a hospital ward.
To find out, researchers paired the AI with academic physicians and ran it through the NOHARM framework, a checklist that flags two distinct categories of mistakes. The setup mimics a clinician's daily workflow, requiring the algorithm to juggle diagnosis, treatment suggestions, and safety checks under human supervision. The results show the AI can beat a generic chatbot, but its error profile remains a concern.
The next step, then, is to see how the system performs when its role is defined not just as a tool, but as a collaborative partner in patient care.
"While it's early days, the promise is clear," says DeepMind researcher Alan Karthikesalingam.
The trial offers a measured glimpse of what an AI co‑clinician might achieve today. Physicians gave the DeepMind system higher marks than GPT‑5.4 in blind scenarios that mimicked routine general‑practice work, suggesting the model can produce clinically relevant answers under realistic constraints. Yet the same evaluations showed seasoned doctors still outperformed the AI, leaving a gap that technology has yet to close.
How the system will behave when integrated into real-world workflows remains uncertain, especially since the study focused on response quality rather than outcomes such as patient safety or diagnostic speed. The developers framed the tool as a team member operating under a clinician's supervision, evaluated with an adapted NOHARM framework that flags two categories of mistakes. Without data on how often those errors occur in practice, it's unclear whether the current performance level justifies broader adoption.
In short, the AI co‑clinician shows promise relative to other models, but its superiority over human expertise is not established, and further testing will be needed to determine its practical role.
Further Reading
- Enabling a new model for healthcare with AI co-clinician - Google DeepMind
- Papers with Code Benchmarks - Papers with Code
- Chatbot Arena Leaderboard - LMSYS