Google's Veo-3 fakes surgical videos but fails the medical logic test: 1.78 for instrument handling, 1.64 for tissue response, 1.61 for surgical logic
Google's latest AI video model, Veo-3, can stitch together lifelike surgical footage that looks convincing at first glance. Researchers showed the system frames from real operating-room clips and asked it to predict how each scene would continue, probing whether it has absorbed the visual cadence of scalpel cuts, camera pans and the glow of cautery. The result is a video generator that mimics the texture of organ surfaces and the motion of laparoscopic tools with a polish that would fool a casual viewer.
Yet the test wasn’t just about visual fidelity. When the evaluation shifted from “does it look right?” to “does it behave like a real procedure?”, the model’s scores fell sharply. In a benchmark focused on abdominal operations, the system was scored on three fronts: how well it handled instruments, how accurately it rendered tissue response, and whether the sequence followed logical surgical steps.
Those numbers tell a different story from the glossy frames. On abdominal procedures, instrument handling earned just 1.78 points out of five, tissue response only 1.64, and surgical logic came in lowest at 1.61. The AI can create convincing images, but it can't reproduce what actually happens in an operating room.
Brain surgery reveals even bigger gaps
The challenge was even greater for brain surgery footage. From the first second, Veo-3 struggled with the fine precision required in neurosurgery. For brain operations, instrument handling dropped to 2.77 points (compared to 3.36 for abdominal) and surgical logic fell as low as 1.13 after eight seconds.
The team also broke down the types of errors. Over 93 percent were related to medical logic: the AI invented tools, imagined impossible tissue responses, or performed actions that made no clinical sense. Only a small fraction of errors (6.2 percent for abdominal and 2.8 percent for brain surgery) were tied to image quality.
Researchers tried giving Veo-3 more context, such as the type of surgery or the exact phase of the procedure. The results showed no meaningful or consistent improvement. According to the team, the real problem isn't the information provided, but the model's inability to process and understand it.
Visual medical understanding is still out of reach
The SurgVeo study shows how far current video AI is from real medical understanding. While future systems could one day help train doctors, assist with surgical planning, or even guide procedures, today's models are nowhere near that level.
Can a model that dazzles visually ever replace a surgeon's judgment? The study suggests not. Veo-3 generates seamless frames, but its grasp of operative reality falls short: on a five-point scale it scored well under two for instrument handling (1.78), tissue response (1.64) and surgical logic (1.61).
Researchers built the SurgVeo benchmark from fifty authentic abdominal and brain procedures, then asked four seasoned surgeons to rate the AI’s predictions. Their verdict was consistent: the video looks plausible, but the underlying medical logic is missing. Consequently, the system’s utility in any clinical context remains doubtful.
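To make that evaluation setup concrete, here is a minimal illustrative sketch in Python of how rubric scores from a panel of raters could be averaged per dimension. It is not the SurgVeo authors' code; every name and number in it is invented, and it only mirrors the three dimensions and the panel-of-surgeons setup described above.

```python
from statistics import mean

# Hypothetical rubric scores (1-5) from four surgeon raters for a single generated clip.
# The dimensions mirror the article: instrument handling, tissue response, surgical logic.
# All numbers are invented for illustration; they are not data from the SurgVeo study.
ratings = {
    "instrument_handling": [2, 2, 1, 2],
    "tissue_response":     [2, 1, 2, 2],
    "surgical_logic":      [1, 2, 1, 2],
}

def aggregate(scores_by_dimension):
    """Average each dimension's scores across raters, rounded to two decimals."""
    return {dim: round(mean(scores), 2) for dim, scores in scores_by_dimension.items()}

if __name__ == "__main__":
    for dimension, avg in aggregate(ratings).items():
        print(f"{dimension}: {avg} / 5")
```

The headline figures quoted above (1.78, 1.64 and 1.61) are rubric scores of this kind on a five-point scale, presumably aggregated across the surgeon panel and the benchmark clips.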
The assessment reveals a gap between visual fidelity and procedural understanding that the current model cannot bridge. Whether future iterations will close that gap is unclear; for now, the study is a sober reminder that visual realism does not equate to functional competence. Until the AI can reliably mirror the nuances of real surgery, its role will likely stay confined to demonstration rather than decision support.
Further Reading
- Google's Veo-3 can fake surgical videos but misses every hint of medical sense - The Decoder
- Veo 3 Prompt Writing Best Practices - Skywork AI Blog
- Veo - Google DeepMind
- Introducing Veo 3.1 and advanced capabilities in Flow - Google Keyword Blog
Common Questions Answered
What scores did Veo-3 receive for instrument handling, tissue response, and surgical logic on the SurgVeo benchmark?
Veo-3 scored 1.78 for instrument handling, 1.64 for tissue response, and 1.61 for surgical logic on a five‑point scale. These low numbers indicate the model’s poor performance in replicating true operative behavior despite its visual realism.
How did researchers get Veo-3 to generate surgical videos?
Researchers prompted Google's general-purpose Veo-3 model with frames from real operating-room clips and asked it to predict how each scene would continue, covering scalpel cuts, camera pans, and cautery glow. The resulting footage looks lifelike enough to fool casual viewers at first glance.
Why did Veo-3 struggle more with brain surgery footage compared to abdominal procedures?
Brain surgery demands finer precision and more complex anatomical detail, which Veo-3 could not accurately reproduce from the first second of video. The model’s inability to capture these subtle movements highlighted larger gaps in its medical accuracy.
What is the SurgVeo benchmark and how was it used in evaluating Veo-3?
The SurgVeo benchmark consists of fifty authentic abdominal and brain procedures compiled for testing AI video generators. Four seasoned surgeons rated Veo-3’s output against this benchmark, consistently finding its operative realism lacking.