Google's Veo-3 fakes convincing surgical videos, but scores just 1.78 for instrument handling, 1.64 for tissue response, and even less for surgical logic
When I first saw a clip generated by Google’s new AI model, Veo-3, it looked almost like real surgery. The team gave the system dozens of actual operating-room videos, so it could pick up the rhythm of scalpel cuts, the way cameras pan, and that faint cautery glow. After that training, Veo-3 can spin out footage that mimics the texture of organ surfaces and the jitter of laparoscopic tools well enough to trick someone just glancing at it.
But the test wasn’t only about how pretty the frames look. Once the reviewers asked, “does it behave like a real procedure?”, the scores dropped fast. In a benchmark on abdominal surgeries, the model was judged on three criteria, each on a five-point scale: handling of instruments, realism of tissue response, and whether the steps made logical sense.
For abdominal procedures, instrument handling earned just 1.78 points, tissue response only 1.64, and surgical logic was lowest at 1.61. The AI could create convincing images, but it couldn’t reproduce what actually happens in an operating room.
The numbers show that the polish you see on screen doesn’t line up with genuine surgical flow; visual fidelity and procedural accuracy remain separate challenges.
Brain surgery reveals even bigger gaps
The challenge was even greater for brain surgery footage. From the first second, Veo-3 struggled with the fine precision required in neurosurgery. For brain operations, instrument handling dropped to 2.77 points (compared to 3.36 for abdominal), and surgical logic fell as low as 1.13 after eight seconds.
The team also broke down the types of errors. Over 93 percent were related to medical logic: the AI invented tools, imagined impossible tissue responses, or performed actions that made no clinical sense. Only a small fraction of errors (6.2 percent for abdominal and 2.8 percent for brain surgery) were tied to image quality.
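To make that split concrete, here is a minimal, hypothetical Python sketch of how error annotations could be tallied into category percentages; the category labels and data structure are illustrative assumptions, not the study’s actual pipeline.

```python
from collections import Counter

# Hypothetical error annotations reviewers might log for generated clips.
# The category names are illustrative, not SurgVeo's actual labels.
annotations = [
    "medical_logic",   # e.g. an invented instrument
    "medical_logic",   # e.g. an impossible tissue response
    "image_quality",   # e.g. blurring or texture artifacts
    "medical_logic",   # e.g. a clinically nonsensical action
]

counts = Counter(annotations)
total = sum(counts.values())

for category, n in counts.most_common():
    print(f"{category}: {100 * n / total:.1f}% of errors")
```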
Researchers tried giving Veo-3 more context, such as the type of surgery or the exact phase of the procedure. The results showed no meaningful or consistent improvement. According to the team, the real problem isn't the information provided, but the model's inability to process and understand it.
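As a rough illustration of that experiment, the sketch below shows how such context might be folded into a text prompt; the wording, field names, and example values are assumptions, since the paper’s exact prompts aren’t reproduced here.

```python
# Hypothetical prompt construction that adds surgery type and procedural phase.
# Field names and example values are invented for illustration only.
def build_prompt(base: str, surgery_type: str | None = None, phase: str | None = None) -> str:
    parts = [base]
    if surgery_type:
        parts.append(f"Surgery type: {surgery_type}.")
    if phase:
        parts.append(f"Current procedural phase: {phase}.")
    return " ".join(parts)

print(build_prompt(
    "Continue this laparoscopic video realistically.",
    surgery_type="gallbladder removal",
    phase="dissection of tissue planes",
))
```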
Visual medical understanding is still out of reach
The SurgVeo study shows how far current video AI is from real medical understanding. While future systems could one day help train doctors, assist with surgical planning, or even guide procedures, today's models are nowhere near that level.
Can a model that dazzles visually ever replace a surgeon’s judgment? The data says probably not. Veo-3 can stitch together smooth video, but the numbers show it barely clears the bottom rung: 1.78 for instrument handling, 1.64 for tissue response, and 1.61 for surgical logic on a five-point scale.
The team behind the study assembled the SurgVeo benchmark from fifty real abdominal and brain operations, then handed the clips to four veteran surgeons for scoring. Their reaction was unanimous: the footage looks convincing, yet the medical reasoning is absent. That makes it hard to see any real clinical use right now.
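To show how a panel’s scores could be combined, here is a hypothetical Python sketch that averages per-clip ratings from several reviewers for each criterion on a five-point scale; the field names and numbers are made up for illustration, not taken from the study.

```python
from statistics import mean

# Hypothetical 1-5 ratings from four reviewers for a single generated clip.
# The criteria mirror the three reported axes; the values are invented.
ratings = {
    "instrument_handling": [2, 1, 2, 2],
    "tissue_response":     [2, 2, 1, 2],
    "surgical_logic":      [1, 2, 2, 1],
}

# Average each criterion across reviewers; benchmark-level scores would then
# be averaged again across all fifty clips.
clip_scores = {criterion: mean(values) for criterion, values in ratings.items()}

for criterion, score in clip_scores.items():
    print(f"{criterion}: {score:.2f} / 5")
```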
The gap between how convincing the video looks and how well the model understands the procedure is simply too wide in the current version. Whether the next generation will narrow that gap remains open; for now, the study is a reminder that visual polish is not the same as functional skill. Until the AI can capture the subtle decisions a surgeon makes, it will probably remain a demo piece rather than a decision aid.
Common Questions Answered
What scores did Veo-3 receive for instrument handling, tissue response, and surgical logic on the SurgVeo benchmark?
Veo-3 scored 1.78 for instrument handling, 1.64 for tissue response, and 1.61 for surgical logic on a five‑point scale. These low numbers indicate the model’s poor performance in replicating true operative behavior despite its visual realism.
How was the Veo-3 model trained to generate surgical videos?
Researchers fed Veo-3 dozens of operating‑room clips, allowing it to learn the visual cadence of scalpel cuts, camera pans, and cautery glow. This training enabled the model to stitch together lifelike frames that can initially fool casual viewers.
Why did Veo-3 struggle more with brain surgery footage compared to abdominal procedures?
Brain surgery demands finer precision and more complex anatomical detail, which Veo-3 could not accurately reproduce from the first second of video. The model’s inability to capture these subtle movements highlighted larger gaps in its medical accuracy.
What is the SurgVeo benchmark and how was it used in evaluating Veo-3?
The SurgVeo benchmark consists of fifty authentic abdominal and brain procedures compiled for testing AI video generators. Four seasoned surgeons rated Veo-3’s output against this benchmark, consistently finding its operative realism lacking.