

Google DeepMind's Vision Banana Outperforms SAM 3 and Depth Anything V3


Google DeepMind’s latest model, dubbed Vision Banana, has just topped two well‑known benchmarks: it outperformed Meta’s SAM 3 on segmentation and eclipsed Depth Anything V3 on metric depth estimation. The result is striking because both reference systems have been the go‑to baselines for visual perception tasks in recent research. Yet Vision Banana arrived not as a purpose‑built segmenter or depth estimator, but as an instruction‑tuned image generator.

While the model’s primary training objective was to produce pictures from text prompts, the evaluation shows it can repurpose that knowledge for tasks it was never explicitly trained on. If a generator can double as a competent recognizer, the line between “generative” and “perceptual” AI may be thinner than most assume. That raises questions about how much of a model’s internal representation is shaped by the data rather than by the task label.

The upcoming key takeaways spell out why this matters for the broader vision community.

Key Takeaways

- Image generation pretraining is a generalist vision learner: Just as LLM pretraining unlocks emergent language understanding, Google's research shows that training on image generation naturally develops powerful internal visual representations that transfer to perception tasks like segmentation, depth estimation, and surface normal estimation.
- Vision Banana beats specialist models without specialist architecture: Built by lightweight instruction-tuning of Nano Banana Pro, Vision Banana surpasses SAM 3 on three segmentation benchmarks, Depth Anything V3 on metric depth estimation (δ1: 0.929 vs 0.918), and Lotus-2 on surface normal estimation (mean angular error: 18.928° vs 19.642°), all in zero-shot transfer settings.
- All vision tasks are reframed as image generation: By parameterizing vision task outputs as RGB images with decodable color schemes, Vision Banana uses a single set of weights and prompt-only switching across semantic segmentation, instance segmentation, depth estimation, and surface normal estimation, with no task-specific modules required.
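To make the "vision tasks as RGB images" idea concrete, here is a minimal sketch of what a decodable color scheme could look like. The paper's actual encoding is not specified here; this example assumes, purely for illustration, that metric depth is packed into the red and green channels as a 16-bit fixed-point value, so a generated image can be turned back into a depth map with a few array operations.

```python
import numpy as np

# Hypothetical color scheme: depth in [0, max_depth] meters is quantized
# to 16 bits and split across the red (high byte) and green (low byte)
# channels of an ordinary uint8 RGB image.

def encode_depth_rgb(depth: np.ndarray, max_depth: float = 10.0) -> np.ndarray:
    """Encode an HxW depth map (meters) as an HxWx3 uint8 RGB image."""
    code = np.clip(depth / max_depth, 0.0, 1.0) * 65535.0
    code = code.astype(np.uint16)
    rgb = np.zeros(depth.shape + (3,), dtype=np.uint8)
    rgb[..., 0] = code >> 8      # high byte -> red channel
    rgb[..., 1] = code & 0xFF    # low byte  -> green channel
    return rgb

def decode_depth_rgb(rgb: np.ndarray, max_depth: float = 10.0) -> np.ndarray:
    """Recover the HxW metric depth map from the encoded RGB image."""
    r = rgb[..., 0].astype(np.uint16)
    g = rgb[..., 1].astype(np.uint16)
    code = (r << 8) | g          # reassemble the 16-bit code per pixel
    return code.astype(np.float32) / 65535.0 * max_depth
```

The appeal of this framing is that every task output lives in the model's native output space (an image), so switching tasks is just a matter of changing the prompt rather than swapping decoder heads.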

Vision Banana marks a notable shift in how image‑generation pretraining is viewed. By topping SAM 3 on segmentation and surpassing Depth Anything V3 on metric depth estimation, the model shows that generative training can yield representations useful for classic perception tasks. The authors liken this to the way large‑language‑model pretraining unlocked emergent linguistic abilities, suggesting a parallel in the visual domain.

Yet the paper stops short of claiming universal applicability; it remains unclear whether the same transfer will hold across more diverse datasets or real‑world deployments. The results challenge the long‑standing split between generative and discriminative research tracks, prompting a re‑examination of entrenched assumptions. Critics may point out that the benchmarks used represent a narrow slice of vision problems, and further work will be needed to assess robustness under varied conditions.

Nonetheless, the evidence presented provides a concrete example that image‑generation objectives can cultivate internal visual knowledge that extends beyond pure synthesis, opening a modest but tangible avenue for future exploration.


Common Questions Answered

How did Vision Banana outperform specialized models like SAM 3 and Depth Anything V3?

Vision Banana achieved superior performance by leveraging image generation pretraining, which naturally develops powerful internal visual representations. Unlike purpose-built segmentation or depth estimation models, this instruction-tuned image generator demonstrated remarkable transfer learning capabilities across different visual perception tasks.

What does Vision Banana reveal about image generation pretraining?

The model shows that image generation pretraining can function as a generalist vision learner, similar to how large language models develop emergent linguistic abilities. By training on image generation, the model inherently develops sophisticated visual representations that can transfer effectively to tasks like segmentation, depth estimation, and surface normal estimation.

What makes Vision Banana's approach different from traditional specialized vision models?

Vision Banana was developed through lightweight instruction-tuning of a generative model, rather than being architecturally designed for specific perception tasks. This approach challenges traditional model development by demonstrating that generative pretraining can yield representations powerful enough to outperform specialist models without requiring task-specific architectural modifications.