

Google DeepMind's Vision Banana Outperforms SAM 3 and Depth Anything V3


Google DeepMind’s latest model, dubbed Vision Banana, has just topped two well‑known benchmarks: it outperformed Meta’s SAM 3 on segmentation and eclipsed Depth Anything V3 on metric depth estimation. The result is striking because both reference systems have been the go‑to baselines for visual perception tasks in recent research. Yet Vision Banana arrived not as a purpose‑built segmenter or depth estimator, but as an instruction‑tuned image generator.

While the model’s primary training objective was to produce pictures from text prompts, the evaluation shows it can repurpose that knowledge for tasks it was never explicitly trained on. If a generator can double as a competent recognizer, the line between “generative” and “perceptual” AI may be thinner than most assume. That raises questions about how much of a model’s internal representation is shaped by the data rather than by the task label.

The upcoming key takeaways spell out why this matters for the broader vision community.

Key Takeaways

- Image generation pretraining is a generalist vision learner: Just as LLM pretraining unlocks emergent language understanding, Google's research shows that training on image generation naturally develops powerful internal visual representations that transfer to perception tasks like segmentation, depth estimation, and surface normal estimation.
- Vision Banana beats specialist models without specialist architecture: Built by lightweight instruction-tuning of Nano Banana Pro, Vision Banana surpasses SAM 3 on three segmentation benchmarks, Depth Anything V3 on metric depth estimation (δ1: 0.929 vs 0.918), and Lotus-2 on surface normal estimation (mean angular error: 18.928° vs 19.642°), all in zero-shot transfer settings.
- All vision tasks are reframed as image generation: By parameterizing vision task outputs as RGB images with decodable color schemes, Vision Banana uses a single set of weights and prompt-only switching across semantic segmentation, instance segmentation, depth estimation, and surface normal estimation, with no task-specific modules required.
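To make the "vision tasks as RGB images" idea concrete, here is a minimal sketch of what a decodable color scheme could look like. The paper's actual encoding is not specified here; this example assumes, purely for illustration, that metric depth is packed into the red and green channels as a 16-bit fixed-point value, so a generated image can be turned back into a depth map with a few array operations.

```python
import numpy as np

# Hypothetical color scheme: depth in [0, max_depth] meters is quantized
# to 16 bits and split across the red (high byte) and green (low byte)
# channels of an ordinary uint8 RGB image.

def encode_depth_rgb(depth: np.ndarray, max_depth: float = 10.0) -> np.ndarray:
    """Encode an HxW depth map (meters) as an HxWx3 uint8 RGB image."""
    code = np.clip(depth / max_depth, 0.0, 1.0) * 65535.0
    code = code.astype(np.uint16)
    rgb = np.zeros(depth.shape + (3,), dtype=np.uint8)
    rgb[..., 0] = code >> 8      # high byte -> red channel
    rgb[..., 1] = code & 0xFF    # low byte  -> green channel
    return rgb

def decode_depth_rgb(rgb: np.ndarray, max_depth: float = 10.0) -> np.ndarray:
    """Recover the HxW metric depth map from the encoded RGB image."""
    r = rgb[..., 0].astype(np.uint16)
    g = rgb[..., 1].astype(np.uint16)
    code = (r << 8) | g          # reassemble the 16-bit code per pixel
    return code.astype(np.float32) / 65535.0 * max_depth
```

The appeal of this framing is that every task output lives in the model's native output space (an image), so switching tasks is just a matter of changing the prompt rather than swapping decoder heads.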

Vision Banana marks a notable shift in how image‑generation pretraining is viewed. By topping SAM 3 on segmentation and surpassing Depth Anything V3 on metric depth estimation, the model shows that generative training can yield representations useful for classic perception tasks. The authors liken this to the way large‑language‑model pretraining unlocked emergent linguistic abilities, suggesting a parallel in the visual domain.

Yet the paper stops short of claiming universal applicability; it remains unclear whether the same transfer will hold across more diverse datasets or real‑world deployments. The results challenge the long‑standing split between generative and discriminative research tracks, prompting a re‑examination of entrenched assumptions. Critics may point out that the benchmarks used represent a narrow slice of vision problems, and further work will be needed to assess robustness under varied conditions.

Nonetheless, the evidence presented provides a concrete example that image‑generation objectives can cultivate internal visual knowledge that extends beyond pure synthesis, opening a modest but tangible avenue for future exploration.


Common Questions Answered

How did Vision Banana outperform specialized models like SAM 3 and Depth Anything V3?

Vision Banana achieved superior performance by leveraging image generation pretraining, which naturally develops powerful internal visual representations. Unlike purpose-built segmentation or depth estimation models, this instruction-tuned image generator demonstrated remarkable transfer learning capabilities across different visual perception tasks.

What does Vision Banana reveal about image generation pretraining?

The model shows that image generation pretraining can function as a generalist vision learner, similar to how large language models develop emergent linguistic abilities. By training on image generation, the model inherently develops sophisticated visual representations that can transfer effectively to tasks like segmentation, depth estimation, and surface normal estimation.

What makes Vision Banana's approach different from traditional specialized vision models?

Vision Banana was developed through lightweight instruction-tuning of a generative model, rather than being architecturally designed for specific perception tasks. This approach challenges traditional model development by demonstrating that generative pretraining can yield representations powerful enough to outperform specialist models without requiring task-specific architectural modifications.