LLMs & Generative AI

Baidu opens multimodal AI, claims it beats GPT‑5 and Gemini, runs on one 80GB GPU


When Baidu rolled out its newest model, the buzz was hard to miss. The Chinese giant says the open-source multimodal system can actually beat both GPT-5 and Google's Gemini - a claim that feels almost too big to swallow. What's odd, though, is that it isn't tied to a massive supercomputer.

Baidu pitches it as something a regular corporate IT team could spin up with just an 80GB graphics card, no need for a whole rack of pricey accelerators. If that holds up, the cost hurdle drops from “impossible” to “maybe doable” for many firms. I’m curious about the numbers they’ve dropped in the docs - they lay out latency, memory use and training tricks that supposedly make the whole thing fit on a single GPU.
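The single-GPU claim is at least arithmetically plausible. A quick back-of-the-envelope sketch (my own estimate, not from Baidu's docs, assuming bf16 weights and ignoring KV cache and activation memory): the model name suggests roughly 28B total parameters, and at 2 bytes per parameter the full weight set comes in well under 80GB.

```python
# Back-of-the-envelope check of the single-80GB-GPU claim.
# Assumptions (mine, not from Baidu's docs): weights stored in bfloat16
# (2 bytes/parameter); KV cache and activation memory are ignored.

BYTES_PER_PARAM_BF16 = 2
GIB = 1024 ** 3

def weight_memory_gib(num_params: float,
                      bytes_per_param: int = BYTES_PER_PARAM_BF16) -> float:
    """Estimate raw weight storage for a model, in GiB."""
    return num_params * bytes_per_param / GIB

# ERNIE-4.5-VL-28B-A3B-Thinking: ~28B total parameters. In an MoE model
# all experts must still reside in GPU memory, so total (not active)
# parameters determine the weight footprint.
total_gib = weight_memory_gib(28e9)
print(f"bf16 weights: ~{total_gib:.0f} GiB vs. an 80GB GPU")
```

That leaves headroom on an 80GB card for the KV cache and activations at moderate context lengths, which is consistent with Baidu's pitch; quantized weights would shrink the footprint further.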

It’s not clear yet whether the performance edge over GPT-5 and Gemini survives real-world workloads, but the idea of running a large-scale AI model in a standard data center is tempting. That could push a lot of enterprises from tinkering in labs to actually deploying AI in day-to-day operations. The technical details are worth a closer look, even if we remain a bit skeptical.

According to Baidu's documentation, the model can run on a single 80GB GPU -- hardware readily available in many corporate data centers -- making it significantly more accessible than competing systems that may require multiple high-end accelerators.

The technical documentation reveals that Baidu employed several advanced training techniques to achieve the model's capabilities. The company used "cutting-edge multimodal reinforcement learning techniques on verifiable tasks, integrating GSPO and IcePop strategies to stabilize MoE training combined with dynamic difficulty sampling for exceptional learning efficiency." Baidu also notes that in response to "strong community demand," the company "significantly strengthened the model's grounding performance with improved instruction-following capabilities."

The new model fits into Baidu's ambitious multimodal AI ecosystem

The new release is one component of Baidu's broader ERNIE 4.5 model family, which the company unveiled in June 2025.

Related Topics: #Baidu #multimodal AI #GPT‑5 #Gemini #80GB GPU #ERNIE #GSPO #IcePop #MoE training

Baidu claims its ERNIE-4.5-VL-28B-A3B-Thinking outperforms GPT-5 and Gemini on a handful of vision benchmarks, and the paper does list higher scores on image-based tests. The catch is that we only see the metrics Baidu chose to show. It runs on a single 80GB GPU, something many data centers already have, which might make it easier to adopt than models that need several accelerators.

But we haven’t seen any results on non-vision tasks, so the “beats” headline feels a bit narrow. The open-source release does let others check the claims, yet the accompanying paper skips details on training data and inference latency. Because of that, the accessibility argument looks plausible, while the overall competitive edge stays fuzzy.

We’ll probably need third-party benchmarks before we can say for sure how it stacks up against the top multimodal systems. If the scaling claims hold, small companies could tinker with advanced visual reasoning without buying huge rigs. Still, no numbers on cost per inference or energy use have been shared, so the efficiency question remains open.

Common Questions Answered

What is the name of Baidu's new multimodal model that claims to beat GPT‑5 and Gemini?

The model is called ERNIE‑4.5‑VL‑28B‑A3B‑Thinking. Baidu states that this open‑source multimodal system surpasses both GPT‑5 and Google's Gemini on several vision benchmarks.

How does Baidu's deployment footprint differ from typical large‑scale AI systems?

Baidu claims the model can run on a single 80GB GPU, which is commonly found in corporate data centers. This contrasts with competing systems that often require multiple high‑end accelerators, lowering the barrier to entry for many organizations.

Which specific tasks does Baidu say ERNIE‑4.5‑VL‑28B‑A3B‑Thinking outperforms GPT‑5 and Gemini on?

According to Baidu's documentation, the model achieves higher scores on image-based vision benchmarks. The company does not claim superiority on non-vision tasks; the "verifiable tasks" it mentions refer to the multimodal reinforcement learning used during training, not to the benchmark comparisons themselves.

What limitations does the article note about Baidu's claims of superiority over GPT‑5 and Gemini?

The article points out that Baidu's evidence is limited to the metrics it chose to publish, focusing mainly on vision tasks. Performance on non‑vision tasks remains unaddressed, leaving uncertainty about overall superiority.