Baidu open‑sources multimodal AI model, claims it beats GPT‑5 and Gemini, runs on one 80GB GPU
Baidu’s latest release has stirred the community. The Chinese tech giant just unveiled an open‑source multimodal model that it says outperforms both GPT‑5 and Google’s Gemini, yet it isn’t housed in a sprawling super‑computer. Instead, Baidu positions the system as something a typical corporate IT department could spin up without a massive hardware investment.
While the claim of beating two high‑profile competitors is bold, the real hook lies in the deployment footprint. If a single 80GB graphics card can shoulder the workload, the barrier to entry drops dramatically compared with other offerings that often demand a cluster of top‑tier accelerators. That shift could reshape how enterprises experiment with large‑scale AI, moving the technology from research labs into ordinary data centers.
The documentation provides the technical details that back up this accessibility narrative, and it’s those specifics that make the announcement worth a closer look.
According to Baidu's documentation, the model can run on a single 80GB GPU, hardware readily available in many corporate data centers, making it significantly more accessible than competing systems that may require multiple high‑end accelerators.

The technical documentation reveals that Baidu employed several advanced training techniques to achieve the model's capabilities. The company used "cutting-edge multimodal reinforcement learning techniques on verifiable tasks, integrating GSPO and IcePop strategies to stabilize MoE training combined with dynamic difficulty sampling for exceptional learning efficiency." Baidu also notes that in response to "strong community demand," the company "significantly strengthened the model's grounding performance with improved instruction-following capabilities."

The new model fits into Baidu's ambitious multimodal AI ecosystem

The new release is one component of Baidu's broader ERNIE 4.5 model family, which the company unveiled in June 2025.
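Baidu does not publish details of its "dynamic difficulty sampling" scheme, but the general idea behind such curricula is to weight training examples toward tasks near the model's current ability, rather than sampling uniformly. The sketch below is a generic, hypothetical illustration of difficulty‑weighted sampling, not Baidu's actual method; the function name, inputs, and weighting rule are all assumptions for demonstration.

```python
import random

def dynamic_difficulty_sample(tasks, model_accuracy, k, rng=None):
    """Sample k task IDs, weighted toward difficulties near the model's
    current ability level. Purely illustrative: Baidu's actual dynamic
    difficulty sampling scheme is not described in its documentation.

    tasks: list of (task_id, difficulty) pairs, difficulty in [0, 1]
    model_accuracy: current overall accuracy in [0, 1]
    """
    rng = rng or random.Random(0)  # seeded for reproducibility
    # Weight each task by how close its difficulty is to the model's
    # ability, so training focuses on tasks that are neither trivial
    # nor hopeless for the current model.
    weights = [1.0 - abs(d - model_accuracy) for _, d in tasks]
    return rng.choices([tid for tid, _ in tasks], weights=weights, k=k)
```

Under this toy scheme, a model at 50% accuracy would draw mid‑difficulty tasks most often, shifting the sampling distribution as its ability improves.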
Does the new model truly outpace GPT‑5 and Gemini? Baidu says its ERNIE‑4.5‑VL‑28B‑A3B‑Thinking surpasses those systems on several vision benchmarks, and the documentation lists higher scores on image‑based tests. Yet the evidence is limited to the metrics Baidu chose to publish.
The model runs on a single 80GB GPU, a configuration many data centers already possess, which could lower the barrier to entry compared with multi‑accelerator setups. Still, performance on non‑vision tasks remains unaddressed, and the claim of “beats” is confined to a narrow set of evaluations. The open‑source release invites independent verification, but the technical paper stops short of detailing training data or inference latency.
Consequently, while the accessibility claim appears credible, the broader competitive advantage is uncertain. Observers will need to see third‑party benchmarks before drawing firm conclusions about its standing against the leading multimodal systems. If the model scales as advertised, smaller enterprises could experiment with advanced visual reasoning without the usual hardware outlay.
However, no data on cost per inference or energy consumption has been released, leaving the practical efficiency question open.
Further Reading
- Papers with Code - Latest NLP Research
- Hugging Face Daily Papers
- ArXiv CS.CL (Computation and Language)
Common Questions Answered
What is the name of Baidu's new multimodal model that claims to beat GPT‑5 and Gemini?
The model is called ERNIE‑4.5‑VL‑28B‑A3B‑Thinking. Baidu states that this open‑source multimodal system surpasses both GPT‑5 and Google's Gemini on several vision benchmarks.
How does Baidu's deployment footprint differ from typical large‑scale AI systems?
Baidu claims the model can run on a single 80GB GPU, which is commonly found in corporate data centers. This contrasts with competing systems that often require multiple high‑end accelerators, lowering the barrier to entry for many organizations.
Which specific tasks does Baidu say ERNIE‑4.5‑VL‑28B‑A3B‑Thinking outperforms GPT‑5 and Gemini on?
According to Baidu's documentation, the model achieves higher scores on image‑based vision benchmarks, though the company does not enumerate the individual tests. Baidu attributes the gains in part to multimodal reinforcement learning on verifiable tasks and to strengthened grounding and instruction‑following capabilities.
What limitations does the article note about Baidu's claims of superiority over GPT‑5 and Gemini?
The article points out that Baidu's evidence is limited to the metrics it chose to publish, focusing mainly on vision tasks. Performance on non‑vision tasks remains unaddressed, leaving uncertainty about overall superiority.