Mixture‑of‑Experts AI Models Run 10× Faster on NVIDIA Blackwell NVL72
Mixture‑of‑Experts (MoE) architectures have become the backbone of many of the most capable frontier AI models, yet their computational appetite often forces developers into costly hardware compromises. The open‑source community has watched the rollout of NVIDIA’s Blackwell line with cautious optimism, hoping the promised efficiency gains will translate into real‑world speedups for large‑scale language systems. Early benchmarks suggest the GB200 NVL72 rack‑scale system could be a turning point, especially for models that rely on expert routing to scale parameter counts without a linear increase in latency.
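For readers unfamiliar with expert routing, the sketch below shows the basic top‑k gating pattern MoE layers use: a small router scores the experts, each token is dispatched only to its best few, and the expert outputs are recombined with the router weights. This is a minimal, illustrative PyTorch example with made‑up layer sizes and names, not the routing code of DeepSeek‑R1 or any model discussed here.

```python
# Minimal top-k expert-routing sketch (illustrative sizes, not any real model's config).
import torch
import torch.nn as nn
import torch.nn.functional as F

class TopKMoELayer(nn.Module):
    """Toy MoE feed-forward layer: each token runs through only its top_k experts."""
    def __init__(self, d_model=512, d_ff=2048, n_experts=8, top_k=2):
        super().__init__()
        self.top_k = top_k
        self.router = nn.Linear(d_model, n_experts)      # gating network scores the experts
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
            for _ in range(n_experts)
        )

    def forward(self, x):                                # x: (n_tokens, d_model)
        scores = self.router(x)                          # (n_tokens, n_experts)
        weights, idx = scores.topk(self.top_k, dim=-1)   # best k experts per token
        weights = F.softmax(weights, dim=-1)             # normalize over the chosen k
        out = torch.zeros_like(x)
        for slot in range(self.top_k):                   # combine each token's chosen experts
            for e, expert in enumerate(self.experts):
                mask = idx[:, slot] == e                 # tokens routed to expert e in this slot
                if mask.any():
                    out[mask] += weights[mask, slot].unsqueeze(-1) * expert(x[mask])
        return out

tokens = torch.randn(16, 512)                            # 16 tokens with d_model=512
print(TopKMoELayer()(tokens).shape)                      # torch.Size([16, 512])
```

Only top_k of the n_experts feed‑forward blocks execute for any given token, which is how MoE models grow total parameter count without a proportional rise in per‑token compute.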
If the hardware can indeed deliver an order‑of‑magnitude improvement over the previous Hopper generation, developers could push deeper, more nuanced MoE configurations into production without the usual trade‑offs. That prospect becomes even more intriguing when paired with custom software tweaks from partners like Together AI, whose optimizations aim to squeeze additional performance out of the silicon. The following remarks from NVIDIA’s founder and CEO at the recent GTC event in Washington, D.C., put those expectations into sharper focus.
At NVIDIA GTC Washington, D.C., NVIDIA founder and CEO Jensen Huang highlighted how GB200 NVL72 delivers 10x the performance of NVIDIA Hopper for DeepSeek-R1, and this performance extends to other DeepSeek variants as well. "With GB200 NVL72 and Together AI's custom optimizations, we are exceeding customer expectations for large-scale inference workloads for MoE models like DeepSeek-V3," said Vipul Ved Prakash, cofounder and CEO of Together AI. "The performance gains come from NVIDIA's full-stack optimizations coupled with Together AI Inference breakthroughs across kernels, runtime engine and speculative decoding." This performance advantage is evident across other frontier models.
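The quote names speculative decoding as one of the techniques behind the inference gains. As a rough illustration of the general idea only, and not Together AI’s or NVIDIA’s implementation, the sketch below has a cheap draft model propose several tokens and a larger target model verify them in a single pass; the toy models and the greedy acceptance rule are assumptions made purely so the example runs.

```python
# Generic greedy speculative decoding sketch; ToyLM is a stand-in, not a real model.
import torch

torch.manual_seed(0)
VOCAB = 100

class ToyLM(torch.nn.Module):
    """Stand-in language model: maps a 1-D sequence of token ids to
    next-token logits at every position, shape (seq_len, VOCAB)."""
    def __init__(self, dim=32):
        super().__init__()
        self.emb = torch.nn.Embedding(VOCAB, dim)
        self.head = torch.nn.Linear(dim, VOCAB)
    def forward(self, seq):
        h = torch.cumsum(self.emb(seq), dim=0)      # crude causal mixing
        return self.head(h)

def speculative_step(draft, target, prefix, k=4):
    # 1) Draft k tokens greedily with the cheap model.
    drafted = prefix.clone()
    for _ in range(k):
        next_tok = draft(drafted)[-1].argmax().view(1)
        drafted = torch.cat([drafted, next_tok])
    # 2) One target-model pass scores every drafted position at once.
    preds = target(drafted).argmax(dim=-1)
    # 3) Keep drafted tokens while they match the target's own greedy choices;
    #    at the first mismatch, substitute the target's token and stop.
    accepted = prefix.clone()
    for i in range(prefix.numel(), drafted.numel()):
        choice = preds[i - 1]                       # target's prediction for position i
        accepted = torch.cat([accepted, choice.view(1)])
        if drafted[i] != choice:
            break
    return accepted

draft, target = ToyLM(dim=16), ToyLM(dim=64)
prefix = torch.randint(0, VOCAB, (5,))
print(speculative_step(draft, target, prefix))      # prefix plus the newly accepted tokens
```

When the draft model agrees with the target often, several tokens are accepted per expensive target pass, which is where the latency savings come from.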
Do these speed gains hold up? The data shows ten‑fold acceleration for Kimi K2 Thinking, DeepSeek‑R1, Mistral Large 3, and several other frontier models when they run on NVIDIA’s Blackwell NVL72, yet the claim rests on NVIDIA’s own benchmarks from the GTC keynote rather than independent measurements.
Because every top‑10 open‑source model examined employs a mixture‑of‑experts architecture, the hardware advantage appears tied to that design. However, the report does not disclose how the speedup translates to real‑world workloads beyond the cited variants, and it remains unclear whether the same optimizations will hold for models that deviate from the MoE pattern.
The claim that “with GB200 NVL72 and Together AI’s custom optimizations, we are exceeding customer expectations” sets no concrete performance target, leaving the exact benefit of the software layer ambiguous. In short, the hardware‑software pairing delivers impressive numbers for a select set of MoE models, but broader applicability and long‑term impact have yet to be demonstrated.
Further Reading
- Latest NLP Research - Papers with Code
- Daily Papers - Hugging Face
- cs.CL (Computation and Language) - arXiv
Common Questions Answered
What performance improvement does the GB200 NVL72 chip provide for DeepSeek‑R1 compared to the Hopper generation?
At NVIDIA GTC, Jensen Huang announced that the GB200 NVL72 delivers ten times the performance of the Hopper generation for DeepSeek‑R1. This ten‑fold boost is highlighted as a key advantage for large‑scale inference workloads on MoE models.
Which frontier AI models are reported to achieve ten‑fold acceleration when running on NVIDIA’s Blackwell NVL72?
The article cites ten‑fold acceleration for Kimi K2 Thinking, DeepSeek‑R1, Mistral Large 3, and several other frontier models when they run on the Blackwell NVL72. These speed gains are based on NVIDIA’s own benchmarks presented at GTC.
How does Together AI contribute to the performance gains of MoE models on the GB200 NVL72?
Vipul Ved Prakash, co‑founder and CEO of Together AI, said that the GB200 NVL72 combined with Together AI’s custom optimizations exceeds customer expectations for large‑scale inference workloads. He credited the gains to NVIDIA’s full‑stack optimizations alongside Together AI’s work on kernels, the runtime engine, and speculative decoding, with MoE models such as DeepSeek‑V3 cited as the headline example.
Why are Mixture‑of‑Experts (MoE) architectures especially relevant to the hardware advantages of the Blackwell NVL72?
Every top‑10 open‑source model examined in the article uses a Mixture‑of‑Experts architecture, which activates only a few experts per token yet still places heavy demands on memory capacity and GPU‑to‑GPU communication at inference scale. The Blackwell NVL72’s efficiency gains therefore translate directly into substantial speedups for these MoE‑based models.