NVIDIA Blackwell scales to 8,192 GPUs on DeepSeek‑V3 671B for MLPerf 6.0
NVIDIA’s Blackwell architecture is the centerpiece of the largest-scale Blackwell-based submission in MLPerf Training to date. The suite covers a range of models, but the spotlight falls on DeepSeek-V3 671B, the largest mixture-of-experts (MoE) model in the benchmark. To train it, NVIDIA scaled its submission to 8,192 GPUs on GB200 NVL72 systems. The same hardware family powered a 5,120-GPU run on Llama 3.1 405B, one of the suite’s largest dense language models.
Two networking options underpinned the effort: NVIDIA Quantum InfiniBand and NVIDIA Spectrum-X Ethernet, giving data-center operators flexibility in how they connect such large arrays. Microsoft Azure scaled Llama 3.1 405B training to the same 8,192-GPU count on GB200 NVL72 systems and reached the reference quality target in 7.07 minutes, the fastest time to train for that benchmark. CoreWeave, using GB300 NVL72 systems linked with Spectrum-X Ethernet, trained DeepSeek-V3 671B to the quality target in 2.02 minutes at 8,192-GPU scale, the fastest time to train for that model.
The results illustrate how deep co‑engineering across hardware, networking and software can stretch a single AI chip family to unprecedented training scales.
On DeepSeek-V3 671B, the largest MoE model in the suite, NVIDIA scaled its submission to 8,192 GPUs using GB200 NVL72 systems, the largest-scale Blackwell-based submission in MLPerf Training to date.
NVIDIA also submitted results at 5,120 GPUs with NVIDIA GB200 NVL72 systems on Llama 3.1 405B, one of the largest dense LLMs in the suite.
This round's results also reflect the deep co-engineering between NVIDIA and its partners on system architecture, networking and software:
- Microsoft Azure scaled Llama 3.1 405B training to 8,192 GPUs using GB200 NVL72 systems, and reached the reference quality target in 7.07 minutes, the fastest time to train for this benchmark.
- CoreWeave delivered the fastest time to train for DeepSeek-V3 671B, reaching the quality target in 2.02 minutes at 8,192-GPU scale using GB300 NVL72 systems connected with Spectrum-X Ethernet networking.
Why this matters
We see NVIDIA pushing the Blackwell architecture to 8,192 GPUs for the DeepSeek-V3 671B MoE model, the largest-scale Blackwell-based submission in MLPerf Training to date. The effort relies on GB200 NVL72 systems and two networking options, Quantum InfiniBand and Spectrum-X Ethernet, which signals the company’s confidence in scaling both compute and interconnect. For developers, the result hints at a hardware envelope that may accommodate ever-larger models without immediate software redesign.
Founders might wonder whether such scale translates into cost‑effective services for end users, given the sheer number of GPUs involved. Researchers gain a data point on how far current silicon can be stretched, yet it remains unclear whether the performance gains stem chiefly from raw GPU count or from optimizations hidden in the submission. The benchmark demonstrates a technical milestone, but practical accessibility for most AI teams is still uncertain.
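To put the cost question in rough terms, the reported figures can be turned into aggregate GPU time. The sketch below is ours, not part of any submission: it multiplies the GPU count by the reported time to train, and it estimates a minimum rack count on the assumption that every NVL72 system contributes all 72 of its GPUs, which the results do not state. The `Run` class and its field names are hypothetical, and the timed benchmark window leaves out cluster setup and any earlier tuning runs.

```python
import math
from dataclasses import dataclass

# Assumption: each NVL72 system contributes all 72 GPUs to the run.
GPUS_PER_NVL72 = 72


@dataclass(frozen=True)
class Run:
    """One reported result; the class and field names are illustrative only."""
    submitter: str
    model: str
    gpus: int
    minutes: float  # reported time to train

    @property
    def gpu_hours(self) -> float:
        # Aggregate GPU time consumed during the timed run only.
        return self.gpus * self.minutes / 60.0

    @property
    def min_racks(self) -> int:
        # Lower bound on NVL72 systems needed to supply this many GPUs.
        return math.ceil(self.gpus / GPUS_PER_NVL72)


# Figures as reported in the article.
RUNS = [
    Run("Microsoft Azure", "Llama 3.1 405B", 8192, 7.07),
    Run("CoreWeave", "DeepSeek-V3 671B", 8192, 2.02),
]

for run in RUNS:
    print(
        f"{run.submitter} / {run.model}: "
        f"{run.gpu_hours:.1f} GPU-hours, at least {run.min_racks} NVL72 systems"
    )
```

Under these assumptions, the Azure run works out to roughly 965 GPU-hours and the CoreWeave run to roughly 276 GPU-hours, each across at least 114 NVL72 systems. Multiplying by an hourly GPU price would give a floor for the timed run alone, not a full cost of reproducing the result.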
As we watch these numbers, we must balance enthusiasm for raw scale with a realistic view of the resources required to reproduce it outside a tightly controlled test environment.