
NVIDIA Blackwell Wins All MLPerf Training v5.1 Benchmarks Using FP4 Precision


MLPerf’s training suite has become the de facto yardstick for measuring how quickly a hardware platform can turn raw compute into a usable model. The v5.1 round pushes participants to squeeze every ounce of efficiency out of GPUs while still hitting the same accuracy bar set by the reference implementations. Historically, most submissions have relied on FP16 or BF16, formats that balance speed with the numerical fidelity required for large‑scale language models.
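To make the time‑to‑train metric concrete, here is a minimal sketch of the run‑to‑target‑quality idea in Python. The function and method names, the target metric, and the evaluation cadence are illustrative assumptions, not MLPerf's actual reference harness.

```python
import time

def time_to_train(model, train_batches, evaluate, target_metric, eval_every=500):
    """Train until the evaluation metric reaches the target, then report wall-clock time.

    Hypothetical sketch of the time-to-train idea only: `model.step`, `evaluate`,
    and the evaluation cadence are placeholders, not MLPerf's reference harness.
    """
    start = time.time()
    for step, batch in enumerate(train_batches):
        model.step(batch)                          # one optimizer update
        if step % eval_every == 0 and evaluate(model) >= target_metric:
            return time.time() - start             # seconds to reach the accuracy bar
    raise RuntimeError("target quality not reached within the provided batches")
```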

That makes any claim of using a lower‑precision datatype especially noteworthy: if the numbers hold up, it could reshape how data centers think about cost and power. NVIDIA’s latest Blackwell GPU not only tackled the full benchmark roster but also posted a 10‑minute training run for Llama 3.1’s 405‑billion‑parameter model, a record that eclipses previous attempts. The combination of speed, low precision, and maintained accuracy has drawn attention from researchers and cloud operators alike.

NVIDIA is the only platform to date that has submitted MLPerf Training results with calculations performed using FP4 precision while meeting the benchmark's strict accuracy requirements.

NVIDIA Blackwell Scales to New Heights

NVIDIA set a new Llama 3.1 405B time-to-train record of just 10 minutes, powered by more than 5,000 Blackwell GPUs working together efficiently. This entry was 2.7x faster than the best Blackwell-based result submitted in the prior round, resulting from efficient scaling to more than twice the number of GPUs, as well as the use of NVFP4 precision to dramatically increase the effective performance of each Blackwell GPU. To illustrate the performance increase per GPU, NVIDIA submitted results this round using 2,560 Blackwell GPUs, achieving a time to train of 18.79 minutes, 45% faster than the submission last round using 2,496 GPUs.
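As a back‑of‑envelope check of how those figures relate, the sketch below (written in Python purely for illustration) treats "45% faster" and "2.7x faster" as throughput ratios; that interpretation is an assumption, since the release does not spell it out.

```python
# Back-of-envelope arithmetic; assumes "X% faster" and "Nx faster" are throughput ratios.
new_time_2560 = 18.79                            # minutes this round on 2,560 GPUs
prev_time_2496 = new_time_2560 * 1.45            # implied prior-round time on 2,496 GPUs

record_time = 10.0                               # minutes this round on >5,000 GPUs
print(f"implied prior-round best: {prev_time_2496:.1f} min")                 # ~27.2 min
print(f"implied end-to-end speedup: {prev_time_2496 / record_time:.1f}x")    # ~2.7x

# Per-GPU throughput change between the two most comparable submissions:
per_gpu_gain = (prev_time_2496 * 2496) / (new_time_2560 * 2560)
print(f"implied per-GPU throughput gain: {per_gpu_gain:.2f}x")               # ~1.41x
```

Under that reading, the implied prior-round time of roughly 27 minutes is consistent with both the 2.7x end-to-end claim and the 45% per-submission improvement.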

Related Topics: #NVIDIA Blackwell #MLPerf Training #FP4 #NVFP4 #Llama 3.1 #405‑billion‑parameter #GPUs #FP16 #BF16

Did NVIDIA just set a new bar for AI training? In the latest MLPerf Training v5.1 round the Blackwell platform swept all seven tests, posting the fastest time‑to‑train across the suite. It is the only system to submit results using FP4 precision while still satisfying the benchmark’s strict accuracy thresholds, a detail the release highlights.

The Llama 3.1 405B model was trained in just ten minutes, a figure that underscores the scale of the hardware and software advances claimed—new GPUs, CPUs, NICs, networking and algorithmic tweaks all play a part. Yet the report offers no comparative data, so whether competing platforms can match or exceed these numbers remains unclear. The achievement is impressive, but the broader impact on real‑world workloads is still uncertain.

As the industry watches, the question will be whether this performance translates into cost‑effective, scalable solutions beyond the controlled benchmark environment.


Common Questions Answered

What precision format did NVIDIA Blackwell use to win all MLPerf Training v5.1 benchmarks?

NVIDIA Blackwell used FP4 precision for its MLPerf Training v5.1 submissions. This marks the first time a platform has met the benchmark's strict accuracy thresholds while performing calculations with FP4, a lower‑precision format than the traditionally used FP16 or BF16.

How fast did the Blackwell platform train the Llama 3.1 405B model in the MLPerf v5.1 round?

The Blackwell platform trained the Llama 3.1 405B model in just ten minutes. This record was achieved using more than 5,000 Blackwell GPUs working together efficiently, making it 2.7× faster than the previous best Blackwell‑based result.

Why is the FP4 precision achievement significant for AI training benchmarks?

FP4 stores each value in just 4 bits, which cuts memory traffic and lets the hardware execute more operations per cycle at lower power. Demonstrating that FP4 can still satisfy MLPerf's strict accuracy requirements shows that lower‑precision formats can be viable for large‑scale language model training without sacrificing model quality.
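As a conceptual illustration only (not NVIDIA's NVFP4 recipe), the sketch below quantizes a 1‑D tensor onto the FP4 (E2M1) value grid using a per‑block scale; the block size, grid, and scaling scheme are assumptions chosen for clarity.

```python
import numpy as np

# Non-negative magnitudes representable in FP4 (E2M1); negatives mirror these.
FP4_GRID = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0], dtype=np.float32)

def quantize_fp4_blockwise(x, block_size=16):
    """Conceptual block-scaled FP4 quantization (illustration, not NVFP4).

    Each block shares one scale so its largest magnitude maps to 6.0 (the top of
    the FP4 grid); every element is snapped to the nearest representable magnitude
    and its sign restored.
    """
    x = np.asarray(x, dtype=np.float32).ravel()
    out = np.empty_like(x)
    for start in range(0, x.size, block_size):
        block = x[start:start + block_size]
        scale = float(np.abs(block).max() / 6.0) or 1.0   # guard all-zero blocks
        idx = np.abs(np.abs(block)[:, None] / scale - FP4_GRID[None, :]).argmin(axis=1)
        out[start:start + block_size] = np.sign(block) * FP4_GRID[idx] * scale
    return out

# Example: quantization error on random weights.
w = np.random.randn(64).astype(np.float32)
print("mean abs error:", np.abs(w - quantize_fp4_blockwise(w)).mean())
```

In real training runs, higher-precision master weights and accumulators are typically kept alongside such quantized tensors; the snippet only shows why a shared scale lets 4-bit values cover a useful dynamic range.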

What does the MLPerf Training v5.1 suite measure, and how did Blackwell perform across its tests?

MLPerf Training v5.1 measures the time‑to‑train models while maintaining a predefined accuracy bar set by reference implementations. Blackwell swept all seven tests in the suite, posting the fastest time‑to‑train for each and becoming the only system to submit results using FP4 precision.