
NVIDIA Blackwell Wins All MLPerf Training v5.1 Benchmarks with FP4 Accuracy


MLPerf's training suite is the go-to way to see how fast a chip can turn raw compute into a trained model. The v5.1 round pushes teams to wring every drop of efficiency from their GPUs while still hitting the same accuracy targets as the reference runs. Until now, most entries have stuck with FP16 or BF16: fast enough, and numerically stable for big language models.

So a claim about dropping to an even lower-precision format catches my eye; if the results hold up, data centers may rethink both cost and power budgets. NVIDIA's new Blackwell platform ran the entire test suite and, remarkably, posted a 10-minute training run for Llama 3.1's 405-billion-parameter model, a record that beats all earlier submissions. The combination of speed, low precision, and maintained accuracy has researchers and cloud operators talking, and it stands as a unique win for the platform.

NVIDIA is the only platform to date that has submitted MLPerf Training results with calculations performed using FP4 precision while meeting the benchmark's strict accuracy requirements.

NVIDIA Blackwell Scales to New Heights

NVIDIA set a new Llama 3.1 405B time-to-train record of just 10 minutes, powered by more than 5,000 Blackwell GPUs working together efficiently. This entry was 2.7x faster than the best Blackwell-based result submitted in the prior round, resulting from efficient scaling to more than twice the number of GPUs, as well as the use of NVFP4 precision to dramatically increase the effective performance of each Blackwell GPU. To illustrate the per-GPU performance increase, NVIDIA also submitted results this round using 2,560 Blackwell GPUs, achieving a time to train of 18.79 minutes, 45% faster than last round's submission using 2,496 GPUs.
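The figures above can be sanity-checked with back-of-the-envelope arithmetic. This is a rough sketch, not official data: the interpretation of "X% faster" as prior_time = new_time × (1 + X/100), and the use of 5,000 as the GPU count for the record run, are assumptions.

```python
# Back-of-the-envelope check of the scaling figures quoted above.
# Assumption: "X% faster" means prior_time = new_time * (1 + X/100).

record_time_min = 10.0                     # Llama 3.1 405B time-to-train this round
prior_best_min = record_time_min * 2.7     # implied prior-round Blackwell best (~27 min)

this_round_time = 18.79                    # minutes, with 2,560 GPUs
this_round_gpus = 2560
last_round_gpus = 2496
last_round_time = this_round_time * 1.45   # implied last-round time (~27.2 min)

# How much of the 1.45x speedup the extra GPUs alone would explain,
# versus the effective per-GPU gain (largely attributable to NVFP4).
gpu_ratio = this_round_gpus / last_round_gpus      # only ~1.026x more GPUs
time_ratio = last_round_time / this_round_time     # 1.45x faster overall
per_gpu_gain = time_ratio / gpu_ratio              # ~1.41x effective per-GPU gain

print(f"implied prior-round best: {prior_best_min:.1f} min")
print(f"implied last-round time:  {last_round_time:.1f} min")
print(f"effective per-GPU gain:   {per_gpu_gain:.2f}x")
```

The takeaway of the sketch: the hardware count barely changed between the two comparable submissions, so almost all of the 45% improvement must come from higher effective throughput per GPU.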


It looks like NVIDIA has nudged the AI-training ceiling higher. In the newest MLPerf Training v5.1 round, the Blackwell platform swept all seven tests, posting the fastest time-to-train the suite has seen. Notably, it is the only system to report results in FP4 precision while still meeting the benchmark's tight accuracy limits, a point the release calls out.

The Llama 3.1 405B model, for example, finished training in roughly ten minutes, a figure that reflects the combined contribution of new GPUs, CPUs, NICs, networking gear, and a handful of algorithmic tweaks. Still, the report gives no side-by-side numbers from competing platforms, so it's hard to say whether others can keep up. The feat is eye-catching, but its impact on everyday workloads remains to be seen.

I’m watching to see whether this speed will turn into a cost-effective, scalable solution outside the lab setting.


Common Questions Answered

What precision format did NVIDIA Blackwell use to win all MLPerf Training v5.1 benchmarks?

NVIDIA Blackwell used FP4 precision for its MLPerf Training v5.1 submissions. This marks the first time a platform has met the benchmark's strict accuracy thresholds while performing calculations with FP4, a lower‑precision format than the traditionally used FP16 or BF16.

How fast did the Blackwell platform train the Llama 3.1 405B model in the MLPerf v5.1 round?

The Blackwell platform trained the Llama 3.1 405B model in just ten minutes. This record was achieved using more than 5,000 Blackwell GPUs working together efficiently, making it 2.7× faster than the previous best Blackwell‑based result.

Why is the FP4 precision achievement significant for AI training benchmarks?

FP4 precision reduces the amount of data processed per operation, allowing higher throughput and lower power consumption. Demonstrating that FP4 can still satisfy MLPerf's strict accuracy requirements proves that lower‑precision formats can be viable for large‑scale language model training without sacrificing model quality.
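To make the idea of block-scaled FP4 concrete, here is a minimal illustrative sketch of quantizing a block of weights to the FP4 (E2M1) grid with a shared scale factor. This is in the spirit of NVFP4 but is not NVIDIA's implementation: the assumption that real NVFP4 uses small blocks (16 elements) with a compact scale factor is noted in the comments, and the toy 4-element block here is purely for illustration.

```python
# Illustrative block-scaled FP4 (E2M1) quantization sketch.
# Assumption: actual NVFP4 uses 16-element blocks with an FP8 scale factor;
# here the block is tiny and the scale is a plain Python float.

FP4_E2M1_POS = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]  # positive E2M1 values
GRID = sorted({s * v for v in FP4_E2M1_POS for s in (-1.0, 1.0)})

def quantize_block(block):
    """Scale the block so its max magnitude maps to 6.0, then snap to the grid."""
    amax = max(abs(x) for x in block) or 1.0
    scale = amax / 6.0                       # 6.0 is the largest FP4 magnitude
    codes = [min(GRID, key=lambda g: abs(x / scale - g)) for x in block]
    return scale, codes

def dequantize_block(scale, codes):
    """Reconstruct approximate values: one float scale plus 4-bit codes."""
    return [scale * c for c in codes]

weights = [0.81, -0.12, 0.33, -0.97]         # toy block (real blocks are wider)
scale, codes = quantize_block(weights)
approx = dequantize_block(scale, codes)
print(codes)    # grid values, each storable in 4 bits
print(approx)   # reconstructed weights
```

The sketch shows both halves of the trade-off the answer describes: each value shrinks to 4 bits (plus a shared scale), which is where the throughput and power savings come from, while the reconstruction error is what techniques like fine-grained block scaling keep small enough to preserve model quality.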

What does the MLPerf Training v5.1 suite measure, and how did Blackwell perform across its tests?

MLPerf Training v5.1 measures the time‑to‑train models while maintaining a predefined accuracy bar set by reference implementations. Blackwell swept all seven tests in the suite, posting the fastest time‑to‑train for each and becoming the only system to submit results using FP4 precision.