

NVIDIA Blackwell Wins All MLPerf Training v5.1 Benchmarks with FP4 Precision


In the high-stakes world of AI computing, benchmark tests can make or break a company's reputation. NVIDIA just dropped a bombshell, dominating the latest MLPerf Training v5.1 benchmarks with its Blackwell architecture.

The company's newest chip isn't just another incremental upgrade; it's a potential game-changer for AI training speeds. By introducing FP4 precision calculations, NVIDIA may have unlocked a new frontier of computational efficiency that could reshape how complex AI models are developed.
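To make "FP4" concrete: a 4-bit floating-point number in the common E2M1 layout has one sign bit, two exponent bits, and one mantissa bit, so it can represent only a handful of distinct magnitudes. The sketch below is a generic illustration of that idea, not NVIDIA's actual NVFP4 implementation, which adds per-block scale factors and other details not shown here.

```python
# Illustrative sketch of 4-bit floating point in an E2M1 layout:
# 1 sign bit, 2 exponent bits, 1 mantissa bit. This is a generic
# model for illustration only; it is NOT NVIDIA's NVFP4 format,
# which layers per-block scaling on top of the 4-bit elements.

# All non-negative magnitudes representable in E2M1.
FP4_GRID = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]

def quantize_fp4(x: float) -> float:
    """Round x to the nearest representable E2M1 value."""
    sign = -1.0 if x < 0 else 1.0
    mag = min(abs(x), 6.0)  # saturate at the maximum magnitude
    nearest = min(FP4_GRID, key=lambda g: abs(g - mag))
    return sign * nearest

if __name__ == "__main__":
    for v in [0.3, 1.2, 2.6, -5.1, 10.0]:
        print(f"{v:>5} -> {quantize_fp4(v)}")
```

With only sixteen representable values, every weight or activation lands on this coarse grid, which is why meeting MLPerf's accuracy requirements at FP4 is the hard part of the result.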

Researchers and tech companies live and die by training times, where every minute saved translates to massive cost and resource reductions. NVIDIA's latest results suggest they're not just competing in this space; they're rewriting the rules.

The most eye-catching proof? A staggering 10-minute training time for the massive Llama 3.1 405B model. This isn't just fast; it's unprecedented in the AI training landscape.

NVIDIA is the only platform to date that has submitted MLPerf Training results with calculations performed using FP4 precision while meeting the benchmark's strict accuracy requirements.

NVIDIA Blackwell Scales to New Heights

NVIDIA set a new Llama 3.1 405B time-to-train record of just 10 minutes, powered by more than 5,000 Blackwell GPUs working together efficiently. This entry was 2.7x faster than the best Blackwell-based result submitted in the prior round, resulting from efficient scaling to more than twice the number of GPUs, as well as the use of NVFP4 precision to dramatically increase the effective performance of each Blackwell GPU.

To illustrate the per-GPU performance increase, NVIDIA submitted results this round using 2,560 Blackwell GPUs, achieving a time to train of 18.79 minutes, 45% faster than last round's submission using 2,496 GPUs.
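A quick back-of-the-envelope calculation shows how close these numbers are to ideal linear scaling. The GPU counts and times come from the figures quoted above; since the article says only "more than 5,000" GPUs, the sketch assumes 5,120 purely for illustration.

```python
# Back-of-the-envelope check of the scaling figures quoted above.
# The 2,560-GPU / 18.79-minute data point comes from the article;
# "more than 5,000 GPUs" is taken as 5,120 here, an assumption made
# only for illustration -- the exact count isn't stated in the text.

def scaling_efficiency(gpus_a: int, time_a: float,
                       gpus_b: int, time_b: float) -> float:
    """Fraction of ideal (linear) speedup achieved when scaling
    from configuration A to the larger configuration B."""
    actual_speedup = time_a / time_b        # how much faster B ran
    ideal_speedup = gpus_b / gpus_a         # perfect linear scaling
    return actual_speedup / ideal_speedup

if __name__ == "__main__":
    eff = scaling_efficiency(2560, 18.79, 5120, 10.0)
    print(f"scaling efficiency: {eff:.0%}")
```

Under that assumption, doubling the GPU count cuts the training time from 18.79 to roughly 10 minutes, about 94% of perfect linear scaling, which is what makes the 5,000+-GPU run notable beyond the headline number.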

NVIDIA's Blackwell architecture is resetting performance expectations in AI training. The platform's breakthrough comes through exceptional efficiency, setting a new Llama 3.1 405B training record of just 10 minutes using over 5,000 GPUs.

What stands out is NVIDIA's unique achievement in MLPerf Training v5.1: they're currently the only platform submitting results using FP4 precision while maintaining strict accuracy standards. This isn't just incremental improvement; it's a significant leap in computational efficiency.

The 2.7x speedup compared to their previous Blackwell-based result suggests remarkable scaling capabilities. Such performance gains could have profound implications for large language model development, potentially reducing training times and computational costs.

Still, questions remain about real-world application beyond benchmarks. How will these theoretical gains translate to practical AI development? While the technical achievement is impressive, the true test will be widespread adoption and consistent performance across diverse workloads.

For now, NVIDIA has set a new high-water mark in AI training technology, one that competitors will undoubtedly be studying closely.


Common Questions Answered

How did NVIDIA's Blackwell architecture perform in the MLPerf Training v5.1 benchmarks?

NVIDIA dominated the MLPerf Training v5.1 benchmarks as the only platform to submit results using FP4 precision while meeting the benchmark's strict accuracy requirements. The company set a remarkable Llama 3.1 405B training record of just 10 minutes using over 5,000 Blackwell GPUs, 2.7x faster than its previous best result.

What makes NVIDIA's FP4 precision calculations significant in AI training?

NVIDIA's FP4 precision is groundbreaking because it represents a new frontier of computational efficiency in AI training. The company is currently the only platform able to submit MLPerf Training results using FP4 precision while maintaining the benchmark's stringent accuracy standards, which could potentially revolutionize AI computational performance.

What was NVIDIA's specific achievement with the Llama 3.1 405B model?

NVIDIA achieved an unprecedented time-to-train record of just 10 minutes for the Llama 3.1 405B model, utilizing more than 5,000 Blackwell GPUs working together efficiently. This result was 2.7x faster than their previous best submission, demonstrating significant scaling and performance improvements in AI model training.