AMD's MI355X CDNA4 GPU Shows Competitive Training Times in MLPerf v6.0
AMD has laid out its MLPerf Training v6.0 results, showcasing how the latest Instinct GPUs perform on three high‑profile benchmarks. The submission covers Llama 2 70B LoRA fine‑tuning, Llama 3.1 8B pretraining, and FLUX.1 Schnell text‑to‑image pretraining, running on the MI325X, MI350X and the flagship MI355X. MLPerf is widely regarded as the most rigorous public test for AI training workloads, and AMD’s entry marks three firsts for the company: a production‑ready MXFP4 (FP4) recipe for large language models, the debut of the Primus training framework within an MLPerf submission, and its inaugural multi‑node results.
Those multi‑node runs matter because real‑world AI training typically scales across clusters rather than staying on a single server. Together, the milestones suggest the MI355X’s native FP4 hardware and AMD’s software stack are reaching a level of maturity that could appeal to customers evaluating AI infrastructure. Details on reproducing the results are available in AMD’s accompanying blog post.
Summary
AMD's MLPerf Training v6.0 submission demonstrates continued progress across both hardware and software. On the hardware side, the CDNA4-generation MI355X delivers competitive time-to-train results against NVIDIA B200 on both single-node LLM benchmarks (Llama 2 70B LoRA fine-tuning and Llama 3.1 8B pretraining) at an iso-GPU count of 8, while the MI325X powers an 8-node FLUX.1 Schnell text-to-image submission. On the software side, this round marks the first MLPerf Training submission powered by Primus, AMD's unified training framework, used across both LLM benchmarks, alongside the debut of a production MXFP4 training recipe on the MI355X's native FP4 hardware.
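For readers unfamiliar with MXFP4, the sketch below shows the general idea behind the format as described in the OCP Microscaling (MX) specification: values are grouped into blocks of 32, each block shares one power‑of‑two scale, and each element is stored as a 4‑bit E2M1 float. This is a plain‑Python illustration of the number format only, not AMD's training recipe or anything from Primus; the function names are hypothetical, and the tie‑breaking rule is a simplification.

```python
import math

# Magnitudes an FP4 E2M1 element can represent (1 sign, 2 exponent, 1 mantissa bit).
E2M1_MAGNITUDES = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]
E2M1_EMAX = 2      # exponent of the largest E2M1 value: 6.0 = 1.5 * 2**2
BLOCK_SIZE = 32    # elements that share one power-of-two scale in MXFP4


def quantize_block(values):
    """Return (shared_exponent, elements) for one block of floats."""
    amax = max(abs(v) for v in values)
    if amax == 0.0:
        return 0, [0.0] * len(values)
    # frexp gives amax = m * 2**e with 0.5 <= m < 1, so floor(log2(amax)) = e - 1.
    floor_log2 = math.frexp(amax)[1] - 1
    # Clamp to the exponent range an 8-bit E8M0 scale can hold.
    shared_exp = max(-127, min(127, floor_log2 - E2M1_EMAX))
    scale = 2.0 ** shared_exp
    elements = []
    for v in values:
        scaled = abs(v) / scale
        # Nearest representable magnitude; anything above 6.0 saturates to 6.0.
        # Ties go to the smaller magnitude here, which is a simplification.
        nearest = min(E2M1_MAGNITUDES, key=lambda q: abs(q - scaled))
        elements.append(math.copysign(nearest, v))
    return shared_exp, elements


def dequantize_block(shared_exp, elements):
    """Reconstruct approximate floats from a quantized block."""
    scale = 2.0 ** shared_exp
    return [e * scale for e in elements]


def quantize_mxfp4(values):
    """Split a flat list into blocks of BLOCK_SIZE and quantize each block."""
    return [
        quantize_block(values[i:i + BLOCK_SIZE])
        for i in range(0, len(values), BLOCK_SIZE)
    ]
```

Calling `quantize_block([0.3, -0.1])` yields a shared exponent of -4 and elements `[4.0, -1.5]`, which dequantize to 0.25 and -0.09375. The gap between those values and the inputs is the kind of rounding error a low‑precision training recipe has to be designed to tolerate.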
Why this matters
AMD’s latest MLPerf Training v6.0 numbers give us a concrete data point on how the new CDNA4‑generation MI355X stacks up against NVIDIA’s B200 in single‑node LLM workloads. With 8 GPUs on each side, the MI355X achieved competitive time‑to‑train on both Llama 2 70B LoRA fine‑tuning and Llama 3.1 8B pretraining, which suggests the gap with the B200 is narrowing on these workloads. The broader submission also covered FLUX.1 Schnell text‑to‑image pretraining and spanned the MI325X, MI350X and MI355X.
For developers focused on scaling large language models, the results suggest an alternative path that may avoid exclusive reliance on NVIDIA hardware. Yet the report stops short of detailing power efficiency, cost per training run, or software ecosystem maturity, leaving us to wonder whether competitive training times will translate into real‑world productivity gains. Moreover, the benchmarks represent a limited slice of the training spectrum: the B200 comparison covers only single‑node runs at 8 GPUs, so it’s unclear whether the MI355X would remain competitive at multi‑node scale or on other model families.
As we evaluate our own infrastructure choices, we should weigh these early figures against the broader, still‑unanswered questions about integration effort and long‑term support.