AMD builds Llama 3.1 8B pretraining benchmark for MLPerf, using random weights

2 min read

AMD has posted a detailed guide for anyone wanting to reproduce its MLPerf Training v6.0 results. The company entered three benchmarks this round: Llama 2 70B LoRA fine‑tuning on GovReport data, Llama 3.1 8B pretraining from random weights using a slice of the C4 corpus, and the Flux.1‑schnell text‑to‑image workload on an eight‑node MI325X cluster. All three submissions ran on AMD Instinct MI325X, MI350X or MI355X GPUs, and for the first time the Primus training framework powered both LLM runs.

The blog walks readers through environment setup, dataset preparation, training configuration, execution and result validation. You’ll need a supported ROCm 7.2.2+ stack, Docker, Slurm for the multinode Flux job, and at least 6 TB of storage for the dataset. The instructions assume a Linux host and a compatible AMD GPU.
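Before starting, it can help to confirm those prerequisites on the host. The following is a minimal preflight sketch, not something taken from AMD's guide: the helper names `require_cmds` and `has_free_tb` are hypothetical, and the check treats 1 TB as 10^12 bytes. It does not verify the ROCm version.

```shell
# Hypothetical preflight helpers; names are illustrative, not from AMD's guide.

# Return 0 if every named command is on PATH; report missing ones on stderr.
require_cmds() {
    missing=0
    for cmd in "$@"; do
        if ! command -v "$cmd" >/dev/null 2>&1; then
            echo "missing: $cmd" >&2
            missing=1
        fi
    done
    return "$missing"
}

# Return 0 if the filesystem holding path $1 has at least $2 TB free.
# Assumption: 1 TB = 10^12 bytes; df -Pk reports 1024-byte blocks.
has_free_tb() {
    avail_kb="$(df -Pk "$1" | awk 'NR==2 {print $4}')"
    need_kb=$(( $2 * 1000000000000 / 1024 ))
    [ "$avail_kb" -ge "$need_kb" ]
}
```

A typical call might be `require_cmds docker sbatch && has_free_tb /data 6`, where `/data` is assumed to be the mount that will hold the dataset and `sbatch` stands in for a working Slurm client.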

For deeper performance analysis, AMD points to a companion technical deep dive. Follow the steps on comparable hardware, and you should be able to get close to the numbers AMD reported to MLPerf.

One practical note from the guide: make sure $LOGDIR is writable so the results can be saved, for example by running sudo chmod -R 777 $LOGDIR. The example uses /data/mlperf_llama31_8b/results as the results directory, so create that directory before launching the run.
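The directory step can be sketched as a small helper that creates the folder and checks that it is actually writable. This is an illustrative sketch under stated assumptions: `ensure_logdir` is a hypothetical name, and the check probes with a temporary file instead of changing permissions itself.

```shell
# Hypothetical helper: create a results directory and verify it is writable.
ensure_logdir() {
    dir="$1"
    mkdir -p "$dir" || return 1
    # Probe with a temporary file rather than trusting permission bits.
    probe="$dir/.write_probe.$$"
    if touch "$probe" 2>/dev/null; then
        rm -f "$probe"
        echo "writable: $dir"
    else
        # Fall back to the command quoted in the guide if the probe fails.
        echo "not writable: $dir (try: sudo chmod -R 777 $dir)" >&2
        return 1
    fi
}
```

With the article's example path, this would be called as `LOGDIR=/data/mlperf_llama31_8b/results; ensure_logdir "$LOGDIR"`. Note that chmod -R 777 grants write access to every user on the host, so a narrower mode may be preferable on shared machines.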

Why this matters

We can now reproduce AMD’s MLPerf Training v6.0 results, including the Llama 3.1 8B pretraining benchmark that starts from random weights. For developers, the step‑by‑step guide means we can test our own hardware against a known baseline without hunting for checkpoint conversions. Founders may see a clearer cost picture for training a midsize Llama model on a subset of the C4 dataset, though the guide does not disclose exact resource consumption.

Researchers gain a reproducible reference point for full‑model pretraining, which could help isolate algorithmic improvements from hardware quirks. Yet, the benchmark uses only a portion of C4, so it remains unclear whether results will scale to the full dataset or larger model variants. The inclusion of Flux.1‑schnell alongside the language benchmarks shows AMD’s intent to cover both text and vision workloads, but we lack performance numbers for that task here.

In short, the reproducibility effort lowers the entry barrier for benchmarking, but practical adoption will depend on how well the subset‑based results translate to real‑world training demands.
