Attention output GEMM reduces blended Fprop speedup to 1.47× in NVFP4 training

Transformers power most large language models and generative AI systems today. As models swell, the GPU hours required for a single training run climb dramatically, and the time engineers spend iterating on experiments stretches out. That makes any speedup more than a convenience: it directly expands the size of models teams can afford to train.

NVIDIA’s Hopper and Blackwell GPUs answer that pressure by adding low-precision operator support, notably FP8 and, on Blackwell, NVFP4. Because a transformer’s training loop is dominated by matrix-multiply (GEMM) operations, cutting the precision of those multiplications can shave both latency and cost. But the payoff for a particular model isn’t obvious from a high-level config; you have to translate the transformer settings and batch size into the exact M × K × N shapes the hardware will process, then benchmark each shape across precisions.
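As a rough illustration of that translation step, the sketch below lists the four forward-pass GEMMs of one transformer layer as M × K × N shapes. It assumes a standard layer with a fused QKV projection and a non-gated two-layer MLP; the names `Gemm` and `layer_fprop_gemms` and the config values are hypothetical and are not taken from the CodonFM walkthrough.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Gemm:
    """One forward-pass GEMM: an (M x K) activation times a (K x N) weight."""
    name: str
    m: int  # tokens per step (micro-batch size * sequence length)
    k: int  # input feature dimension
    n: int  # output feature dimension

    @property
    def flops(self) -> int:
        # One multiply and one add for every (m, k, n) triple.
        return 2 * self.m * self.k * self.n


def layer_fprop_gemms(hidden: int, ffn_hidden: int,
                      micro_batch: int, seq_len: int) -> list[Gemm]:
    """Forward GEMM shapes for one layer (assumed: fused QKV, non-gated MLP)."""
    m = micro_batch * seq_len
    return [
        Gemm("qkv_proj", m, hidden, 3 * hidden),
        Gemm("attn_out_proj", m, hidden, hidden),
        Gemm("ffn_up", m, hidden, ffn_hidden),
        Gemm("ffn_down", m, ffn_hidden, hidden),
    ]


# Hypothetical config, chosen only to make the shapes concrete.
for g in layer_fprop_gemms(hidden=2048, ffn_hidden=8192, micro_batch=4, seq_len=1024):
    print(f"{g.name:14s} M={g.m} K={g.k} N={g.n} GFLOPs={g.flops / 1e9:.1f}")
```

Each shape in the resulting list is what you would then benchmark separately at BF16, FP8, MXFP8 and NVFP4. A model with a gated MLP or grouped-query attention would need different N values for the affected projections.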

NVIDIA’s Transformer Engine automates quantization and kernel selection, making that translation feasible. The walkthrough below uses a 5‑billion‑parameter CodonFM model—an RNA‑focused language model—to show how to move from abstract settings to concrete GEMM workloads, profile them, and decide whether low‑precision formats will deliver real training speedups.

Once you include the attention output GEMM, the blended Fprop speedup drops to 1.47×. After adding Wgrad times, non-GEMM overhead and NVFP4-specific quantization costs, the end-to-end gap between NVFP4 and MXFP8 in training is consistent with these kernel-level numbers.

- FP8 DelayedScaling is surprisingly competitive on NVIDIA Blackwell. At 7.80 ms/layer in autocast mode, it outperforms both FP8 CurrentScaling (9.15 ms) and MXFP8 (8.98 ms). In prequantized mode FP8 CurrentScaling pulls ahead (6.81 ms versus 8.12 ms), suggesting the DelayedScaling amax-history approach has lower quantization overhead but similar raw kernel throughput. The autocast and prequantized comparisons surface different winners depending on whether you measure with or without the quantization tax.

- The prequantized results reveal the true kernel potential. Running with --pre-quantize removes quantization overhead entirely, and the NVFP4 speedup over BF16 jumps from 1.98× (autocast) to 3.48× (kernel-only). This shows the FP4 tensor cores are delivering real speedups.
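The two NVFP4 figures above also give a back-of-the-envelope estimate of the quantization tax. The sketch below assumes both speedups are measured against the same BF16 baseline and that prequantizing removes only quantization work; the function name is hypothetical.

```python
def quantization_overhead_share(autocast_speedup: float,
                                kernel_only_speedup: float) -> float:
    """Fraction of autocast time spent on quantization rather than the GEMM.

    Assumes both speedups are relative to the same baseline, so with the
    baseline time normalized to 1.0 each run takes 1 / speedup.
    """
    if not 0 < autocast_speedup <= kernel_only_speedup:
        raise ValueError("expected 0 < autocast_speedup <= kernel_only_speedup")
    autocast_time = 1.0 / autocast_speedup
    kernel_time = 1.0 / kernel_only_speedup
    return (autocast_time - kernel_time) / autocast_time


# NVFP4 versus BF16 figures quoted in the article.
share = quantization_overhead_share(autocast_speedup=1.98, kernel_only_speedup=3.48)
print(f"quantization share of autocast time: {share:.1%}")
```

Under those assumptions, roughly 43% of the NVFP4 autocast time goes to quantization rather than to the FP4 matrix multiply itself.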

Why this matters

NVIDIA’s Hopper and Blackwell GPUs now support low-precision operators aimed at trimming transformer training costs. Yet the headline figure, a 1.47× blended forward-propagation speedup once the attention output GEMM is counted, suggests the gains are modest. Once Wgrad times, non-GEMM overhead and NVFP4-specific quantization costs are added, the end-to-end gap between NVFP4 and MXFP8 stays in line with those kernel-level numbers.

Does this mean developers can reliably shave hours off large‑scale runs? Possibly, but the data also hint that the promised acceleration may be constrained by the attention output stage and quantization overheads. For founders budgeting GPU time, the improvement is tangible but not transformative; careful profiling will be required to confirm net benefits for specific models.

Researchers should note that while low-precision pathways are now available, actual training speed depends on how much of the workload falls into the optimized kernels versus the less-efficient components. In short, the hardware advances are real, yet their overall impact is bounded by quantization overhead and by the parts of the workload that low precision does not accelerate.
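A small sketch makes that bound concrete. It treats a blended speedup as total baseline time divided by total low-precision time, which is the same reasoning as Amdahl's law. The per-component timings are made up for illustration and are not the article's measurements; `blended_speedup` is a hypothetical helper.

```python
def blended_speedup(baseline_ms: dict[str, float],
                    candidate_ms: dict[str, float]) -> float:
    """Total baseline time over total candidate time for the same components."""
    if baseline_ms.keys() != candidate_ms.keys():
        raise ValueError("both runs must time the same components")
    return sum(baseline_ms.values()) / sum(candidate_ms.values())


# Hypothetical per-layer timings in milliseconds (illustration only).
baseline = {"qkv_proj": 3.0, "ffn_up": 4.0, "ffn_down": 4.0}
low_precision = {"qkv_proj": 1.5, "ffn_up": 2.0, "ffn_down": 2.0}
fast_only = blended_speedup(baseline, low_precision)

# Add one component that low precision does not accelerate at all.
baseline_all = {**baseline, "attn_out_proj": 1.0}
low_precision_all = {**low_precision, "attn_out_proj": 1.0}
with_slow = blended_speedup(baseline_all, low_precision_all)

print(f"fast GEMMs only: {fast_only:.2f}x, with slow component: {with_slow:.2f}x")
```

With these invented numbers, every accelerated GEMM runs 2× faster, yet one unaccelerated component that is under a tenth of the baseline time pulls the blend down to about 1.85×. The same mechanism is at work when the attention output GEMM lowers the article's blended figure.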
