Attention output GEMM reduces blended Fprop speedup to...

By AI Daily Post Edited by Brian Petersen, Editor-in-Chief

June 16, 2026 • Updated: July 8, 2026 • 4 min read

The 1.47x speedup for NVFP4 training isn't fake. It's just the real answer after you pay the bill. The bill is for the attention output GEMM: once it is counted, the blended forward-pass (Fprop) speedup settles at 1.47x, well below the flashier raw kernel numbers. Add everything else (weight-gradient GEMMs, non-GEMM overhead, the cost of quantizing to NVFP4 itself) and the end-to-end training gap between NVFP4 and MXFP8 is consistent with those sober kernel figures.
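To see why adding one GEMM can pull a blended figure down, here is a minimal sketch. It assumes "blended speedup" means total baseline time divided by total candidate time over the same set of GEMMs. The GEMM names and every timing below are hypothetical, chosen only for illustration, and the assumption that the attention output GEMM gains nothing is likewise illustrative, not a claim from the source.

```python
def blended_speedup(baseline_ms: dict[str, float], candidate_ms: dict[str, float]) -> float:
    """Total baseline time divided by total candidate time over the same GEMMs."""
    if baseline_ms.keys() != candidate_ms.keys():
        raise ValueError("both runs must time the same set of GEMMs")
    return sum(baseline_ms.values()) / sum(candidate_ms.values())


# Hypothetical per-layer GEMM times in ms. These are NOT measurements from the source.
baseline = {"qkv_proj": 2.0, "mlp_up": 4.0, "mlp_down": 4.0}
candidate = {"qkv_proj": 1.0, "mlp_up": 2.0, "mlp_down": 2.0}
without_attn_out = blended_speedup(baseline, candidate)

# Illustrative assumption: the attention output GEMM sees no speedup at all.
baseline_all = {**baseline, "attn_out": 2.0}
candidate_all = {**candidate, "attn_out": 2.0}
with_attn_out = blended_speedup(baseline_all, candidate_all)

print(f"{without_attn_out:.2f}x -> {with_attn_out:.2f}x")  # 2.00x -> 1.71x
```

Every GEMM that gains less than the others drags the blend toward its own ratio, weighted by how much time it takes.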

But the plot twists on NVIDIA's Blackwell. In autocast mode, FP8 DelayedScaling finishes a layer in 7.80 ms. That beats both FP8 CurrentScaling (9.15 ms) and MXFP8 (8.98 ms).

Switch to prequantized mode, and CurrentScaling wins instead, at 6.81 ms against DelayedScaling's 8.12 ms. This isn't a contradiction. It's diagnostics.

The numbers suggest DelayedScaling's trick is lower quantization overhead, not a faster core. Autocast shows you the system cost. Prequantized shows you the engine.
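The flip in winners can be checked directly from the per-layer timings quoted in this article. The sketch below only sorts those published numbers. The dictionary and function names are illustrative, and assigning 6.81 ms to CurrentScaling and 8.12 ms to DelayedScaling follows this article's reading of the source.

```python
# Per-layer times in ms, as quoted in this article.
AUTOCAST_MS = {
    "FP8 DelayedScaling": 7.80,
    "FP8 CurrentScaling": 9.15,
    "MXFP8": 8.98,
}
PREQUANTIZED_MS = {
    "FP8 CurrentScaling": 6.81,
    "FP8 DelayedScaling": 8.12,
}


def ranking(times_ms: dict[str, float]) -> list[str]:
    """Recipe names ordered from fastest to slowest."""
    return sorted(times_ms, key=times_ms.get)


def fastest(times_ms: dict[str, float]) -> str:
    """Name of the recipe with the lowest per-layer time."""
    return min(times_ms, key=times_ms.get)


print("autocast:    ", " < ".join(ranking(AUTOCAST_MS)))
print("prequantized:", " < ".join(ranking(PREQUANTIZED_MS)))
```

Same recipes, different measurement mode, different winner: which ranking matters depends on whether quantization cost is part of what you are trying to reduce.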

Once you include the attention output GEMM, the blended Fprop speedup drops to 1.47x. After adding Wgrad times, non-GEMM overhead and NVFP4-specific quantization costs, the end-to-end gap between NVFP4 and MXFP8 in training is consistent with these kernel-level numbers.

- FP8 DelayedScaling is surprisingly competitive on NVIDIA Blackwell. At 7.80 ms/layer in autocast mode, it outperforms both FP8 CurrentScaling (9.15 ms) and MXFP8 (8.98 ms). In prequantized mode FP8 CurrentScaling pulls ahead (6.81 ms versus 8.12 ms), suggesting the DelayedScaling amax-history approach has lower quantization overhead but similar raw kernel throughput. This is a good example of the comparison between autocast and prequantized surfacing different winners depending on whether you measure with or without the quantization tax.

- The prequantized results reveal the true kernel potential. Running with --pre-quantize removes quantization overhead entirely, and NVFP4 versus BF16 jumps from 1.98x (autocast) to 3.48x (kernel-only). This shows the FP4 tensor cores are delivering real speedups.

How to Optimize Transformer-Based Models for Low-Precision Training - NVIDIA Developer Blog

That kernel-only mode is the proof. Run with `--pre-quantize` and the comparison of NVFP4 to BF16 leaps from 1.98x to 3.48x. The FP4 tensor cores are genuinely fast. The quantization process is just expensive.
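The two ratios also give a rough size for that expense. The sketch below assumes both speedups are measured against the same BF16 layer time, and that everything separating the autocast and kernel-only runs is overhead removed by `--pre-quantize`. Under those assumptions, the share of NVFP4 autocast time spent outside the kernels is 1 - 1.98/3.48. The function name is illustrative.

```python
def non_kernel_share(autocast_speedup: float, kernel_only_speedup: float) -> float:
    """Fraction of autocast layer time that disappears in kernel-only mode.

    Times are normalized so that the shared baseline (assumed identical
    for both ratios) takes 1.0 unit.
    """
    autocast_time = 1.0 / autocast_speedup
    kernel_time = 1.0 / kernel_only_speedup
    return (autocast_time - kernel_time) / autocast_time


# NVFP4 versus BF16, as quoted: 1.98x in autocast, 3.48x kernel-only.
share = non_kernel_share(1.98, 3.48)
print(f"{share:.1%} of NVFP4 autocast time is outside the kernels")  # 43.1%
```

That is an estimate under the stated assumptions, not a figure reported by the source, but it shows the scale of the quantization tax implied by the two published ratios.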

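So the headline 1.47x is the system's answer. The 3.48x is the hardware's potential. Both are true, but they are not measured on the same scale: the 3.48x is NVFP4 against BF16 with quantization stripped out, while the 1.47x is a blended Fprop figure that the source ties to the NVFP4-versus-MXFP8 gap. Optimizing now means knowing which number you're looking at, and which one you actually need to move.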

Common Questions Answered

Why does NVFP4 training show only 1.47× speedup instead of the higher kernel-only performance?

The 1.47× figure is the blended forward-pass (Fprop) speedup once the attention output GEMM is included, which pulls the blend down from the faster individual kernels. Weight-gradient (Wgrad) GEMMs, non-GEMM overhead, and the cost of quantizing to NVFP4 add further time, and the end-to-end training gap between NVFP4 and MXFP8 is consistent with these kernel-level numbers.

What is the difference between the 1.47× speedup and the 3.48× speedup mentioned for NVFP4?

The 1.47× figure is the blended Fprop speedup with the attention output GEMM included. The 3.48× figure is NVFP4 versus BF16 in kernel-only mode, measured with the `--pre-quantize` flag so that quantization overhead is removed. With quantization included (autocast mode), that same NVFP4-versus-BF16 comparison is 1.98×. The two headline numbers answer different questions and should not be read as points on one scale.

How does the quantization process impact NVFP4 training performance?

Quantizing to NVFP4 is a significant cost. Running with `--pre-quantize` removes that overhead, and the NVFP4-versus-BF16 speedup rises from 1.98× in autocast mode to 3.48× kernel-only. The FP4 tensor cores are genuinely fast, and the quantization step accounts for the difference between those two figures.

What does the attention output GEMM contribute to the overall NVFP4 training bottleneck?

Including the attention output GEMM in the forward-pass total lowers the blended Fprop speedup to 1.47×. It is one of several factors, alongside Wgrad time, non-GEMM overhead, and NVFP4 quantization cost, that keep end-to-end NVFP4 training below its kernel-only numbers. The 3.48× kernel-only figure is measured against BF16, so it is not the starting point of the 1.47× blend.
