NVFP4 recipe speeds JAX/MaxText training on NVIDIA Blackwell and Rubin

When pre‑training frontier LLMs stretches across trillions of tokens and thousands of accelerators, every percentage point shaved off step time saves days of compute and significant cost. Low‑bit mixed‑precision pre‑training promises those savings, but it has a reputation for being finicky.

That’s where the NVFP4 recipe in TransformerEngine comes in, bringing sub‑byte precision to JAX workloads. The recipe is integrated into MaxText, Google’s high‑performance, scalable open‑source LLM framework, and it delivers 4‑bit mixed‑precision training on the NVIDIA Blackwell platform with no measurable accuracy loss versus the FP8 baseline. Native NVFP4 support on the GB300 Grace Blackwell Ultra Superchip boosts GEMM throughput by roughly 7× compared with native FP8 on Hopper.

The result? Shorter step times, negligible accuracy drift, and the ability for AI factories to push more or larger models within the same time budget. This post walks through the NVFP4 format, its two‑level microscaling, and the performance data that backs the claim, referencing the NVFP4 pre‑training paper for methodological depth.

NVFP4 pretraining recipe

The NVFP4 recipe combines several ingredients that together preserve convergence while unlocking NVFP4 throughput on NVIDIA Blackwell and the NVIDIA Rubin platform. The techniques were chosen for their performance and accuracy. Five key ingredients work together to maintain the accuracy required in 4-bit pretraining:

- Micro block scaling uses 16-element blocks, half the size of MXFP4's 32-element blocks, so a single outlier has less influence on the shared scale.

- E4M3 block scale factors carry mantissa bits, unlike MXFP4's power-of-two E8M0 scales, and are layered under a per-tensor FP32 scale.

In an 8B-parameter, 1T-token experiment, MXFP4 requires ~36% more tokens to match NVFP4's final loss.
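To make the two-level scaling concrete, here is a minimal pure-Python sketch of "fake quantization" (quantize, then dequantize) under the two schemes described above. It is an illustration only, not the TransformerEngine kernel or its API. The function names are hypothetical, the E4M3 rounding ignores subnormals, and rounding the power-of-two scale upward is an assumption made for this example.

```python
import math

# Magnitudes representable by an FP4 E2M1 element (sign is handled separately).
FP4_GRID = (0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0)
FP4_MAX = 6.0
E4M3_MAX = 448.0


def round_to_fp4(x):
    """Round x to the nearest FP4 magnitude, saturating at +/-6 (ties go to the smaller value)."""
    mag = min(abs(x), FP4_MAX)
    nearest = min(FP4_GRID, key=lambda g: abs(g - mag))
    return math.copysign(nearest, x)


def round_to_e4m3(scale):
    """Approximate E4M3 rounding of a positive scale: 4 significant bits, clamped to 448.
    Subnormals are ignored in this sketch."""
    if scale <= 0.0:
        return 0.0
    mantissa, exponent = math.frexp(scale)  # scale = mantissa * 2**exponent, 0.5 <= mantissa < 1
    mantissa = round(mantissa * 16) / 16    # 8 representable steps per power of two
    return min(math.ldexp(mantissa, exponent), E4M3_MAX)


def round_up_to_pow2(scale):
    """Power-of-two scale in the spirit of E8M0; rounding up is an assumption of this sketch."""
    if scale <= 0.0:
        return 0.0
    return 2.0 ** math.ceil(math.log2(scale))


def fake_quantize(values, block_size, round_scale, tensor_scale=1.0):
    """Quantize each block to FP4 with its own shared scale, then dequantize."""
    out = []
    for start in range(0, len(values), block_size):
        block = values[start:start + block_size]
        block_amax = max(abs(v) for v in block)
        # Level 1: per-block scale that maps the block maximum near FP4_MAX.
        block_scale = round_scale(block_amax / FP4_MAX / tensor_scale)
        scale = block_scale * tensor_scale
        for v in block:
            out.append(round_to_fp4(v / scale) * scale if scale > 0.0 else 0.0)
    return out


def nvfp4_like(values):
    """16-element blocks, E4M3 block scales, per-tensor scale kept in full precision."""
    tensor_amax = max(abs(v) for v in values)
    # Level 2: per-tensor scale chosen so the largest block scale fits in E4M3.
    tensor_scale = tensor_amax / (FP4_MAX * E4M3_MAX) if tensor_amax > 0.0 else 1.0
    return fake_quantize(values, 16, round_to_e4m3, tensor_scale)


def mxfp4_like(values):
    """32-element blocks with power-of-two block scales and no per-tensor scale."""
    return fake_quantize(values, 32, round_up_to_pow2)


def squared_error(reference, approx):
    return sum((r - a) ** 2 for r, a in zip(reference, approx))


# Toy input: 31 small values and one outlier in the last position.
x = [0.3] * 31 + [48.0]
err_nv = squared_error(x, nvfp4_like(x))
err_mx = squared_error(x, mxfp4_like(x))
print(f"NVFP4-like squared error: {err_nv:.4f}")
print(f"MXFP4-like squared error: {err_mx:.4f}")
```

In this toy input the outlier shares a scale with all 31 small values under the 32-element scheme and flattens them to zero. With 16-element blocks it affects only its own block, so the first 16 values keep a scale that fits them and the total error comes out lower. This says nothing about training convergence; it only illustrates why smaller blocks and finer-grained scales reduce quantization error.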

Why this matters

We see a concrete step toward cutting the wall‑clock time of massive LLM pre‑training. NVFP4’s sub‑byte precision, embedded in the TransformerEngine recipe for JAX and MaxText, promises higher throughput on Blackwell and Rubin without sacrificing convergence, at least in the presented example. For developers juggling thousands of accelerators, even a few percent saved per step can translate into days of compute and lower cloud bills.

Yet the article leaves open whether the same stability holds for diverse model architectures or longer training runs. Founders may appreciate the potential cost edge, but they must weigh integration effort against uncertain generality. Researchers gain a new knob to explore, though the recipe’s reliance on several tightly coupled techniques could limit portability to other frameworks.

Can the community replicate these gains without deep expertise in mixed‑precision tricks? We remain cautiously optimistic: the approach demonstrates that precision engineering still yields measurable benefits, but broader adoption will depend on clearer evidence of stability across workloads.

Further Reading