NVIDIA Blackwell and Rubin GPUs accelerating JAX/MaxText training with NVFP4 recipe for faster AI model development and perfo

Editorial illustration for NVFP4 recipe speeds JAX/MaxText training on NVIDIA Blackwell and Rubin

NVFP4 recipe speeds JAX/MaxText training on NVIDIA...

By AI Daily Post Edited by Brian Petersen, Editor-in-Chief

June 8, 2026 • Updated: July 15, 2026 • 3 min read

Training large language models at 4-bit precision without sacrificing accuracy has long seemed like a zero-sum game, until now. The NVFP4 recipe rewrites the rules, delivering a massive throughput leap on NVIDIA Blackwell and the upcoming Rubin platform while preserving convergence. How?

By rethinking the fundamentals: micro block scaling halves the outlier problem, and E4M3 block scale factors layer precise mantissa-based scaling under a per-tensor FP32 anchor. The result is stark, MXFP4 needs about 36% more tokens to match NVFP4’s final loss on an 8B-parameter, 1T-token run. This isn’t just an incremental improvement; it’s a blueprint for training models faster with JAX and MaxText, unlocking real performance where it counts.

To address this, the NVFP4 training recipe in TransformerEngine uses subbyte precision for JAX pretraining. For an end-to-end example, see the recipe in MaxText, a high-performance, scalable LLM framework library. The result is high-throughput, 4-bit mixed-precision pre-training on NVIDIA Blackwell with no measurable accuracy loss compared to the FP8 baseline.

Train Models Faster with JAX and MaxText Using NVFP4 on NVIDIA Blackwell - NVIDIA Developer Blog

The result is a recipe that does not just squeeze more speed from the hardware, it redefines what’s possible at 4-bit precision. By halving the block size and reclaiming mantissa bits for the scale factor, NVFP4 sidesteps the compromises that have long held narrow-precision training back. The gap is stark: MXFP4 needs 36% more tokens to match NVFP4’s final loss.

That is not a marginal gain. It is a structural advantage that compounds at scale, turning every hour of training on Blackwell or Rubin into a step closer to convergence rather than a tax paid to numerical instability. And because the recipe is built into JAX and MaxText, it lands in the hands of practitioners without friction.

As models grow and hardware accelerates, the bottleneck shifts from raw FLOPs to the efficiency of each FLOP. NVFP4 attacks that bottleneck head-on. The next plateau in AI training will not be reached by brute force alone, it will be reached by precision that wastes nothing.

Common Questions Answered

What is the NVFP4 recipe and how does it improve 4-bit precision training?

The NVFP4 recipe is a training approach that enables large language models to be trained at 4-bit precision without sacrificing accuracy or convergence. It achieves this by implementing micro block scaling that halves the outlier problem and reclaims mantissa bits for the scale factor, fundamentally rethinking how narrow-precision training operates.

How does NVFP4 compare to MXFP4 in terms of training efficiency?

NVFP4 significantly outperforms MXFP4, requiring 36% fewer tokens to achieve the same final loss. This structural advantage compounds at scale, making it a substantial improvement rather than a marginal gain when training on NVIDIA Blackwell or Rubin platforms.

Which NVIDIA hardware platforms support the NVFP4 recipe?

The NVFP4 recipe is optimized for NVIDIA Blackwell and the upcoming Rubin platform, delivering massive throughput improvements on these architectures. The recipe is specifically designed to work with JAX/MaxText training frameworks on these next-generation NVIDIA processors.

What technical innovation allows NVFP4 to maintain accuracy at 4-bit precision?

NVFP4 uses micro block scaling to reduce outlier problems and reallocates mantissa bits to the scale factor, addressing the fundamental limitations that have historically constrained narrow-precision training. This approach sidesteps the traditional compromises between speed and accuracy that plagued previous 4-bit training methods.

Ship an AI product this weekend — no engineers required.

Structured, in-depth lessons on the exact no-code tools — not scattered tutorials.

The exact platforms, taught in depth
Build real, working projects
Our honest review + a reader discount

Read the review →

NVFP4 recipe speeds JAX/MaxText training on NVIDIA...

Common Questions Answered

What is the NVFP4 recipe and how does it improve 4-bit precision training?

How does NVFP4 compare to MXFP4 in terms of training efficiency?

Which NVIDIA hardware platforms support the NVFP4 recipe?

What technical innovation allows NVFP4 to maintain accuracy at 4-bit precision?

Further Reading

Ship an AI product this weekend — no engineers required.

Latest News

OpenAI adds voice control to desktop Codex and ChatGPT

New Bill Would Let US Government Order Shutdown of AI Systems

Andrew Ng's OpenWorker Desktop AI Returns Finished Work, Uses Local Models

Florida Pastor Sues OpenAI Over Alleged ChatGPT Medical Advice

Survey Finds RAG Is the Default Context Source for Enterprise AI Agents

AMD Unveils Helios AI Rack System for Data Centers

Free ChatGPT Users Get Worse Health Advice From Older AI Model

Black Forest Labs Launches FLUX 3 for Images and Audio-Video

Multi-turn attacks break AI models 88% of the time, Cisco warns

Over Half of Enterprises Report AI Agent Security Incidents

Related Reading

ChatGPT's 'Nerdy' tweak rewards goblin metaphors in answers, study finds

Google tests visual 'magazine-style' UI for Gemini 3 Pro users

AI Engineers Face Rising Costs, Need New Strategies for Efficiency

NVIDIA and Google Cloud let developers scale AI from prototype to production

NVIDIA NeMo powers telco reasoning model for autonomous network workflows

Weaker LLMs Accidentally Delete Content, Shrinking Documents Over Time

Four New Specific Techniques to Boost Productivity with Claude Code

SpaceX inks USD 920 M/month deal with Google for 110,000 Nvidia AI chips

Nvidia plans PC reboot, Apple unveils smart glasses on Vergecast

Common Questions Answered

What is the NVFP4 recipe and how does it improve 4-bit precision training?

How does NVFP4 compare to MXFP4 in terms of training efficiency?

Which NVIDIA hardware platforms support the NVFP4 recipe?

What technical innovation allows NVFP4 to maintain accuracy at 4-bit precision?

Further Reading

Ship an AI product this weekend — no engineers required.

Latest News

OpenAI adds voice control to desktop Codex and ChatGPT

New Bill Would Let US Government Order Shutdown of AI Systems

Andrew Ng's OpenWorker Desktop AI Returns Finished Work, Uses Local Models

Florida Pastor Sues OpenAI Over Alleged ChatGPT Medical Advice

Survey Finds RAG Is the Default Context Source for Enterprise AI Agents

AMD Unveils Helios AI Rack System for Data Centers

Free ChatGPT Users Get Worse Health Advice From Older AI Model

Black Forest Labs Launches FLUX 3 for Images and Audio-Video

Multi-turn attacks break AI models 88% of the time, Cisco warns

Over Half of Enterprises Report AI Agent Security Incidents