Fused kernels boost MoE training: forward passes up to 1.3×, backward passes up to 2.1×
Mixture‑of‑experts (MoE) models are now a staple of large‑scale AI, letting engineers expand capacity while only a slice of parameters fires for each token. That efficiency makes them attractive, but as models grow the MoE block itself can become a choke point. NVIDIA’s team has built fused MLP kernels, both dense and MoE‑specific, using the CuTe DSL to attack memory and synchronization stalls head‑on.
Beyond speeding up raw kernel compute by 1.3× to 2×, the kernels also enable sync‑free execution inside full‑iteration CUDA Graphs. The payoff shows up in real workloads: DeepSeek‑V3 pre‑training sees an 8 % end‑to‑end speed gain, and GPT‑OSS pre‑training clocks a 93 % overall improvement. These kernels are already in the cuDNN Frontend and can be pulled in through Transformer Engine or Megatron‑Core.
To get here, the engineers mapped the MoE iteration timeline, isolated three dominant bottlenecks, and rewrote the stack with a hardware‑aware software approach that keeps Tensor Cores constantly busy. The result is a more streamlined path from model design to training throughput.
From kernel-level gains to pretraining speedups
Across unit-level microbenchmarks, these fused kernels deliver a substantial speedup, accelerating the forward pass by up to 1.3x and the backward pass by up to 2.1x compared to traditional unfused execution paths. To translate these speedups into an end-to-end training throughput boost, they also support features such as:
- Dynamic Scheduling, which supports efficient overlap with other kernels, such as communication from expert parallelism, data parallelism, etc.
- Configurable Cluster Margin, which lets users reserve a configurable margin of SM resources by limiting the kernel to fewer SMs, leaving headroom for other kernels to launch and execute concurrently on the GPU.
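The Configurable Cluster Margin trade-off can be made concrete with a small sketch. This is not the cuDNN Frontend, Transformer Engine, or Megatron-Core API: the `plan_grid` helper, the `GridPlan` type, and the choice to count the margin in thread-block clusters are all assumptions made for illustration. It only shows the arithmetic of capping a persistent kernel below the full SM count and what that costs in extra passes over the work tiles.

```python
from dataclasses import dataclass
from math import ceil


@dataclass(frozen=True)
class GridPlan:
    usable_sms: int    # SMs the fused kernel is allowed to occupy
    reserved_sms: int  # SMs left free for concurrently running kernels
    waves: int         # passes needed for the capped grid to cover all tiles


def plan_grid(total_sms: int, cluster_size: int,
              cluster_margin: int, num_tiles: int) -> GridPlan:
    """Hypothetical helper: size a persistent kernel's grid with a margin.

    Assumption: the margin is expressed in clusters, and each SM processes
    one tile per wave. A real kernel may define both differently.
    """
    if cluster_size <= 0 or total_sms < cluster_size:
        raise ValueError("cluster_size must be in 1..total_sms")
    total_clusters = total_sms // cluster_size
    if not 0 <= cluster_margin < total_clusters:
        raise ValueError("cluster_margin must leave at least one cluster")

    usable_sms = (total_clusters - cluster_margin) * cluster_size
    waves = ceil(num_tiles / usable_sms)
    return GridPlan(usable_sms, total_sms - usable_sms, waves)


if __name__ == "__main__":
    # Illustrative numbers only: a 132-SM GPU, clusters of 2 SMs, 1000 tiles.
    print(plan_grid(132, 2, cluster_margin=0, num_tiles=1000))
    print(plan_grid(132, 2, cluster_margin=4, num_tiles=1000))
```

In this toy setup, reserving four clusters frees 8 SMs for communication kernels at the price of one extra wave of tiles. Whether that trade pays off depends on how much communication the freed SMs can hide.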
Why this matters
We’ve seen MoE models become a staple for scaling AI without blowing up compute budgets, thanks to their selective activation of parameters. The new fused kernels promise to tighten that budget further, delivering up to 1.3× faster forward passes and as much as 2.1× acceleration on the backward pass in microbenchmarks. That’s a noticeable lift for anyone wrestling with token‑level throughput.
Yet the article notes the need to “translate these speedups” to full‑scale pretraining, leaving it unclear whether end‑to‑end training time will shrink proportionally. For developers, the immediate benefit is a more efficient execution path for MoE layers, potentially freeing cycles for larger experiments. Founders might view the improvement as a modest cost‑saving lever rather than a breakthrough that reshapes budgeting.
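Why kernel-level speedups do not carry over one-to-one can be seen with Amdahl's law. The sketch below is a back-of-the-envelope model, not a measurement: the forward share of kernel time (one third) and the share of a training iteration spent in the fused MLP kernels (20 %) are made-up assumptions, and the function names are hypothetical. Only the 1.3x and 2.1x figures come from the microbenchmarks quoted above.

```python
def combined_kernel_speedup(fwd_share: float, fwd_speedup: float,
                            bwd_speedup: float) -> float:
    """Speedup of forward + backward together.

    fwd_share is the forward pass's share of the unfused kernel time;
    the backward pass takes the remaining 1 - fwd_share.
    """
    if not 0.0 <= fwd_share <= 1.0:
        raise ValueError("fwd_share must be in [0, 1]")
    return 1.0 / (fwd_share / fwd_speedup + (1.0 - fwd_share) / bwd_speedup)


def end_to_end_speedup(kernel_fraction: float, kernel_speedup: float) -> float:
    """Amdahl's law: only kernel_fraction of the iteration gets faster."""
    if not 0.0 <= kernel_fraction <= 1.0:
        raise ValueError("kernel_fraction must be in [0, 1]")
    if kernel_speedup <= 0.0:
        raise ValueError("kernel_speedup must be positive")
    return 1.0 / ((1.0 - kernel_fraction) + kernel_fraction / kernel_speedup)


if __name__ == "__main__":
    # Assumed: backward costs about twice the forward pass (fwd_share = 1/3).
    kernels = combined_kernel_speedup(1 / 3, fwd_speedup=1.3, bwd_speedup=2.1)
    # Assumed: the fused MLP kernels account for 20 % of an iteration.
    overall = end_to_end_speedup(0.20, kernels)
    print(f"kernel-level: {kernels:.2f}x, end-to-end: {overall:.2f}x")
```

Under these assumed shares, a combined kernel speedup of about 1.74x yields only about a 1.09x faster iteration. The larger the fraction of time spent outside the fused kernels, the smaller the end-to-end gain.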
Researchers should verify the gains on their own workloads, especially as model sizes keep climbing. In short, the kernels are a concrete optimization, but their real‑world impact remains to be measured across diverse training pipelines.