Fused kernels boost MoE training, forward and backward...

By AI Daily Post. Edited by Brian Petersen, Editor-in-Chief.

June 15, 2026 • Updated: July 15, 2026 • 3 min read

Training a large Mixture-of-Experts model often feels like herding cats on a supercomputer. The GPU is constantly starting and stopping tiny tasks, stuck waiting for messages between experts, and never really working at full capacity. A simple idea is fixing this: fuse the tiny tasks into bigger, more coherent chunks of work.

New fused kernels do exactly that. They bundle operations so the GPU works more continuously, delivering up to a 1.3x speedup on forward passes and up to a 2.1x speedup on backward passes. These gains are not meant to hold only in isolation: the system is designed to play nice with the chaotic reality of distributed training.
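To make "fusing" concrete, here is a minimal pure-Python sketch. It is not the CuTe DSL kernels the article describes; the function names and the scale, bias, and ReLU steps are hypothetical stand-ins for the small operations inside an MLP. Each list comprehension in the unfused version plays the role of a separate kernel that writes an intermediate result, while the fused version does the same math in a single pass.

```python
from typing import List


def unfused_mlp_tail(xs: List[float], scale: float, bias: float) -> List[float]:
    """Three separate passes, each producing a full intermediate list."""
    scaled = [x * scale for x in xs]          # pass 1: writes an intermediate
    shifted = [s + bias for s in scaled]      # pass 2: reads it, writes another
    return [max(0.0, v) for v in shifted]     # pass 3: ReLU


def fused_mlp_tail(xs: List[float], scale: float, bias: float) -> List[float]:
    """One pass: every element is read once and written once."""
    return [max(0.0, x * scale + bias) for x in xs]


inputs = [-2.0, -0.5, 0.0, 1.5, 3.0]
unfused_out = unfused_mlp_tail(inputs, scale=2.0, bias=1.0)
fused_out = fused_mlp_tail(inputs, scale=2.0, bias=1.0)
print(unfused_out == fused_out)
```

Both versions return the same values. On a GPU, the benefit of the fused form would come from fewer kernel launches and fewer round trips to memory for intermediates, which this sketch does not measure.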

"To push these boundaries, we are introducing advanced fused MLP kernels for dense and MoE models, custom-built with the CuTe DSL."

Source: Boosting MoE Training Throughput with Advanced Fusion Kernels, NVIDIA Developer Blog

The clever part is in the concessions. The configurable cluster margin lets you intentionally leave some GPU streaming multiprocessors (SMs) idle. This isn't waste: it creates room for communication and other tasks to run without a traffic jam. Dynamic scheduling then uses that room to overlap computation with the essential chatter between experts and across data-parallel groups.
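The arithmetic behind a margin can be sketched in a few lines. The article does not say how the cluster margin is specified, so this example assumes it is given as a number of SM clusters to hold back; the function name, its parameters, and the example values are all hypothetical.

```python
def compute_sm_budget(total_sms: int, cluster_size: int, margin_clusters: int) -> int:
    """Return how many SMs remain for the compute kernel after the margin.

    Assumption: SMs are grouped into clusters of `cluster_size`, and the
    margin is expressed as a count of whole clusters left idle.
    """
    if cluster_size <= 0:
        raise ValueError("cluster_size must be positive")
    total_clusters = total_sms // cluster_size
    if not 0 <= margin_clusters <= total_clusters:
        raise ValueError("margin_clusters must be between 0 and the cluster count")
    return (total_clusters - margin_clusters) * cluster_size


# Illustrative values only: 132 SMs, clusters of 2, 4 clusters held back.
compute_sms = compute_sm_budget(total_sms=132, cluster_size=2, margin_clusters=4)
reserved_sms = 132 - compute_sms
print(compute_sms, reserved_sms)
```

With these example values, 124 SMs run the fused kernel and 8 stay free for communication and other work. The trade-off is deliberate: slightly less raw compute in exchange for fewer stalls elsewhere in the pipeline.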

This approach treats the GPU as a shared, busy workshop, not a single assembly line. The goal isn't to max out one metric, but to smooth out the entire training pipeline. For MoE models, where the real enemy is often coordination overhead, that's the only kind of speed that matters.

Common Questions Answered

What performance improvement do fused kernels provide for Mixture-of-Experts model training?

Fused kernels deliver up to a 1.3x speedup on forward passes and up to a 2.1x speedup on backward passes by bundling small operations into larger, more coherent chunks of work. This allows the GPU to work more continuously instead of constantly starting and stopping tiny tasks, improving overall training efficiency.

How do fused kernels solve the GPU utilization problem in MoE training?

Fused kernels address the GPU underutilization issue by consolidating multiple small operations into bigger computational units that keep the GPU working at higher capacity. Instead of the GPU being stuck waiting for messages between experts, the fused approach maintains more continuous work streams and reduces idle time.

What role does the configurable cluster margin play in the fused kernel optimization?

The configurable cluster margin intentionally leaves some GPU streaming multiprocessors idle to create space for communication and other essential tasks to occur without causing congestion. This strategic idle space allows dynamic scheduling to overlap computation with inter-expert communication and data-parallel group synchronization, preventing traffic jams on the GPU.

How does dynamic scheduling work with fused kernels in MoE models?

Dynamic scheduling uses the idle space created by the configurable cluster margin to overlap computation with essential communication between experts and across data-parallel groups. This approach treats the GPU as a shared workshop rather than a single assembly line, enabling more efficient coordination of both computational and communication tasks.
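The article does not describe the scheduler's internals, so the following is only a generic sketch of the idea behind dynamic scheduling: instead of assigning work up front, each worker pulls the next tile as soon as it is free. The tile costs and worker count are made-up illustrative values, and the model covers compute tiles only, not communication.

```python
import heapq
from typing import List


def static_makespan(tile_costs: List[int], num_workers: int) -> int:
    """Round-robin assignment decided before any work starts."""
    loads = [0] * num_workers
    for i, cost in enumerate(tile_costs):
        loads[i % num_workers] += cost
    return max(loads)


def dynamic_makespan(tile_costs: List[int], num_workers: int) -> int:
    """Each tile goes to whichever worker becomes free first."""
    finish_times = [0] * num_workers
    heapq.heapify(finish_times)
    for cost in tile_costs:
        earliest = heapq.heappop(finish_times)   # next worker to become idle
        heapq.heappush(finish_times, earliest + cost)
    return max(finish_times)


# Uneven tiles, as when some experts receive far more tokens than others.
tiles = [8, 1, 1, 1, 8, 1, 1, 1]
static_time = static_makespan(tiles, num_workers=4)
dynamic_time = dynamic_makespan(tiles, num_workers=4)
print(static_time, dynamic_time)
```

In this toy case the static plan piles both heavy tiles on one worker, while the dynamic plan spreads them out and finishes sooner. The same pull-when-free pattern is what lets a scheduler slot work around communication rather than wait on a fixed plan.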
