Matmul Enables Dropless MoE Training; Grouped‑GEMM...

By AI Daily Post • Edited by Brian Petersen, Editor-in-Chief

June 10, 2026 • Updated: July 4, 2026 • 3 min read

Dropless Mixture-of-Experts (MoE) training has long been haunted by a simple truth: no token may be skipped, so every routed token has to be paid for in compute. Expressing the expert computation as matrix multiplications over per-expert token groups of uneven size is what makes dropless possible at all, but without a well-tuned grouped-GEMM kernel to run them, that possibility remains academic. MegaBlocks proved this on GPUs.
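To make the idea concrete, here is a minimal reference sketch of what a grouped GEMM computes, written in plain Python rather than as a GPU kernel. The function names (`matmul`, `grouped_gemm`) and the toy weights are illustrative assumptions, not the MegaBlocks or Primus-Turbo API. Tokens are assumed to be already sorted by expert, and each contiguous group is multiplied by its own expert's weight matrix, whatever the group's size.

```python
# Reference semantics of a grouped GEMM in plain Python (no GPU, no JAX).
# All names are illustrative; this is not a real library's API.

def matmul(a, b):
    """Multiply an (m x k) matrix by a (k x n) matrix, both nested lists."""
    if not a:
        return []  # an expert that received no tokens produces no rows
    cols = list(zip(*b))  # columns of b
    return [[sum(x * y for x, y in zip(row, col)) for col in cols] for row in a]


def grouped_gemm(tokens, group_sizes, expert_weights):
    """tokens are sorted by expert; group_sizes[e] rows belong to expert e."""
    assert sum(group_sizes) == len(tokens), "dropless: every token is in a group"
    out, start = [], 0
    for size, weight in zip(group_sizes, expert_weights):
        out.extend(matmul(tokens[start:start + size], weight))
        start += size
    return out


# Three experts with 2x2 weights. Expert 1 gets no tokens, expert 2 gets three.
tokens = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0]]
group_sizes = [1, 0, 3]
expert_weights = [
    [[1.0, 0.0], [0.0, 1.0]],  # expert 0: identity
    [[9.0, 9.0], [9.0, 9.0]],  # expert 1: unused in this batch
    [[2.0, 0.0], [0.0, 2.0]],  # expert 2: doubles its input
]
result = grouped_gemm(tokens, group_sizes, expert_weights)
print(result)  # [[1.0, 2.0], [6.0, 8.0], [10.0, 12.0], [14.0, 16.0]]
```

A production kernel fuses these per-expert multiplications into one launch. The loop above only pins down the result that such a kernel has to reproduce when group sizes are uneven or zero.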

Now DeepEP tackles the remaining bottleneck, the routing all-to-all. Its dispatch and combine kernels move each token only to the ranks that host its selected experts, leaning on NVLink and RDMA. The receive buffer is still sized for the worst case, but leaner chunked transfers shrink the transient footprint.
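A back-of-the-envelope calculation shows why the two buffers behave so differently. The sketch below is a simplified model, not DeepEP's actual allocation logic: it assumes each token is delivered at most once per destination rank, uses 2-byte elements (for example bf16), and picks the rank count, token count, hidden size, and chunk size purely for illustration.

```python
# Simplified buffer model. All sizes and names are hypothetical assumptions.

def buffer_bytes(rows, hidden, bytes_per_elem=2):
    """Bytes needed to hold `rows` token vectors of width `hidden`."""
    return rows * hidden * bytes_per_elem


def worst_case_recv_rows(world_size, tokens_per_rank):
    """Worst case: every rank routes all of its tokens to the same rank.

    Assumes a token is delivered at most once per destination rank.
    """
    return world_size * tokens_per_rank


world_size = 8          # ranks in the expert-parallel group (assumed)
tokens_per_rank = 4096  # tokens each rank contributes per step (assumed)
hidden = 4096           # model width (assumed)
chunk_tokens = 256      # tokens staged per chunked send (assumed)

MIB = 2 ** 20
recv_mib = buffer_bytes(worst_case_recv_rows(world_size, tokens_per_rank), hidden) / MIB
unchunked_staging_mib = buffer_bytes(tokens_per_rank, hidden) / MIB
chunked_staging_mib = buffer_bytes(chunk_tokens, hidden) / MIB

print(f"worst-case receive buffer: {recv_mib} MiB")               # 256.0 MiB
print(f"staging, one big send:     {unchunked_staging_mib} MiB")  # 32.0 MiB
print(f"staging, chunked send:     {chunked_staging_mib} MiB")    # 2.0 MiB
```

Under these assumptions the receive buffer does not shrink, because it has to cover the most skewed routing possible, while the staging memory in flight falls with the chunk size.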

This is the architecture that turns dropless from a theory into a practical speed advantage.

You’ll learn how grouped GEMM and DeepEP work, how to integrate a custom kernel through JAX’s FFI — custom VJPs, sharding contracts, and a once-per-process bootstrap included — and how the dropless path stacks up against the capacity-factor default on both throughput and convergence. By the end, you’ll see how Primus-Turbo turns dropless MoE training on AMD Instinct GPUs from infeasible into a practical, faster, and more memory-efficient default.

Dropless MoE Training in JAX with Primus-Turbo - AMD ROCm AI Blog

The grouped-GEMM kernel tore down the matmul bottleneck, turning a theoretical advantage into a practical one. DeepEP then did the same for the all-to-all routing. It does not eliminate the worst-case buffer (physics and parallelism demand that), but it keeps the transient footprint tightly controlled.

It does so with chunked sends, fewer copies, and a leaner pipeline. Dispatch and combine, mirror images of each other, now complete a round trip that is both token-aware and expert-efficient. The formula is a well-tuned primitive for compute and a purpose-built library for communication.
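The round trip can be simulated on a single machine to show how the two halves mirror each other. This is a toy model in plain Python, not DeepEP's interface: the function names, the tuple layout, and the "expert" that simply scales its input by `expert_id + 1` are all assumptions made for illustration. Dispatch sends each token to the rank that owns each of its chosen experts, and combine returns the expert outputs to the source rank and sums them by gate weight.

```python
from collections import defaultdict

# Toy single-process model of dispatch -> experts -> combine.
# Names and data layout are hypothetical, not a real library's API.

def dispatch(tokens_by_rank, routes_by_rank, experts_per_rank):
    """Return inbox[dst_rank] = [(src_rank, token_idx, expert_id, gate, vec)]."""
    inbox = defaultdict(list)
    for src, (tokens, routes) in enumerate(zip(tokens_by_rank, routes_by_rank)):
        for idx, (vec, choices) in enumerate(zip(tokens, routes)):
            for expert_id, gate in choices:
                dst = expert_id // experts_per_rank  # rank that owns this expert
                inbox[dst].append((src, idx, expert_id, gate, vec))
    return inbox


def run_experts(inbox):
    """Toy expert: scale the token by (expert_id + 1)."""
    outbox = defaultdict(list)
    for rank, items in inbox.items():
        for src, idx, expert_id, gate, vec in items:
            scale = expert_id + 1
            outbox[rank].append((src, idx, gate, [scale * v for v in vec]))
    return outbox


def combine(outbox, tokens_by_rank):
    """Mirror of dispatch: send results home and sum them by gate weight."""
    out = [[[0.0] * len(vec) for vec in tokens] for tokens in tokens_by_rank]
    for items in outbox.values():
        for src, idx, gate, vec in items:
            acc = out[src][idx]
            for j, v in enumerate(vec):
                acc[j] += gate * v
    return out


# Two ranks, two experts per rank (experts 0-1 on rank 0, 2-3 on rank 1), top-2 routing.
tokens_by_rank = [[[1.0, 2.0]], [[4.0, 0.0], [0.0, 8.0]]]
routes_by_rank = [
    [[(0, 0.5), (3, 0.5)]],
    [[(1, 0.25), (2, 0.75)], [(0, 0.5), (1, 0.5)]],
]
inbox = dispatch(tokens_by_rank, routes_by_rank, experts_per_rank=2)
received = {rank: len(items) for rank, items in inbox.items()}
combined = combine(run_experts(inbox), tokens_by_rank)
print(received)  # {0: 4, 1: 2}
print(combined)  # [[[2.5, 5.0]], [[11.0, 0.0], [0.0, 12.0]]]
```

All six token-expert assignments arrive somewhere, even though rank 0 receives twice as many rows as rank 1. That imbalance is exactly what a dropless receive buffer has to be able to absorb.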

Together, they make dropless MoE training not just feasible, but fast. The wall has been moved.

Common Questions Answered

What is the key innovation that makes dropless Mixture-of-Experts training practical?

The grouped-GEMM kernel is the critical innovation: it removes the matmul bottleneck. Dropless MoE processes every token instead of skipping the ones that overflow an expert, which leaves each expert with a different number of tokens. The grouped-GEMM kernel turns that theoretical advantage into practical performance by running those uneven matrix multiplications efficiently, where they would otherwise consume excessive compute.
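A small count illustrates what "skipping" means under a capacity factor. The sketch uses the commonly cited capacity rule, ceil(capacity_factor × tokens × top_k / experts), and a made-up skewed routing of 16 tokens over 4 experts. The helper names and the numbers are assumptions for illustration only.

```python
import math
from collections import Counter

# Illustrative comparison of capacity-factor routing and dropless routing.
# Helper names and the sample routing are hypothetical.

def expert_capacity(num_tokens, num_experts, top_k, capacity_factor):
    """Maximum token slots per expert under a capacity factor."""
    return math.ceil(capacity_factor * num_tokens * top_k / num_experts)


def dropped_assignments(assignments, num_experts, capacity):
    """Count token-to-expert assignments that overflow their expert's capacity."""
    counts = Counter(assignments)
    return sum(max(0, counts[e] - capacity) for e in range(num_experts))


# 16 tokens with top-1 routing, heavily skewed toward expert 0.
assignments = [0] * 9 + [1] * 4 + [2] * 2 + [3] * 1
num_experts = 4

capacity = expert_capacity(len(assignments), num_experts, top_k=1, capacity_factor=1.25)
dropped = dropped_assignments(assignments, num_experts, capacity)

# Dropless routing keeps every assignment and hands the uneven sizes to a grouped GEMM.
counts = Counter(assignments)
group_sizes = [counts[e] for e in range(num_experts)]

print(capacity)     # 5
print(dropped)      # 4
print(group_sizes)  # [9, 4, 2, 1]
```

With a capacity of 5, four of the nine tokens routed to expert 0 would be dropped. The dropless path keeps all 16 and pays for them with uneven group sizes.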

How does DeepEP address the routing all-to-all bottleneck in MoE systems?

DeepEP tackles the all-to-all routing bottleneck with specialized dispatch and combine kernels that move each token only to the ranks hosting its selected experts. These kernels use NVLink and RDMA for the transfers and reduce the transient buffer footprint through chunked sends and fewer copies.

What problem did MegaBlocks solve and what remaining challenge does DeepEP address?

MegaBlocks proved that the grouped-GEMM kernel could overcome the matmul bottleneck on GPUs, making dropless MoE training feasible. DeepEP then tackled the remaining bottleneck by optimizing the routing all-to-all communication, completing the efficiency improvements needed for practical dropless MoE training at scale.

Why can't dropless MoE training completely eliminate worst-case buffer requirements?

Physics and parallelism still demand a receive buffer sized for the worst case, so that overhead cannot be eliminated entirely. What DeepEP reduces is the transient footprint, through chunked sends and fewer copies in its dispatch and combine operations.
