Matmul Enables Dropless MoE Training; Grouped‑GEMM Kernel Drives Speed
Mixture‑of‑Experts layers let transformer models grow without a linear rise in compute, but the usual JAX/MaxText workflow still drops tokens that exceed an expert’s capacity. That shortcut keeps every expert’s tensors a fixed shape and trades model quality for speed. The dropless alternative, which keeps every token, runs into a memory wall that makes it impractical at production scale.
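To make the capacity trade-off concrete, here is a minimal NumPy sketch of top-1 routing under a capacity factor. The function name route_with_capacity, the first-come-first-served drop policy, and the toy router output are assumptions for illustration; this is not MaxText's routing code.

```python
import math
import numpy as np

def route_with_capacity(expert_ids, num_experts, capacity_factor):
    """Return (keep_mask, capacity) for top-1 routing with a per-expert cap.

    Hypothetical helper: tokens are admitted in order, and any token that
    arrives after its expert is full is dropped.
    """
    num_tokens = len(expert_ids)
    capacity = math.ceil(capacity_factor * num_tokens / num_experts)
    keep = np.zeros(num_tokens, dtype=bool)
    used = np.zeros(num_experts, dtype=int)
    for t, e in enumerate(expert_ids):
        if used[e] < capacity:
            keep[t] = True
            used[e] += 1
    return keep, capacity

# A skewed router: expert 0 is selected by five of the eight tokens.
expert_ids = np.array([0, 0, 0, 0, 0, 1, 2, 3])
keep, capacity = route_with_capacity(expert_ids, num_experts=4, capacity_factor=1.0)
dropped = int((~keep).sum())

# The dropless path keeps every token, so per-expert group sizes are ragged.
group_sizes = np.bincount(expert_ids, minlength=4)
print(capacity, dropped, group_sizes.tolist())
```

With a capacity of two tokens per expert, three of expert 0's five tokens are discarded. The dropless path keeps all eight, at the cost of per-expert groups of unequal size, which is why a fixed-shape batched matmul no longer fits.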
AMD’s Primus‑Turbo steps in with two Composable Kernel‑backed primitives: a grouped GEMM that handles ragged, variable‑length expert matmuls, and a DeepEP dispatch/combine all‑to‑all for token‑aware routing. The kernels appear as first‑class JAX ops via the XLA FFI, preserving autodiff, sharding contracts, and numerical fidelity. In practice you enable the dropless path with just two flags in MaxText, letting the custom VJPs and once‑per‑process bootstrap do the heavy lifting.
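What a grouped GEMM computes over ragged expert groups can be written as a short NumPy reference. This is a sketch of the semantics only, assuming tokens are sorted by expert first; grouped_gemm_reference is a hypothetical name, not the Primus‑Turbo or Composable Kernel API, and a real kernel fuses the per-expert loop into one launch.

```python
import numpy as np

def grouped_gemm_reference(x_sorted, weights, group_sizes):
    """Reference semantics of a grouped GEMM (hypothetical helper).

    x_sorted:    (num_tokens, d_in), rows already sorted by expert
    weights:     (num_experts, d_in, d_out)
    group_sizes: (num_experts,), number of rows owned by each expert
    """
    out = np.empty((x_sorted.shape[0], weights.shape[2]), dtype=x_sorted.dtype)
    start = 0
    for e, size in enumerate(group_sizes):
        # Each expert multiplies only its own contiguous, variable-length slice.
        out[start:start + size] = x_sorted[start:start + size] @ weights[e]
        start += size
    return out

rng = np.random.default_rng(0)
num_experts, d_in, d_out = 4, 8, 16
expert_ids = np.array([2, 0, 0, 3, 0, 1, 0, 0])  # toy top-1 router output
x = rng.standard_normal((len(expert_ids), d_in))
w = rng.standard_normal((num_experts, d_in, d_out))

# Sort tokens by expert, run the grouped matmul, then undo the sort.
order = np.argsort(expert_ids, kind="stable")
group_sizes = np.bincount(expert_ids, minlength=num_experts)
y_sorted = grouped_gemm_reference(x[order], w, group_sizes)
y = np.empty_like(y_sorted)
y[order] = y_sorted

# Cross-check against a per-token matmul with each token's own expert.
expected = np.einsum("td,tdo->to", x, w[expert_ids])
print(np.allclose(y, expected))
```

No padding and no dropped rows are needed here: the group sizes carry the ragged structure, so no expert has to be padded to a fixed capacity. JAX's jax.lax.ragged_dot takes a similar (lhs, rhs, group_sizes) set of arguments.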
Early results show the dropless route can outpace the capacity‑factor default on both throughput and convergence, turning what was once infeasible on AMD Instinct GPUs into a faster, more memory‑efficient default.
But it is the matmul that lets you be dropless at all, and a well-tuned grouped-GEMM kernel is the single most important primitive for fast dropless MoE training (the approach MegaBlocks introduced for GPUs).
DeepEP
With the matmul wall handled by the grouped GEMM, the routing all-to-all is the one that's left. DeepEP is an expert-parallel communication library: a pair of dispatch and combine kernels that implement the MoE all-to-all in a token-aware way.
- dispatch sends each rank's local tokens to the rank(s) owning their selected experts (topk_idx), over NVLink/xGMI for intranode communication and RDMA for internode communication. Its receive buffer is still worst-case (num_tokens * ep_size); DeepEP doesn't escape the pessimistic allocation of dropless routing, but it manages that buffer more leanly (chunked send/recv, fewer intermediate copies than a generic ragged_all_to_all), so the transient footprint is somewhat smaller.
- combine is the exact reverse: it sends each expert's outputs back to the ranks that contributed the tokens and reduces (sums) them at the destination.
The figure below shows this dispatch → expert-compute → combine round-trip across GPUs. dispatch returns an opaque handle describing the communication layout (rank/channel prefix matrices, source indices, send heads).
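The dispatch → expert-compute → combine round-trip can be simulated in a single process with NumPy. This is a toy model under stated assumptions, not the DeepEP API: the dispatch, expert_compute, and combine functions, the contiguous expert-to-rank mapping, the scalar "experts", and the plain Python list standing in for the opaque handle are all hypothetical.

```python
import numpy as np

def dispatch(tokens_per_rank, topk_idx_per_rank, experts_per_rank):
    """Send each token once to every rank that owns one of its selected experts."""
    ep_size = len(tokens_per_rank)
    recv = [[] for _ in range(ep_size)]  # recv[dst] = [(src_rank, src_token, experts, x)]
    for src, (tokens, topk_idx) in enumerate(zip(tokens_per_rank, topk_idx_per_rank)):
        for t, experts in enumerate(topk_idx):
            by_dst = {}
            for e in experts:
                # Assumption: experts are laid out contiguously across ranks.
                by_dst.setdefault(int(e) // experts_per_rank, []).append(int(e))
            for dst, local_experts in by_dst.items():
                recv[dst].append((src, t, local_experts, tokens[t]))
    return recv

def expert_compute(recv, expert_fn):
    """Apply each selected local expert and pre-sum the results on the owning rank."""
    return [[(src, t, sum(expert_fn(e, x) for e in experts))
             for src, t, experts, x in per_rank] for per_rank in recv]

def combine(expert_outputs, tokens_per_rank):
    """Reverse of dispatch: return outputs to the source rank and sum them there."""
    out = [np.zeros_like(tokens) for tokens in tokens_per_rank]
    for per_rank in expert_outputs:
        for src, t, y in per_rank:
            out[src][t] += y
    return out

ep_size, experts_per_rank, num_tokens, d = 2, 2, 3, 4
rng = np.random.default_rng(0)
tokens_per_rank = [rng.standard_normal((num_tokens, d)) for _ in range(ep_size)]
topk_idx_per_rank = [np.array([[0, 1], [0, 2], [2, 3]]),
                     np.array([[1, 3], [0, 1], [0, 1]])]

recv = dispatch(tokens_per_rank, topk_idx_per_rank, experts_per_rank)
outputs = expert_compute(recv, lambda e, x: (e + 1) * x)  # toy expert e scales by e + 1
combined = combine(outputs, tokens_per_rank)

worst_case_rows = num_tokens * ep_size  # the pessimistic receive-buffer bound
received_rows = [len(r) for r in recv]
print(received_rows, worst_case_rows)
```

In this toy run the two ranks receive five and three token rows, against a worst-case bound of six rows each. That gap is the reason the receive buffer must still be sized pessimistically even when the actual traffic is smaller. Gate weighting of expert outputs is left out so that combine stays a plain sum.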
Why this matters
Dropless MoE training in JAX now has a concrete path forward, thanks to a grouped‑GEMM kernel that tackles the matmul bottleneck. The article describes a well‑tuned grouped‑GEMM kernel, following the approach MegaBlocks introduced for GPUs, as the single most important primitive for fast dropless MoE training. For developers, that means every token can stay in the forward pass, without the quality loss incurred by dropping overflow tokens. It also suggests a simpler memory layout, since experts no longer need fixed‑shape tensors that discard data.
Founders, however, should note that the routing all‑to‑all step is the remaining performance hurdle. The article flags it as the next obstacle once the matmul wall is removed. Researchers may find the approach attractive for scaling transformers, yet the actual impact on end‑to‑end training time is still unclear.
We appreciate the engineering advance, but we remain cautious until broader benchmarks confirm that the routing overhead doesn’t erode the gains from the grouped‑GEMM. Until then, the technique offers a promising, though not yet fully proven, option for those building large‑scale MoE systems.
Further Reading
- Training MoEs at Scale with PyTorch - PyTorch Blog
- MegaBlocks: Efficient Sparse Training with Mixture-of-Experts - arXiv
- MoE Training Optimization — Megatron Bridge - NVIDIA Docs
- Mixture of experts with Dropless Computation - SugiV Blog
- Explore Mixture of Experts (MoE) inference support for Neuron - AWS Docs