Zhipu AI employs Muon Optimizer and Muon Split in GLM-4.5 and GLM-5 pretraining

DeepSpeed just added support for the Muon optimizer, a tool that’s gaining traction in frontier AI labs. Moonshot AI, for example, has already integrated Muon into the training pipeline for its Kimi‑K2‑Thinking foundation model. Muon is built specifically for the hidden 2‑D weight matrices that dominate transformer architectures.

It takes the gradient, folds it into a single momentum buffer, then runs Newton‑Schulz iterations to orthogonalize that momentum before updating the weight. Because it keeps only one buffer per weight, where Adam keeps two, it trims the memory footprint of optimizer states.
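The update described above can be sketched in a few lines of dependency-free Python. This is a minimal illustration under stated assumptions, not DeepSpeed's or any lab's implementation: the function names, the learning rate, and the momentum coefficient are hypothetical, and the classical cubic Newton‑Schulz iteration stands in for whatever tuned variant a production Muon implementation uses.

```python
# Illustrative sketch of a Muon-style step for one 2-D weight matrix.
# Names and hyperparameter values are assumptions, not a real library API.

def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    cols = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols] for row in a]


def transpose(a):
    """Return the transpose of a matrix stored as a list of rows."""
    return [list(col) for col in zip(*a)]


def newton_schulz(m, steps=30, eps=1e-7):
    """Approximately orthogonalize m with the classical cubic iteration
    X <- 1.5 * X - 0.5 * (X X^T) X, which pushes singular values toward 1."""
    # Dividing by the Frobenius norm keeps every singular value at or below 1,
    # inside the region where the iteration converges.
    norm = sum(v * v for row in m for v in row) ** 0.5 + eps
    x = [[v / norm for v in row] for row in m]
    for _ in range(steps):
        cubic = matmul(matmul(x, transpose(x)), x)
        x = [[1.5 * a - 0.5 * b for a, b in zip(xr, cr)] for xr, cr in zip(x, cubic)]
    return x


def muon_step(weight, grad, momentum, lr=0.02, beta=0.95):
    """One step: accumulate momentum, orthogonalize it, apply the update."""
    # The momentum buffer is the only optimizer state kept for this weight.
    momentum = [[beta * m + g for m, g in zip(mr, gr)] for mr, gr in zip(momentum, grad)]
    update = newton_schulz(momentum)
    weight = [[w - lr * u for w, u in zip(wr, ur)] for wr, ur in zip(weight, update)]
    return weight, momentum
```

Calling `muon_step` with a gradient whose two rows differ in scale by a factor of 30 yields an update whose rows have nearly equal length, which is the equalizing effect of the orthogonalization. The pure-Python matrix helpers keep the sketch self-contained; a real implementation would run on GPU tensors.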

While the math sounds dense, the payoff shows up in benchmarks. In NanoGPT speed‑running tests Muon was about 35% faster than AdamW, and at the 1.5 B‑parameter scale it hit GPT‑2 XL‑level performance roughly 25% faster. The orthogonalization step matters: gradient updates for 2‑D weights often have huge condition numbers, meaning a few singular directions dominate.

By equalizing the singular values of the update, Muon amplifies rare but important directions that AdamW would otherwise mute. The result is a more sample‑efficient pretraining process that could reshape how large models are built.
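The effect on the condition number can be written out with a singular value decomposition. The notation below (G for the momentum matrix, σ for singular values, κ for the condition number, p for the iteration polynomial) is introduced here for illustration and does not come from the original announcement. The polynomial shown is the classical cubic Newton‑Schulz step; practical implementations may use a different polynomial and only approximate the exact limit.

```latex
G = U \Sigma V^{\top}, \qquad
\Sigma = \operatorname{diag}(\sigma_1, \dots, \sigma_r), \qquad
\kappa(G) = \frac{\sigma_1}{\sigma_r}

% One cubic Newton-Schulz step acts on each singular value separately:
X_{k+1} = \tfrac{3}{2} X_k - \tfrac{1}{2} X_k X_k^{\top} X_k
        = U \, p(\Sigma_k) \, V^{\top}, \qquad
p(\sigma) = \tfrac{3}{2}\sigma - \tfrac{1}{2}\sigma^{3}, \qquad p(1) = 1

% In the limit every nonzero singular value reaches 1:
X_k \to U V^{\top}, \qquad \kappa(U V^{\top}) = 1
```

As a worked case, a matrix with singular values 10 and 0.1 has a condition number of 100, so its weaker direction carries 1% of the weight of the stronger one. After exact orthogonalization both singular values equal 1 and the two directions contribute equally to the update.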

More recently, Zhipu AI's GLM-5 (744B parameters) confirmed the use of Muon Optimizer in both GLM-4.5 and GLM-5 pretraining, along with a "Muon Split" technique that splits MLA (Multi-head Latent Attention) up-projection matrices by attention head and orthogonalizes each head independently, addressing a performance gap between MLA and GQA (Grouped-Query Attention) when using Muon. DeepSeek-V4 (1.6T parameters) also employs the Muon Optimizer for faster convergence and greater training stability.

Muon Optimizer support in DeepSpeed

One of the challenges of applying the Muon optimizer to DeepSpeed is that previous optimizers (SGD, Adam) look at gradients as flattened buffers, whereas Muon's orthogonalization operates on each weight matrix in its 2-D shape.
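Both points in this passage depend on knowing the matrix shape, and a small sketch can show why. The code below restores a flat gradient buffer to 2-D, cuts it into one row block per attention head, and orthogonalizes each block on its own. It is an illustration under assumptions, not Zhipu AI's Muon Split code or DeepSpeed's API: the helper names are hypothetical, heads are assumed to occupy contiguous row blocks, and the classical cubic Newton‑Schulz iteration is used as the orthogonalizer.

```python
# Illustrative sketch: per-head orthogonalization of a flattened 2-D gradient.
# All names and the row-block head layout are assumptions for this example.

def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    cols = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols] for row in a]


def transpose(a):
    """Return the transpose of a matrix stored as a list of rows."""
    return [list(col) for col in zip(*a)]


def newton_schulz(m, steps=30, eps=1e-7):
    """Approximately orthogonalize m via X <- 1.5 * X - 0.5 * (X X^T) X."""
    norm = sum(v * v for row in m for v in row) ** 0.5 + eps
    x = [[v / norm for v in row] for row in m]
    for _ in range(steps):
        cubic = matmul(matmul(x, transpose(x)), x)
        x = [[1.5 * a - 0.5 * b for a, b in zip(xr, cr)] for xr, cr in zip(x, cubic)]
    return x


def unflatten(flat, rows, cols):
    """Rebuild a (rows x cols) matrix from a flat buffer in row-major order."""
    if len(flat) != rows * cols:
        raise ValueError("buffer length does not match the requested shape")
    return [list(flat[r * cols:(r + 1) * cols]) for r in range(rows)]


def orthogonalize_per_head(matrix, num_heads):
    """Orthogonalize each head's row block independently of the others."""
    rows_per_head, remainder = divmod(len(matrix), num_heads)
    if remainder:
        raise ValueError("row count must be divisible by num_heads")
    result = []
    for head in range(num_heads):
        block = matrix[head * rows_per_head:(head + 1) * rows_per_head]
        # Each block is normalized by its own norm, so a head with small
        # gradients is not drowned out by a head with large ones.
        result.extend(newton_schulz(block))
    return result
```

A flat buffer alone does not say where one matrix row ends or where one head begins, so any Muon integration has to carry that shape information alongside the gradient. Orthogonalizing the whole matrix at once would instead couple the heads through a single shared normalization and a single set of singular directions.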

Why this matters

DeepSpeed is adding native support for the Muon optimizer, a tool already adopted by frontier labs such as Moonshot AI for its Kimi‑K2‑Thinking model and now confirmed in Zhipu AI’s GLM‑4.5 and GLM‑5 pre‑training pipelines. The optimizer targets hidden 2‑D weight matrices, and the accompanying “Muon Split” technique breaks MLA up‑projection matrices into per‑head components that are orthogonalized independently, a design said to address a performance gap between MLA and GQA under Muon. For developers, this means a new pathway to scale large language models without reinventing low‑level training loops, assuming the integration holds up under diverse workloads.

Researchers may find the head‑wise orthogonalization an interesting angle for probing attention dynamics, though the article does not detail empirical gains or trade‑offs. Founders should note the growing ecosystem around DeepSpeed extensions, yet it remains unclear whether Muon’s benefits will translate across different model architectures or hardware configurations. As we experiment with these tools, we must balance enthusiasm with careful validation of the reported improvements.
