Tilde Research's Aurora optimizer beats Muon and NorMuon at 340M scale

Tilde Research has released Aurora, a new optimizer that patches a hidden flaw in Muon. Muon earned praise for beating AdamW in wall-clock time to convergence on the nanoGPT speedrun, but it also quietly kills off a sizable share of MLP neurons during training, leaving them permanently dead. Aurora targets that problem head-on, and the team backs it with a 1.1-billion-parameter pretraining run and a new state-of-the-art result on the modded-nanoGPT speedrun benchmark. The code is open source, so anyone can test the claims.

To understand why Aurora matters, recall Muon's core step: it computes the polar factor of the gradient matrix. Given a thin SVD G = UΣVᵀ, Muon forms polar(G) = UVᵀ and updates the weights as W ← W − η UVᵀ. It approximates the polar factor with iterative, matmul-only algorithms that scale well. Before Aurora, NorMuon introduced a row-normalization tweak, similar to Adam's per-parameter scaling, that improved speedrun results, yet the reason it helped remained murky.
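To make that step concrete, here is a minimal NumPy sketch. It computes the polar factor exactly from a thin SVD, then approximates it with the classical cubic Newton-Schulz iteration, which uses only matrix multiplications. The matrix sizes, the learning rate `eta`, and the function names are assumptions for illustration. The cubic iteration stands in for whatever tuned iteration Muon actually uses, and the sketch leaves out details such as momentum.

```python
import numpy as np

def polar_svd(G):
    """Exact polar factor U Vᵀ from a thin SVD G = U Σ Vᵀ."""
    U, _, Vt = np.linalg.svd(G, full_matrices=False)
    return U @ Vt

def polar_newton_schulz(G, steps=30):
    """Matmul-only approximation of the polar factor (classical cubic iteration).

    Illustrative stand-in, not Muon's tuned iteration.
    """
    # Dividing by the Frobenius norm puts every singular value in (0, 1],
    # which is inside the convergence region of the cubic iteration.
    X = G / np.linalg.norm(G)
    for _ in range(steps):
        # Each step pushes every singular value of X toward 1.
        X = 1.5 * X - 0.5 * X @ (X.T @ X)
    return X

rng = np.random.default_rng(0)
G = rng.standard_normal((8, 4))   # hypothetical tall gradient matrix
W = rng.standard_normal((8, 4))   # hypothetical weight matrix
eta = 0.02                        # hypothetical learning rate

Q = polar_svd(G)                  # exact polar factor
Q_ns = polar_newton_schulz(G)     # matmul-only approximation
W_new = W - eta * Q               # W <- W - eta * U Vᵀ

print("max |Q_ns - Q| =", np.abs(Q_ns - Q).max())
```

The polar factor keeps the singular vectors of G and sets every singular value to 1, which is why the columns of `Q` come out orthonormal.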

Aurora promises a more principled fix. It remains to be seen how broadly that fix will translate across model families.

U-NorMuon corrects this by normalizing tall matrix rows to have norm √(n/m) instead of 1.
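A short sketch shows why √(n/m) is the natural target. It assumes a tall matrix has m rows and n columns with m > n, a convention the article does not spell out. Under that assumption, a polar factor has orthonormal columns, so its squared row norms must sum to n. Unit-norm rows would sum to m instead. The helper `normalize_rows` is hypothetical and only rescales rows; it is not the full NorMuon or U-NorMuon update.

```python
import numpy as np

def normalize_rows(X, target_norm, eps=1e-12):
    """Rescale every row of X to the given Euclidean norm (hypothetical helper)."""
    norms = np.linalg.norm(X, axis=1, keepdims=True)
    return X * (target_norm / (norms + eps))

rng = np.random.default_rng(0)
m, n = 8, 4                                   # assumed convention: m rows > n columns
G = rng.standard_normal((m, n))
U, _, Vt = np.linalg.svd(G, full_matrices=False)
Q = U @ Vt                                    # polar factor, orthonormal columns

unit_rows = normalize_rows(Q, 1.0)            # rows forced to norm 1
scaled_rows = normalize_rows(Q, np.sqrt(n / m))  # rows forced to norm sqrt(n/m)

# Squared Frobenius norms: the polar factor carries total "energy" n.
print("polar factor:      ", np.sum(Q**2))            # approximately n
print("unit-norm rows:    ", np.sum(unit_rows**2))    # approximately m
print("sqrt(n/m)-norm rows:", np.sum(scaled_rows**2)) # approximately n
```

With rows at norm √(n/m), the update has the same overall scale as a true polar factor, whereas unit-norm rows inflate it by a factor of m/n in squared norm.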

In experiments at 340M scale, U-NorMuon outperforms both Muon and standard NorMuon and completely eliminates the neuron death phenomenon: leverage scores stay approximately isotropic throughout training. Crucially, U-NorMuon propagates this benefit to layers it doesn’t directly touch: keeping up/gate rows alive ensures isotropic gradient flow into the down-projection, stabilizing its column leverage without any direct intervention.
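The article does not define leverage scores, so the sketch below assumes the standard linear-algebra definition: the row leverage scores of a matrix are the squared row norms of U in its thin SVD. For an m × n matrix of full column rank they sum to n, so "isotropic" means every score sits near n/m. Treating a dead neuron as an all-zero row is also an assumption made for illustration.

```python
import numpy as np

def row_leverage_scores(W):
    """Row leverage scores: squared row norms of U in the thin SVD W = U Σ Vᵀ."""
    U, _, _ = np.linalg.svd(W, full_matrices=False)
    return np.sum(U**2, axis=1)

rng = np.random.default_rng(0)
m, n = 8, 4                      # assumed convention: m rows (neurons), n columns
W = rng.standard_normal((m, n))  # hypothetical weight matrix

W_dead = W.copy()
W_dead[0] = 0.0                  # illustrative "dead" neuron: an all-zero row

scores = row_leverage_scores(W)
scores_dead = row_leverage_scores(W_dead)

print("isotropic target n/m:", n / m)
print("healthy matrix:      ", np.round(scores, 3))
print("with a dead row:     ", np.round(scores_dead, 3))
```

Because the scores always sum to n, a row whose score collapses to zero hands its share to the remaining rows, which is one way to read the drift away from isotropy.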

However, U-NorMuon still has a problem: it forcefully overrides the polar factor with uniform row norms, sacrificing polar factor precision, which is both theoretically undesirable and empirically costly in the Muon framework (the paper shows that Muon achieves monotonically lower loss with more precise orthogonalization).
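The trade-off can be seen numerically in a small sketch. It takes an exact polar factor, forces its rows to the uniform norm √(n/m), and measures how far the columns drift from orthonormality. The sizes and the error metric `orthogonality_error` are assumptions for illustration, and the row rescaling stands in for U-NorMuon's normalization rather than reproducing it.

```python
import numpy as np

def orthogonality_error(X):
    """Frobenius distance between XᵀX and the identity (0 for orthonormal columns)."""
    n = X.shape[1]
    return np.linalg.norm(X.T @ X - np.eye(n))

rng = np.random.default_rng(0)
m, n = 8, 4                                   # assumed convention: m rows > n columns
G = rng.standard_normal((m, n))
U, _, Vt = np.linalg.svd(G, full_matrices=False)
Q = U @ Vt                                    # exact polar factor

# Override the row norms with the uniform value sqrt(n/m).
row_norms = np.linalg.norm(Q, axis=1, keepdims=True)
Q_uniform = Q * (np.sqrt(n / m) / row_norms)

err_before = orthogonality_error(Q)
err_after = orthogonality_error(Q_uniform)

print("orthogonality error, polar factor:     ", err_before)
print("orthogonality error, uniform row norms:", err_after)
```

The rescaled matrix has perfectly uniform rows but no longer has orthonormal columns, which is the loss of polar factor precision described above.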

Why this matters

Aurora shows that a “leverage-aware” tweak can patch a hidden flaw in Muon, an optimizer many of us rely on. It stops the silent death of a sizable share of MLP neurons, and the team supports it with a 1.1B-parameter pretraining run and a claimed state-of-the-art result on the modded-nanoGPT speedrun benchmark. The open-source release lets us inspect the changes directly.

U-NorMuon, meanwhile, normalizes the rows of tall matrices to norm √(n/m) rather than 1. In 340M-scale tests, that simple adjustment beats both Muon and standard NorMuon while eliminating neuron death; leverage scores stay roughly isotropic throughout training.

What remains unclear is whether these gains persist beyond the specific benchmarks reported, or how they translate to larger, production‑level models. We also lack details on computational overhead or stability across diverse architectures. For developers and researchers, Aurora and U‑NorMuon merit a closer look, but we should temper enthusiasm until broader evaluations confirm the reported improvements.
