NVIDIA launches Nemotron‑Labs‑TwoTower diffusion model with 128‑expert MoE
Why does this matter? NVIDIA just opened the doors on Nemotron‑Labs‑TwoTower, a diffusion‑based language model that runs on an existing autoregressive backbone. The model ships with open weights under the Nemotron Open Model License, aiming squarely at the throughput ceiling that plagues traditional token‑by‑token generation.
While most diffusion models juggle token cleaning and denoising in a single network, TwoTower splits the work into a frozen AR context tower and a trained denoiser tower. The result? Roughly 98.7% of the autoregressive baseline's benchmark quality, but with 2.42× higher wall-clock throughput when run on two H100 GPUs (γ = 0.8, S = 16).
Each tower comprises 52 layers: 23 Mamba-2, six self-attention, and 23 mixture-of-experts. Together the two towers total about 60 billion parameters, though only around 3 billion are active per token. The denoiser saw about 2.1 trillion tokens during training; the backbone itself was pretrained on 25 trillion. One checkpoint can switch between diffusion, mock-AR, and pure AR decoding modes, offering a flexible testbed for researchers chasing faster text generation.
The MoE uses 128 routable experts, of which 6 activate, plus 2 shared experts.
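The routing pattern described here can be sketched in a few lines of Python. This is an illustrative sketch, not NVIDIA's implementation: the function name `route_token`, the softmax over the selected logits, and the random router logits are all assumptions made for the example. Only the counts (128 routable experts, 6 active, 2 shared) come from the article.

```python
import math
import random

NUM_ROUTED = 128  # routable experts (from the article)
TOP_K = 6         # routed experts activated per token (from the article)
NUM_SHARED = 2    # shared experts that run for every token (from the article)


def route_token(router_logits, top_k=TOP_K):
    """Return (expert_index, weight) pairs for the top_k routed experts.

    Assumption: weights are a softmax over the selected logits only.
    The real router may normalise differently.
    """
    if len(router_logits) != NUM_ROUTED:
        raise ValueError("expected one logit per routable expert")
    ranked = sorted(range(NUM_ROUTED), key=lambda i: router_logits[i], reverse=True)
    chosen = ranked[:top_k]
    peak = max(router_logits[i] for i in chosen)
    exps = [math.exp(router_logits[i] - peak) for i in chosen]
    total = sum(exps)
    return [(i, e / total) for i, e in zip(chosen, exps)]


# Hypothetical router output for one token.
rng = random.Random(0)
logits = [rng.gauss(0.0, 1.0) for _ in range(NUM_ROUTED)]
routed = route_token(logits)

# Shared experts bypass the router, so they add to the active count.
active_count = len(routed) + NUM_SHARED
print(active_count, "experts active for this token")
```

Under these assumptions, eight expert networks run per token in each MoE layer, while the other 122 routable experts stay idle.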
Both towers start as copies of the same backbone checkpoint. The denoiser was trained on ~2.1T tokens, a fraction of the backbone’s 25T-token pretraining.
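To put the reported figures in proportion, here is a small back-of-the-envelope calculation. The inputs are the approximate numbers quoted in the article, so the ratios are rough; the variable names are only illustrative.

```python
# Layer mix per tower, as reported in the article.
layers_per_tower = {"mamba2": 23, "self_attention": 6, "moe": 23}
total_layers = sum(layers_per_tower.values())

# Approximate parameter counts, in billions.
total_params_b = 60.0
active_params_b = 3.0
active_share = active_params_b / total_params_b

# Approximate training budgets, in trillions of tokens.
denoiser_tokens_t = 2.1
backbone_tokens_t = 25.0
token_share = denoiser_tokens_t / backbone_tokens_t

# Routed experts that fire per token in an MoE layer.
routed_share = 6 / 128

print(f"layers per tower: {total_layers}")
print(f"active parameter share: {active_share:.1%}")
print(f"denoiser tokens vs. backbone pretraining: {token_share:.1%}")
print(f"routed experts active: {routed_share:.1%}")
```

On these numbers, the denoiser's training budget is under a tenth of the backbone's, and about one parameter in twenty is active for any given token.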
How the Two Towers Work
The AR context tower runs causally over the prompt and committed tokens. It produces a per-layer KV cache and final Mamba-2 states, and it preserves the backbone's autoregressive capability.
The diffusion denoiser tower refines noisy blocks. Within a block, it uses bidirectional in-block attention.
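The difference between causal attention and bidirectional in-block attention can be shown with a toy attention mask. This is a minimal sketch under stated assumptions, not the model's actual code: it assumes block positions may attend to every committed context position, it puts both patterns in one sequence for comparison, and the helper names `attention_mask` and `render` are hypothetical.

```python
def attention_mask(context_len, block_len, bidirectional_block):
    """Build mask[q][k]: True if query position q may attend to key position k.

    Positions 0..context_len-1 are committed context tokens; the rest form
    the block currently being denoised.
    """
    total = context_len + block_len
    mask = []
    for q in range(total):
        row = []
        for k in range(total):
            causal = k <= q
            both_in_block = q >= context_len and k >= context_len
            row.append(causal or (bidirectional_block and both_in_block))
        mask.append(row)
    return mask


def render(mask):
    """Draw the mask with 'x' for allowed and '.' for blocked."""
    return "\n".join("".join("x" if ok else "." for ok in row) for row in mask)


# Three context tokens followed by a two-token block.
causal_mask = attention_mask(3, 2, bidirectional_block=False)
denoiser_mask = attention_mask(3, 2, bidirectional_block=True)

print("causal:")
print(render(causal_mask))
print("bidirectional in-block:")
print(render(denoiser_mask))
```

In the second mask, the first block token can see the second one, which a purely causal pass never allows. The context rows are the same in both cases.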
Why this matters
We see NVIDIA opening a new route around the classic token-by-token bottleneck by pairing a frozen autoregressive backbone with a diffusion denoiser. The TwoTower design copies the Nemotron-3-Nano-30B-A3B checkpoint for both towers; its mixture-of-experts layers draw on 128 routable experts, of which only six fire alongside two shared ones. The denoiser's training on roughly 2.1 trillion tokens, far less than the backbone's 25 trillion, suggests a comparatively light training effort, yet the performance trade-offs beyond the reported benchmarks are still unclear.
The open-weight release under the Nemotron Open Model License invites developers to experiment, but the practical impact on throughput outside the reported two-H100 setup remains to be measured. For founders eyeing faster text generation, the model offers a tangible alternative to pure AR pipelines, though integration complexity could offset gains. Researchers may appreciate the modularity of separating backbone knowledge from diffusion-based decoding, yet the efficacy of the limited active experts in diverse tasks is not yet proven.
In short, the approach is promising, but its advantages over existing methods need concrete validation.