NVIDIA launches Nemotron‑Labs‑TwoTower diffusion model with 128‑expert MoE

By AI Daily Post • Edited by Brian Petersen, Editor-in-Chief

July 1, 2026 • 2 min read

NVIDIA has released Nemotron‑Labs‑TwoTower, a diffusion‑based language model built on an existing autoregressive backbone. The model ships with open weights under the Nemotron Open Model License and targets the throughput ceiling of traditional token‑by‑token generation.

Most diffusion models combine context handling and denoising in a single network; TwoTower splits the work into a frozen AR context tower and a trained denoiser tower. The reported result is roughly 98.7% of the autoregressive baseline’s benchmark quality with 2.42× higher wall‑clock throughput on two H100 GPUs (at the reported settings γ = 0.8, S = 16).

Each tower comprises 52 layers: 23 Mamba‑2, six self‑attention, and 23 mixture‑of‑experts. The towers total about 60 billion parameters, though only around 3 billion are active per token. The denoiser saw about 2.1 trillion tokens during training; the backbone itself was pretrained on 25 trillion. One checkpoint can switch between diffusion, mock‑AR, and pure AR decoding modes, which gives researchers a flexible testbed for faster text generation.
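As a quick sanity check on those figures, the short Python sketch below adds up the layer counts and compares the two training budgets. The variable names are purely illustrative; the numbers are the ones quoted above.

```python
# Layer counts per tower, as quoted in the article.
layers = {"mamba2": 23, "self_attention": 6, "moe": 23}
total_layers = sum(layers.values())

# Training budgets in trillions of tokens, as quoted in the article.
denoiser_tokens_t = 2.1
backbone_tokens_t = 25.0
token_share = denoiser_tokens_t / backbone_tokens_t

print(f"layers per tower: {total_layers}")
print(f"denoiser tokens as share of backbone pretraining: {token_share:.1%}")
```

The layer counts sum to the stated 52, and the denoiser's training budget works out to well under a tenth of the backbone's.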

The MoE layers use 128 routable experts, six of which are activated per token, plus two shared experts. Both towers start as copies of the same backbone checkpoint.
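To make the expert counts concrete, here is a minimal Python sketch of top-k routing with 128 routable experts, six selected per token, and two always-on shared experts. The function name `route_token`, the random router scores, and the softmax weighting over the selected experts are assumptions for illustration only; the article does not describe how NVIDIA's router scores or weights experts.

```python
import math
import random

NUM_ROUTED = 128  # routable experts (from the article)
TOP_K = 6         # routed experts activated per token (from the article)
NUM_SHARED = 2    # shared experts that always run (from the article)


def route_token(router_logits, top_k=TOP_K):
    """Pick the top_k highest-scoring experts and weight them.

    Hypothetical helper: the softmax over the selected logits is an
    assumed weighting scheme, not a documented detail of the model.
    """
    if len(router_logits) < top_k:
        raise ValueError("need at least top_k router logits")
    ranked = sorted(range(len(router_logits)),
                    key=lambda i: router_logits[i], reverse=True)
    top = ranked[:top_k]
    peak = max(router_logits[i] for i in top)
    exps = [math.exp(router_logits[i] - peak) for i in top]
    total = sum(exps)
    return [(i, e / total) for i, e in zip(top, exps)]


# Made-up router scores for a single token.
rng = random.Random(0)
logits = [rng.gauss(0.0, 1.0) for _ in range(NUM_ROUTED)]

chosen = route_token(logits)
active_experts = len(chosen) + NUM_SHARED
print(f"{active_experts} of {NUM_ROUTED + NUM_SHARED} experts run for this token")
```

Only a small slice of the expert pool does any work for a given token, which is how a large total parameter count can coexist with a small active parameter count.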

How the Two Towers Work

The AR context tower runs causally over the prompt and committed tokens, producing a per-layer KV cache and final Mamba-2 states. Because it is frozen, it preserves the backbone’s autoregressive capability.

The diffusion denoiser tower refines noisy blocks, using bidirectional attention within each block.
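The difference between the two attention patterns can be sketched as a visibility mask. The Python below is an illustration under stated assumptions, not NVIDIA's implementation: the function `two_tower_mask` is hypothetical, the tiny sizes are chosen for readability, and it assumes that every position in the noisy block can see all committed context as well as every other position in the block.

```python
def two_tower_mask(num_context, block_size):
    """Return mask[q][k] == True when query position q may attend to key k.

    Rows for committed context tokens are causal (AR context tower).
    Rows for the noisy block see all context plus the whole block
    (bidirectional in-block attention; context visibility is assumed).
    """
    total = num_context + block_size
    mask = []
    for q in range(total):
        if q < num_context:
            row = [k <= q for k in range(total)]  # causal: no look-ahead
        else:
            row = [True] * total                  # block: full visibility
        mask.append(row)
    return mask


mask = two_tower_mask(num_context=3, block_size=2)
for row in mask:
    print("".join("x" if visible else "." for visible in row))
# x....
# xx...
# xxx..
# xxxxx
# xxxxx
```

The top three rows show the causal pattern for committed tokens, which never look ahead into the block; the bottom two rows show the block positions attending in both directions.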

Why this matters

NVIDIA is opening a new route around the classic token‑by‑token bottleneck by pairing a frozen autoregressive backbone with a diffusion denoiser. The TwoTower design copies the Nemotron‑3‑Nano‑30B‑A3B checkpoint for both towers; its mixture‑of‑experts layers route among 128 experts, with only six firing alongside two shared ones. The denoiser’s training on roughly 2.1 trillion tokens, far less than the backbone’s 25 trillion, suggests a lighter fine‑tuning effort, yet the performance trade‑offs are still unclear.

The open‑weight release under the Nemotron Open Model License invites developers to experiment, but the practical impact on real‑world throughput beyond the reported two‑H100 setup remains to be measured. For founders eyeing faster text generation, the model offers a tangible alternative to pure AR pipelines, though integration complexity could offset gains. Researchers may appreciate the modularity of separating backbone knowledge from diffusion‑based decoding, yet the efficacy of the limited active experts in diverse tasks is not yet proven.

In short, the approach is promising, but its advantages over existing methods need concrete validation.