
Nvidia's Nemotron 3 uses Mamba hybrid, 31.6B params, 3B active per step


Nvidia's latest language model, Nemotron 3, takes a different route from the pure-Transformer designs that dominate most open-source releases. By weaving a Mamba-style component into its core, the team trimmed the active footprint dramatically: of 31.6 billion total parameters, only 3 billion are active on any given processing step. The shift isn't just a curiosity; it shows up on the Artificial Analysis Index benchmark, where Nemotron 3 holds its own against established contenders like gpt-oss-20B and Qwen3-30B in raw accuracy.

Yet the headline numbers tell a fuller story—throughput climbs noticeably, suggesting the model can handle more tokens per second without sacrificing quality. For developers wrestling with the trade‑off between model size and real‑time performance, those results raise a clear question: can a hybrid architecture deliver the efficiency needed for AI agents that must run at scale?

Hybrid architecture boosts efficiency

The Nano model has 31.6 billion total parameters, but only 3 billion are active per processing step. On the Artificial Analysis Index benchmark, the open-source model rivals gpt-oss-20B and Qwen3-30B in accuracy but delivers significantly higher token throughput. However, according to Artificial Analysis, it requires 160 million tokens for a test run, far more than runner-up Qwen3-VL at 110 million.
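
To see how a model can store 31.6 billion parameters yet compute with only about 3 billion per step, here is a back-of-the-envelope sketch of sparse mixture-of-experts accounting, the standard way to get a large total-versus-active gap. Every number below is a hypothetical stand-in, chosen only so the totals land near the reported figures; Nvidia has not published Nemotron 3's layer composition here.

```python
# Back-of-the-envelope MoE parameter accounting. All numbers are
# hypothetical, picked only so the totals land near the article's
# 31.6B total / ~3B active figures; not Nemotron 3's real config.

def moe_param_counts(shared, n_layers, n_experts, top_k, d_model, d_ff):
    expert = 2 * d_model * d_ff                     # up- and down-projection per expert
    total = shared + n_layers * n_experts * expert  # every expert is stored
    active = shared + n_layers * top_k * expert     # only top_k experts run per token
    return total, active

total, active = moe_param_counts(
    shared=1.6e9,      # embeddings + attention/Mamba mixers (assumed)
    n_layers=32, n_experts=56, top_k=3,
    d_model=4096, d_ff=2048)

print(f"total:  {total / 1e9:.1f}B parameters")   # ~31.7B stored
print(f"active: {active / 1e9:.1f}B per token")   # ~3.2B actually computed
```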

Nvidia introduces two architectural changes for the larger Super and Ultra models. The first, LatentMoE, addresses the memory bandwidth cost of routing tokens directly to expert networks in standard MoE models. The new method projects tokens into a compressed, latent representation before processing.
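
As a rough sketch of that idea, the toy PyTorch module below projects tokens down to a smaller latent width, routes and runs the experts there, and projects back up afterwards. The class name, dimensions, and top-k routing are our assumptions for illustration, not Nvidia's published design.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LatentMoESketch(nn.Module):
    """Toy latent-routed MoE: compress tokens before expert processing.
    All dimensions and the routing scheme are illustrative assumptions."""

    def __init__(self, d_model=1024, d_latent=256, n_experts=32, top_k=2):
        super().__init__()
        self.down = nn.Linear(d_model, d_latent)     # compress to latent space
        self.up = nn.Linear(d_latent, d_model)       # expand back afterwards
        self.router = nn.Linear(d_latent, n_experts)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_latent, 4 * d_latent), nn.GELU(),
                          nn.Linear(4 * d_latent, d_latent))
            for _ in range(n_experts))
        self.top_k = top_k

    def forward(self, x):                            # x: (batch, seq, d_model)
        z = self.down(x)                             # routing happens in the cheap space
        weights = F.softmax(self.router(z), dim=-1)
        topw, topi = weights.topk(self.top_k, dim=-1)
        topw = topw / topw.sum(dim=-1, keepdim=True) # renormalise the kept weights
        out = torch.zeros_like(z)
        for k in range(self.top_k):                  # plain loops for clarity, not speed
            for e, expert in enumerate(self.experts):
                mask = topi[..., k] == e             # tokens sent to expert e in slot k
                if mask.any():
                    out[mask] += topw[..., k][mask].unsqueeze(-1) * expert(z[mask])
        return self.up(out)

layer = LatentMoESketch()
tokens = torch.randn(2, 16, 1024)
print(layer(tokens).shape)                           # torch.Size([2, 16, 1024])
```

Because the router and experts operate at the latent width rather than the full model width, the weights moved per routed token shrink accordingly, which is the bandwidth saving the paragraph above describes.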

Nvidia says this drastically increases expert count and active experts per token without slowing inference. The larger models also use multi-token prediction (MTP), where models predict several future tokens simultaneously during training rather than just the next one.
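
Below is a minimal sketch of what such an objective can look like, assuming one extra output head per future offset; the article does not describe Nemotron 3's exact head design or loss weighting.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def multi_token_loss(hidden, heads, targets):
    """Toy MTP objective: head i predicts the token (i+1) steps ahead.

    hidden:  (batch, seq, d_model) final hidden states from the trunk
    heads:   list of nn.Linear(d_model, vocab), one per future offset
    targets: (batch, seq) ground-truth token ids
    """
    seq = hidden.size(1)
    losses = []
    for i, head in enumerate(heads):
        offset = i + 1
        logits = head(hidden[:, : seq - offset])     # positions that have a label
        labels = targets[:, offset:]                 # shifted by this head's offset
        losses.append(F.cross_entropy(
            logits.reshape(-1, logits.size(-1)), labels.reshape(-1)))
    return torch.stack(losses).mean()

# Tiny smoke test with made-up sizes.
d_model, vocab, horizon = 64, 1000, 4
heads = [nn.Linear(d_model, vocab) for _ in range(horizon)]
hidden = torch.randn(2, 32, d_model)
targets = torch.randint(0, vocab, (2, 32))
print(multi_token_loss(hidden, heads, targets))
```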

Related Topics: #Nvidia #Nemotron 3 #Mamba #Hybrid architecture #Artificial Analysis Index #gpt-oss-20B #Qwen3-30B #LatentMoE #multi-token prediction

Nvidia's Nemotron 3 family arrives with a hybrid Mamba-Transformer design that claims to keep long context windows affordable. The Nano model, already on the market, packs 31.6 billion parameters but activates only three billion per step, a figure that translates into noticeably higher token throughput on the Artificial Analysis Index benchmark. It matches open-source peers gpt-oss-20B and Qwen3-30B in accuracy while moving faster.

Yet the real test will be whether the efficiency gains hold across diverse workloads beyond the benchmark. Super and Ultra, slated for release in the first half of 2026, will extend the lineup, but details on their performance remain sparse. The hybrid architecture is presented as a boost to efficiency, but it is unclear how much the reduced active parameter count will affect model capacity in practice.

A hybrid approach could reshape how agents handle extended tasks, though adoption will depend on developer confidence. Will developers trust a model that keeps most parameters dormant? For now, the numbers speak for themselves, and the trade‑off between active parameters and speed invites cautious optimism.

Common Questions Answered

How does Nemotron 3’s Mamba‑Transformer hybrid architecture affect its active parameter count?

Nemotron 3 integrates a Mamba-style component with a traditional Transformer, which reduces the active footprint to only 3 billion parameters per processing step despite having 31.6 billion total parameters. This selective activation enables more efficient computation while aiming to preserve model capacity.

What performance advantages does Nemotron 3 show on the Artificial Analysis Index benchmark?

On the Artificial Analysis Index benchmark, Nemotron 3 matches the accuracy of open‑source models like gpt‑oss‑20B and Qwen3‑30B but delivers significantly higher token throughput. However, it consumes 160 million tokens for a test run, which is more than the 110 million tokens required by the runner‑up Qwen3‑VL.

Why is the token throughput of Nemotron 3 considered higher than its competitors?

Because only 3 billion of its 31.6 billion parameters are active at any step, Nemotron 3 processes tokens faster than models that activate a larger portion of their parameters. This efficiency translates into noticeably higher token throughput on benchmark tests.

What claim does Nvidia make about Nemotron 3’s ability to handle long context windows?

Nvidia asserts that the hybrid Mamba‑Transformer design keeps long context windows affordable by limiting active parameters per step, which reduces computational load. This design aims to maintain performance on extended sequences without the memory overhead typical of pure‑Transformer models.
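
To make the long-context claim concrete, here is a generic back-of-the-envelope comparison using assumed dimensions, not Nvidia's published figures: a Transformer's key-value cache grows linearly with context length, while a Mamba-style state-space layer carries a fixed-size recurrent state regardless of sequence length.

```python
# Rough memory comparison with generic, assumed dimensions (fp16 cache
# entries, 32 layers); none of these are Nemotron 3's published internals.

def kv_cache_bytes(ctx_len, n_layers=32, n_kv_heads=8, head_dim=128, dtype_bytes=2):
    return 2 * n_layers * n_kv_heads * head_dim * ctx_len * dtype_bytes  # 2x: keys + values

def ssm_state_bytes(n_layers=32, d_model=4096, state_dim=16, dtype_bytes=2):
    return n_layers * d_model * state_dim * dtype_bytes  # fixed, context-independent

for ctx in (8_192, 131_072, 1_048_576):
    print(f"{ctx:>9} tokens | KV cache {kv_cache_bytes(ctx) / 2**30:6.1f} GiB"
          f" | SSM state {ssm_state_bytes() / 2**20:4.1f} MiB")
```

At million-token contexts the cache dwarfs the fixed state, which is the overhead a hybrid design tries to sidestep by replacing some attention layers with state-space layers.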