Microsoft's Azure ND GB300 VMs hit 1.1M tokens/sec, 50% more GPU memory

Microsoft just posted a new performance record for its Azure ND GB300 virtual machines: more than 1.1 million tokens per second, aggregated across a single rack-scale inference run. Why does that matter? In the world of large-language-model serving, every extra token per second can translate into lower latency for end-users and tighter cost margins for cloud operators.

While the headline number grabs attention, the test behind it is equally telling. Microsoft used the Llama 2 70-billion-parameter model, running in FP4 precision, as part of the MLPerf Inference v5.1 suite. Eighteen ND GB300 v6 instances, spanning a single NVIDIA GB300 NVL72 rack, were run together to generate the result.

The setup mirrors a realistic deployment scenario where multiple GPUs collaborate to handle massive workloads. It also hints at how Azure is positioning its hardware‑focused VMs against competing cloud offerings. The details of the VM’s configuration and the exact methodology follow.

---

The VM is optimised for inference workloads, featuring 50% more GPU memory and a 16% higher TDP (Thermal Design Power) than its GB200 predecessor. To demonstrate the performance gains, Microsoft ran the Llama 2 70B benchmark (in FP4 precision) from MLPerf Inference v5.1 on each of the 18 ND GB300 v6 virtual machines in one NVIDIA GB300 NVL72 domain, using NVIDIA TensorRT-LLM as the inference engine.
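
For readers unfamiliar with TensorRT-LLM, the sketch below mirrors NVIDIA's documented quickstart for its high-level Python API. The tiny model is an illustrative stand-in; the benchmark's 70B model, FP4 quantization, and multi-GPU parallelism all involve additional configuration not shown here.

```python
# Minimal sketch of TensorRT-LLM's high-level Python API, mirroring NVIDIA's
# documented quickstart. The small model below is an illustrative stand-in;
# the benchmark's Llama 2 70B, FP4 quantization, and multi-GPU parallelism
# all require additional configuration not shown here.
from tensorrt_llm import LLM, SamplingParams

prompts = [
    "Hello, my name is",
    "The capital of France is",
]
sampling_params = SamplingParams(temperature=0.8, top_p=0.95)

llm = LLM(model="TinyLlama/TinyLlama-1.1B-Chat-v1.0")
outputs = llm.generate(prompts, sampling_params)

for output in outputs:
    print(f"Prompt: {output.prompt!r} -> {output.outputs[0].text!r}")
```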

"One NVL72 rack of Azure ND GB300 v6 achieved an aggregated 1,100,000 tokens/s," said Microsoft. "This is a new record in AI inference, beating our own previous record of 865,000 tokens/s on one NVIDIA GB200 NVL72 rack with the ND GB200 v6 VMs." Since the system contains 72 Blackwell Ultra GPUs, the performance roughly translates to ~15,200 tokens/sec/GPU.

Microsoft's Azure ND GB300 v6 VMs posted an aggregate 1.1 million tokens per second on Meta's Llama 2 70B model. The figure comes from an MLPerf Inference v5.1 run using FP4 precision across 18 identical instances spanning one NVIDIA GB300 NVL72 rack of Blackwell Ultra GPUs. The new hardware delivers 50% more GPU memory and a 16% higher TDP than the prior GB200 generation.

Satya Nadella called the result an industry record, crediting co‑innovation with NVIDIA and production‑scale expertise. The VM is billed as inference‑optimized, but the benchmark focuses on a single, large language model; performance on smaller or differently structured models has not been disclosed. Likewise, the impact of the increased thermal design power on operational costs remains unquantified.

Still, the numbers demonstrate a tangible step forward for Azure’s AI‑focused infrastructure. Whether the memory boost and power headroom will translate into consistent gains across diverse workloads is still uncertain. For now, the Azure ND GB300 stands as a high‑throughput option for customers targeting large‑scale LLM inference.

Common Questions Answered

What performance record did Microsoft’s Azure ND GB300 VM achieve on the Llama 2 70‑billion‑parameter model?

Microsoft reported that one NVIDIA GB300 NVL72 rack of 18 Azure ND GB300 v6 VMs processed an aggregate of more than 1.1 million tokens per second when running the Llama 2 70B model in FP4 precision. The record was measured with the MLPerf Inference v5.1 benchmark.

How does the Azure ND GB300 VM’s GPU memory and TDP compare to previous Azure offerings?

The ND GB300 v6 VM provides 50% more GPU memory and a 16% higher Thermal Design Power (TDP) than the previous-generation ND GB200 v6. These hardware upgrades, built on NVIDIA's Blackwell Ultra GB300 GPUs in an NVL72 configuration, enable higher throughput for large-language-model serving.
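
As a rough sanity check, the commonly cited per-GPU capacities, which are assumptions here rather than figures from Microsoft's announcement, line up with that 50% claim:

```python
# Rack-level memory math. Per-GPU capacities are assumptions based on
# commonly cited specs, not figures from Microsoft's announcement.
GB300_HBM_PER_GPU_GB = 288  # assumed HBM3e per Blackwell Ultra GPU
GB200_HBM_PER_GPU_GB = 192  # assumed HBM3e per GB200 GPU
GPUS_PER_RACK = 72

uplift = GB300_HBM_PER_GPU_GB / GB200_HBM_PER_GPU_GB - 1
print(f"Per-GPU memory uplift: {uplift:.0%}")  # 50%, matching the stated figure
print(f"HBM per NVL72 rack: ~{GB300_HBM_PER_GPU_GB * GPUS_PER_RACK / 1024:.1f} TiB")
```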

Which inference engine and precision mode were used in the MLPerf benchmark for the Azure ND GB300 VM?

The benchmark employed NVIDIA TensorRT-LLM as the inference engine and ran the Llama 2 70B model in FP4 (four-bit floating-point) precision. This combination allowed the rack of VMs to reach the reported token-per-second rate while still meeting the benchmark's accuracy requirements.
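
A quick, illustrative footprint comparison shows why four-bit weights matter for throughput (real deployments also budget for KV cache and activations):

```python
# Approximate weight footprint of a 70B-parameter model at different precisions.
# Illustrative only: ignores KV cache, activations, and quantization-scale overhead.
PARAMS = 70e9
for name, bits in [("FP16", 16), ("FP8", 8), ("FP4", 4)]:
    print(f"{name}: ~{PARAMS * bits / 8 / 1e9:.0f} GB of weights")
# FP16: ~140 GB, FP8: ~70 GB, FP4: ~35 GB. Smaller weights leave more HBM
# for KV cache, which is what batch-heavy inference throughput depends on.
```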

Why is achieving 1.1 million tokens per second significant for cloud operators and end‑users?

Higher token throughput reduces latency for end‑users, delivering faster responses from large‑language‑model applications. For cloud operators, the efficiency translates into tighter cost margins by handling more inference requests per unit of hardware.
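
As a back-of-envelope illustration, with an assumed average response length rather than a benchmark figure:

```python
# Back-of-envelope serving capacity from the reported aggregate throughput.
# The per-response token count is an assumption, not a benchmark figure.
AGGREGATE_TOKENS_PER_SEC = 1_100_000  # reported NVL72 rack aggregate
TOKENS_PER_RESPONSE = 250             # assumed average response length

responses_per_sec = AGGREGATE_TOKENS_PER_SEC / TOKENS_PER_RESPONSE
print(f"~{responses_per_sec:,.0f} responses/s per rack")    # ~4,400
print(f"~{responses_per_sec * 3600:,.0f} responses/hour")   # ~15,840,000
```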