Microsoft's Azure ND GB300 VM hits 1.1 M tokens/sec, 50% more GPU memory
Microsoft just posted a new performance record for its Azure ND GB300 virtual machine: more than 1.1 million aggregate tokens per second in a single rack-scale inference run. Why does that matter? In the world of large‑language‑model serving, every extra token per second can translate into lower latency for end‑users and tighter cost margins for cloud operators.
While the headline number grabs attention, the test behind it is equally telling. Microsoft used the Llama 2 70‑billion‑parameter model, running in FP4 precision, as part of the MLPerf Inference v5.1 suite. Eighteen ND GB300 v6 instances were combined within a single NVIDIA GB300 NVL72 rack‑scale system to generate the result.
The setup mirrors a realistic deployment scenario where multiple GPUs collaborate to handle massive workloads. It also hints at how Azure is positioning its hardware‑focused VMs against competing cloud offerings. The details of the VM’s configuration and the exact methodology follow.
---
The VM is optimized for inference workloads, featuring 50% more GPU memory and a 16% higher TDP (Thermal Design Power). To demonstrate the performance gains, Microsoft ran the Llama2 70B benchmark (in FP4 precision) from MLPerf Inference v5.1 on each of the 18 ND GB300 v6 virtual machines that make up one NVIDIA GB300 NVL72 domain, using NVIDIA TensorRT-LLM as the inference engine.
"One NVL72 rack of Azure ND GB300 v6 achieved an aggregated 1,100,000 tokens/s," said Microsoft. "This is a new record in AI inference, beating our own previous record of 865,000 tokens/s on one NVIDIA GB200 NVL72 rack with the ND GB200 v6 VMs." Since the system contains 72 Blackwell Ultra GPUs, the performance roughly translates to ~15,200 tokens/sec/GPU.
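The per-GPU figure above is a simple average over the rack, which can be sketched as back-of-envelope arithmetic (all numbers are from the article; the per-GPU value is a derived average, not a measured per-GPU result):

```python
# Derive the per-GPU throughput implied by the aggregate record.
aggregate_tokens_per_sec = 1_100_000   # one Azure ND GB300 NVL72 rack
gpus_per_rack = 72                     # Blackwell Ultra GPUs in an NVL72 rack
previous_record = 865_000              # prior GB200 NVL72 result

per_gpu = aggregate_tokens_per_sec / gpus_per_rack
speedup = aggregate_tokens_per_sec / previous_record

print(f"~{per_gpu:,.0f} tokens/s per GPU")        # ~15,278 tokens/s per GPU
print(f"{speedup:.2f}x the previous record")       # 1.27x the previous record
```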
Microsoft's Azure ND GB300 VMs posted 1.1 million tokens per second on Meta's Llama 2 70B model. The figure comes from an MLPerf Inference v5.1 run using FP4 precision across 18 identical instances, which together span one NVIDIA GB300 NVL72 rack of Blackwell Ultra GPUs. The new hardware delivers 50% more GPU memory and a 16% higher TDP than prior offerings.
Satya Nadella called the result an industry record, crediting co‑innovation with NVIDIA and production‑scale expertise. The VM is billed as inference‑optimized, but the benchmark focuses on a single, large language model; performance on smaller or differently structured models has not been disclosed. Likewise, the impact of the increased thermal design power on operational costs remains unquantified.
Still, the numbers demonstrate a tangible step forward for Azure’s AI‑focused infrastructure. Whether the memory boost and power headroom will translate into consistent gains across diverse workloads is still uncertain. For now, the Azure ND GB300 stands as a high‑throughput option for customers targeting large‑scale LLM inference.
Further Reading
- Breaking the Million-Token Barrier - Azure ND GB300 v6 Virtual Machines with NVIDIA GB300 NVL72 rack-scale systems achieve unprecedented performance of 1,100,000 tokens/s on Llama2 70B Inference - Microsoft Tech Community
- Reimagining AI at scale: NVIDIA GB300 NVL72 on Azure - Microsoft Tech Community
- Building the future together: Microsoft and NVIDIA announce AI advancements at GTC DC - Azure Blog
- Microsoft Azure upgraded to NVIDIA GB300 'Blackwell Ultra' with 4600 GPUs connected together - TweakTown
- Nvidia and Microsoft to Redefine Data Centre Supercomputers - Data Centre Magazine
Common Questions Answered
What performance record did Microsoft’s Azure ND GB300 VM achieve on the Llama 2 70‑billion‑parameter model?
Microsoft reported that 18 Azure ND GB300 VMs, spanning one NVL72 rack, processed an aggregate of over 1.1 million tokens per second when running the Llama 2 70B model in FP4 precision. The record was measured during an MLPerf Inference v5.1 benchmark run.
How does the Azure ND GB300 VM’s GPU memory and TDP compare to previous Azure offerings?
The ND GB300 VM provides 50% more GPU memory and a 16% higher Thermal Design Power (TDP) than the previous-generation ND GB200 v6 offering. These hardware upgrades, built around NVIDIA's Blackwell Ultra GPUs in the GB300 NVL72 rack‑scale system, enable higher throughput for large‑language‑model serving.
Which inference engine and precision mode were used in the MLPerf benchmark for the Azure ND GB300 VM?
The benchmark employed NVIDIA TensorRT‑LLM as the inference engine and ran the Llama 2 70B model using FP4 (four‑bit floating‑point) precision. This combination allowed the system to achieve the reported token‑per‑second rate while maintaining accuracy.
Why is achieving 1.1 million tokens per second significant for cloud operators and end‑users?
Higher token throughput reduces latency for end‑users, delivering faster responses from large‑language‑model applications. For cloud operators, the efficiency translates into tighter cost margins by handling more inference requests per unit of hardware.
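To make the cost angle concrete, a rough capacity estimate can be derived from the aggregate throughput. The 500-token average response length below is a hypothetical assumption for illustration, not a figure from the benchmark:

```python
# Back-of-envelope serving capacity at the recorded throughput.
rack_tokens_per_sec = 1_100_000   # figure from the article
avg_tokens_per_response = 500     # assumed response length, for illustration only

responses_per_sec = rack_tokens_per_sec / avg_tokens_per_response
print(f"~{responses_per_sec:,.0f} responses/s per rack")  # ~2,200 responses/s per rack
```

Under this assumption, one rack could sustain on the order of a few thousand responses per second; halving the average response length would roughly double that capacity.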