Microsoft's Azure ND GB300 VMs hit 1.1M tokens/sec per rack, 50% more GPU memory
Seeing over 1.1 million tokens per second from a single rack of Azure ND GB300 v6 VMs feels like a milestone, even if the exact impact is still up for debate. Microsoft ran the Llama 2 70-billion-parameter model in FP4 precision as part of the MLPerf Inference v5.1 suite, spreading the workload across 18 ND GB300 v6 instances that share a single NVIDIA GB300 NVL72 rack-scale domain. The setup mimics a real-world scenario where many GPUs share the load, which could mean lower latency for users and better cost efficiency for cloud operators.
It also suggests Azure is nudging its hardware-focused VMs into the same arena as rival cloud services. The numbers grab attention, but the test details are what really matter - the VM specs, the exact methodology, and how this compares to other offerings are all laid out below.
---
The VM is optimized for inference workloads, featuring 50% more GPU memory and a 16% higher TDP (Thermal Design Power) than its GB200-based predecessor. To measure the gains, Microsoft ran the Llama 2 70B workload (in FP4 precision) from MLPerf Inference v5.1 on each of the 18 ND GB300 v6 virtual machines in one NVIDIA GB300 NVL72 domain, using NVIDIA TensorRT-LLM as the inference engine.
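For readers curious what that stack looks like in practice, here is a minimal sketch using TensorRT-LLM's high-level Python LLM API. The checkpoint path is hypothetical, and it assumes an FP4-quantized Llama 2 70B engine has already been prepared; Microsoft's MLPerf submission used a tuned benchmark harness, not a script like this.

```python
# Minimal sketch of serving Llama 2 70B through TensorRT-LLM's high-level LLM API.
# Assumptions: a local, already FP4-quantized checkpoint at a hypothetical path,
# and enough GPU memory on the node to load the 70B model.
from tensorrt_llm import LLM, SamplingParams

llm = LLM(model="/models/llama-2-70b-fp4")  # hypothetical pre-quantized checkpoint

prompts = ["Explain FP4 inference in one sentence."] * 8
sampling = SamplingParams(max_tokens=64)

for output in llm.generate(prompts, sampling):
    # Each result carries the generated completion(s) for one prompt.
    print(output.outputs[0].text)
```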
"One NVL72 rack of Azure ND GB300 v6 achieved an aggregated 1,100,000 tokens/s," said Microsoft. "This is a new record in AI inference, beating our own previous record of 865,000 tokens/s on one NVIDIA GB200 NVL72 rack with the ND GB200 v6 VMs." Since the system contains 72 Blackwell Ultra GPUs, the performance roughly translates to ~15,200 tokens/sec/GPU.
In short, the roughly 1.1 million tokens per second is an aggregate figure: an MLPerf Inference v5.1 run of Meta's Llama 2 70B in FP4 precision across 18 identical ND GB300 v6 instances, all inside one NVIDIA Blackwell Ultra GB300 NVL72 domain. Compared with the previous generation, the new VMs offer roughly 50% more GPU memory and a 16% higher TDP.
Satya Nadella called it an industry record, pointing to the joint engineering with NVIDIA and Microsoft's scale-up expertise. The VM is marketed as inference-optimized, yet the test covered only a single large language model; we haven't seen results for smaller or differently shaped models. Likewise, the higher thermal design power could affect operating costs, but that impact hasn't been quantified yet.
Still, the figures feel like a real step forward for Azure’s AI-focused hardware. I’m not sure whether the extra memory and power headroom will consistently help across a broader set of workloads. For now, the Azure ND GB300 looks like a solid high-throughput choice for anyone needing large-scale LLM inference.
Common Questions Answered
What performance record did Microsoft’s Azure ND GB300 VM achieve on the Llama 2 70‑billion‑parameter model?
Microsoft reported an aggregate throughput of over 1.1 million tokens per second from a full NVL72 rack of Azure ND GB300 v6 VMs running the Llama 2 70B model in FP4 precision. The record was measured during an MLPerf Inference v5.1 benchmark spanning 18 identical VM instances, surpassing the previous 865,000 tokens/s achieved on a GB200 NVL72 rack.
How do the Azure ND GB300 VM's GPU memory and TDP compare to previous Azure offerings?
The ND GB300 v6 VM provides 50% more GPU memory and a 16% higher Thermal Design Power (TDP) than the previous-generation ND GB200 v6 offering. These hardware upgrades, built on NVIDIA's Blackwell Ultra GB300 GPUs in an NVL72 domain, enable higher throughput for large-language-model serving.
Which inference engine and precision mode were used in the MLPerf benchmark for the Azure ND GB300 VM?
The benchmark employed NVIDIA TensorRT-LLM as the inference engine and ran the Llama 2 70B model in FP4 (four-bit floating-point) precision. This combination allowed the rack of VMs to reach the reported aggregate token rate while meeting the benchmark's accuracy requirements.
Why is achieving 1.1 million tokens per second significant for cloud operators and end‑users?
Higher token throughput reduces latency for end-users, delivering faster responses from large-language-model applications. For cloud operators, the added efficiency improves the economics of serving, since each rack can handle more inference requests per unit of hardware.
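As a rough illustration of what that capacity means in practice (the 500-token average response length below is an assumption made for the sake of the example, not a figure from the benchmark):

```python
# Hypothetical sizing sketch: responses per second one NVL72 rack could emit
# at the quoted aggregate throughput, for an assumed average response length.
rack_tokens_per_s = 1_100_000   # from the MLPerf Inference v5.1 result above
avg_output_tokens = 500         # assumption, purely for illustration

responses_per_s = rack_tokens_per_s / avg_output_tokens
print(f"~{responses_per_s:,.0f} responses/s per rack "
      f"(~{responses_per_s * 3600:,.0f} per hour)")
```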