Run:ai orchestrates 64 GPUs to serve 10,200 concurrent users, matching NVIDIA's native scheduler for AI/ML inference workloads (run.ai).


GPU Fractioning Boosts LLM Inference Efficiency 3x

Run:ai on 64 GPUs serves 10,200 users, matching native scheduler


Why does the raw capacity of a GPU cluster matter when you can slice it into smaller pieces? NVIDIA’s Run:ai platform promises exactly that: splitting a single GPU into fractional units while still handling the same workload volume. The benchmark measured how many simultaneous users the system could sustain when the full 64‑GPU pool was allocated as whole GPUs versus when the same pool was divided into half‑GPU slices.

The experiment also compared Run:ai’s scheduler against NVIDIA’s native scheduler to see whether the extra software layer introduced any latency or bottleneck. Numbers matter here: the test tracked concurrent user counts, token‑throughput rates, and the overhead (or lack thereof) introduced by the scheduling logic. By laying out the scaling curve from full‑GPU to 0.5‑GPU configurations, the study aims to answer whether fractional GPU usage can truly match the performance of traditional, unsplit deployments without sacrificing efficiency.
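As a rough illustration of the measurement described above, the sketch below searches for the largest concurrency level whose time to first token (TTFT) stays within a one‑second budget. This is not the harness NVIDIA and Nebius used; `probe_ttft` is a placeholder latency model that a real test would replace with actual load generation against the serving endpoint.

```python
# Hypothetical sketch: find the highest concurrency that keeps TTFT under a
# 1,000 ms budget. probe_ttft() is a stand-in for real load generation; it is
# NOT the harness used in the NVIDIA/Nebius benchmark.

TTFT_BUDGET_MS = 1_000

def probe_ttft(concurrent_users: int) -> float:
    """Placeholder latency model: pretend TTFT grows with load.

    Replace with a real measurement, e.g. drive `concurrent_users` parallel
    requests and return the worst observed time to first token in ms.
    """
    return 50.0 + 0.09 * concurrent_users  # purely illustrative numbers

def max_users_within_slo(lo: int = 1, hi: int = 20_000) -> int:
    """Binary-search the largest user count whose TTFT stays within budget."""
    best = 0
    while lo <= hi:
        mid = (lo + hi) // 2
        if probe_ttft(mid) <= TTFT_BUDGET_MS:
            best = mid          # SLO still met: try more users
            lo = mid + 1
        else:
            hi = mid - 1        # SLO violated: back off
    return best

if __name__ == "__main__":
    print(f"Max concurrent users within {TTFT_BUDGET_MS} ms TTFT:",
          max_users_within_slo())
```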

At 64 GPUs, NVIDIA Run:ai with full GPU allocation delivered 10,200 concurrent users versus 9,934 for the native scheduler, confirming that the scheduler itself adds no overhead. On the fractional side, the 0.5 GPU configuration at 64 GPUs supported 8,768 concurrent users while keeping each user's time to first token (TTFT) under one second (1,000 ms), about 86% of the full GPU capacity of 10,200 concurrent users. This demonstrates that fractional allocation introduces only a modest performance trade-off, enabling enterprises to run multiple models on shared GPUs or scale deployments more granularly without significant capacity loss (Figure 2 in the NVIDIA post).
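The headline ratios follow directly from the published counts; a few lines of arithmetic using only the numbers quoted above reproduce them:

```python
# Reproduce the headline ratios from the figures quoted above.
full_gpu_ccu = 10_200       # Run:ai, full GPU allocation, 64 GPUs
native_ccu = 9_934          # NVIDIA native scheduler, same pool
half_gpu_ccu = 8_768        # Run:ai, 0.5 GPU slices, TTFT <= 1,000 ms

scheduler_delta = (full_gpu_ccu - native_ccu) / native_ccu
fractional_retention = half_gpu_ccu / full_gpu_ccu

print(f"Run:ai vs native scheduler: {scheduler_delta:+.1%}")      # about +2.7%
print(f"0.5 GPU capacity retention: {fractional_retention:.0%}")  # about 86%
```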

Does the data prove Run:ai’s claim of seamless scaling? At 64 GPUs, the platform delivered 10,200 concurrent users with full‑GPU allocation, edging out the native scheduler’s 9,934 and showing no measurable overhead from the Run:ai scheduler itself. That figure suggests the software can match traditional approaches while offering extra flexibility.

When the pool was instead divided into 0.5 GPU slices, the system still supported 8,768 concurrent users, demonstrating that fractional allocation does not collapse under load. The benchmark, a joint effort with Nebius, confirms Run:ai’s ability to handle large‑language‑model traffic across cloud, NVIDIA Cloud Partner (NCP), and on‑premises environments. However, the results stop at 64 GPUs; performance beyond that size, or under different model families, remains undocumented.

Likewise, beyond the sub‑second TTFT bound, detailed latency and token‑throughput figures were not disclosed, leaving open questions about real‑world efficiency. In short, the numbers validate the scheduler’s baseline functionality and show promising fractional‑GPU utilization, yet further testing will be needed to clarify its behavior in broader scenarios.


Common Questions Answered

How do GPU fractions improve resource utilization in large language model (LLM) inference?

[developer.nvidia.com](https://developer.nvidia.com/blog/unlock-massive-token-throughput-with-gpu-fractioning-in-nvidia-runai/) shows that GPU fractioning allows up to 3x more total system users when running mixed workloads on shared GPUs. The approach enables organizations to dramatically increase effective GPU capacity without compromising latency, achieving 77% of full GPU throughput using only a 0.5 GPU fraction.
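Taking the 77% figure at face value, the packing gain comes from aggregate throughput per physical GPU: two 0.5 GPU fractions, each retaining roughly 77% of full‑GPU throughput, add up to about 1.5x per card. The snippet below is a back‑of‑the‑envelope check of that arithmetic, not a reproduction of the mixed‑workload 3x result, which depends on the specific workload mix in the source benchmark.

```python
# Back-of-the-envelope: aggregate throughput per physical GPU when it is split
# into two 0.5 fractions, each retaining ~77% of full-GPU throughput (figure
# quoted above). The 3x "total system users" number for mixed workloads depends
# on workload mix and is not derived here.
relative_throughput_at_half_gpu = 0.77
fractions_per_gpu = 2

aggregate_per_gpu = fractions_per_gpu * relative_throughput_at_half_gpu
print(f"Aggregate throughput per GPU: {aggregate_per_gpu:.2f}x of one full GPU")
# -> 1.54x: the same card serves ~54% more load when shared by two instances.
```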

What performance benefits did the NVIDIA and Nebius joint benchmarking reveal about fractional GPU allocation?

The benchmarking demonstrated near-linear throughput scaling across 0.5, 0.25, and 0.125 GPU fractions with modest time to first token (TTFT) impact. The results showed up to 2x more concurrent inference users on smaller models using 0.25 GPU fractions, with time to first token consistently under one second.

Why do enterprise IT departments struggle with traditional GPU allocation for LLM inference?

[developer.nvidia.com](https://developer.nvidia.com/blog/unlock-massive-token-throughput-with-gpu-fractioning-in-nvidia-runai/) highlights that enterprises typically need to allocate a dedicated GPU to a single LLM instance, even during sporadic traffic. This approach leads to inefficient resource utilization, as GPUs remain largely idle during periods of low demand, making fractional GPU scheduling a critical optimization technique for production environments.