Run:ai orchestrates 64 GPUs to serve 10,200 concurrent users, matching NVIDIA's native scheduler for AI/ML inference workloads (run.ai).


GPU Fractioning Boosts LLM Inference Efficiency 3x

Run:ai on 64 GPUs serves 10,200 users, matching native scheduler


Why does the raw capacity of a GPU cluster matter when you can slice it into smaller pieces? NVIDIA’s Run:ai platform promises exactly that: splitting a single GPU into fractional units while still handling the same workload volume. The benchmark measured how many simultaneous users the system could sustain when the full 64‑GPU pool was allocated as whole GPUs versus when the same pool was divided into half‑GPU slices.

The experiment also compared Run:ai’s scheduler against NVIDIA’s native scheduler to see whether the extra software layer introduced any latency or bottleneck. Numbers matter here: the test tracked concurrent user counts, token‑throughput rates, and the overhead (or lack thereof) introduced by the scheduling logic. By laying out the scaling curve from full‑GPU to 0.5‑GPU configurations, the study aims to answer whether fractional GPU usage can truly match the performance of traditional, unsplit deployments without sacrificing efficiency.
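As a rough illustration of the measurement described above, the sketch below searches for the largest concurrency level whose time to first token (TTFT) stays within a one‑second budget. This is not the harness NVIDIA and Nebius used; `probe_ttft` is a placeholder latency model that a real test would replace with actual load generation against the serving endpoint.

```python
# Hypothetical sketch: find the highest concurrency that keeps TTFT under a
# 1,000 ms budget. probe_ttft() is a stand-in for real load generation; it is
# NOT the harness used in the NVIDIA/Nebius benchmark.

TTFT_BUDGET_MS = 1_000

def probe_ttft(concurrent_users: int) -> float:
    """Placeholder latency model: pretend TTFT grows with load.

    Replace with a real measurement, e.g. drive `concurrent_users` parallel
    requests and return the worst observed time to first token in ms.
    """
    return 50.0 + 0.09 * concurrent_users  # purely illustrative numbers

def max_users_within_slo(lo: int = 1, hi: int = 20_000) -> int:
    """Binary-search the largest user count whose TTFT stays within budget."""
    best = 0
    while lo <= hi:
        mid = (lo + hi) // 2
        if probe_ttft(mid) <= TTFT_BUDGET_MS:
            best = mid          # SLO still met: try more users
            lo = mid + 1
        else:
            hi = mid - 1        # SLO violated: back off
    return best

if __name__ == "__main__":
    print(f"Max concurrent users within {TTFT_BUDGET_MS} ms TTFT:",
          max_users_within_slo())
```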

At 64 GPUs, NVIDIA Run:ai with full GPU allocation delivered 10,200 concurrent users versus 9,934 for the native scheduler, confirming that the scheduler itself adds no overhead. On the fractional side, the 0.5 GPU configuration at 64 GPUs supported 8,768 concurrent users while keeping each user's time to first token (TTFT) under one second (1,000 ms), about 86% of the full GPU capacity of 10,200 concurrent users. This demonstrates that fractional allocation introduces only a modest performance trade-off, enabling enterprises to run multiple models on shared GPUs or scale deployments more granularly without significant capacity loss (Figure 2 in the NVIDIA post).
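The headline ratios follow directly from the published counts; a few lines of arithmetic using only the numbers quoted above reproduce them:

```python
# Reproduce the headline ratios from the figures quoted above.
full_gpu_ccu = 10_200       # Run:ai, full GPU allocation, 64 GPUs
native_ccu = 9_934          # NVIDIA native scheduler, same pool
half_gpu_ccu = 8_768        # Run:ai, 0.5 GPU slices, TTFT <= 1,000 ms

scheduler_delta = (full_gpu_ccu - native_ccu) / native_ccu
fractional_retention = half_gpu_ccu / full_gpu_ccu

print(f"Run:ai vs native scheduler: {scheduler_delta:+.1%}")      # about +2.7%
print(f"0.5 GPU capacity retention: {fractional_retention:.0%}")  # about 86%
```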

Does the data prove Run:ai’s claim of seamless scaling? At 64 GPUs, the platform delivered 10,200 concurrent users with full‑GPU allocation, edging out the native scheduler’s 9,934 and showing no measurable overhead from the Run:ai scheduler itself. That figure suggests the software can match traditional approaches while offering extra flexibility.

When the pool was instead divided into 0.5 GPU slices, the system still supported 8,768 concurrent users, demonstrating that fractional allocation does not collapse under load. The benchmark, a joint effort with Nebius, confirms Run:ai’s ability to handle large‑language‑model traffic across cloud, NVIDIA Cloud Partner (NCP), and on‑premises environments. However, the results stop at 64 GPUs; performance beyond that size, or under different model families, remains undocumented.

Likewise, beyond the sub‑second TTFT bound, detailed latency and token‑throughput figures were not disclosed, leaving open questions about real‑world efficiency. In short, the numbers validate the scheduler’s baseline functionality and show promising fractional‑GPU utilization, yet further testing will be needed to clarify its behavior in broader scenarios.


Common Questions Answered

How do GPU fractions improve resource utilization in large language model (LLM) inference?

[developer.nvidia.com](https://developer.nvidia.com/blog/unlock-massive-token-throughput-with-gpu-fractioning-in-nvidia-runai/) shows that GPU fractioning allows up to 3x more total system users when running mixed workloads on shared GPUs. The approach enables organizations to dramatically increase effective GPU capacity without compromising latency, achieving 77% of full GPU throughput using only a 0.5 GPU fraction.
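Taking the 77% figure at face value, the packing gain comes from aggregate throughput per physical GPU: two 0.5 GPU fractions, each retaining roughly 77% of full‑GPU throughput, add up to about 1.5x per card. The snippet below is a back‑of‑the‑envelope check of that arithmetic, not a reproduction of the mixed‑workload 3x result, which depends on the specific workload mix in the source benchmark.

```python
# Back-of-the-envelope: aggregate throughput per physical GPU when it is split
# into two 0.5 fractions, each retaining ~77% of full-GPU throughput (figure
# quoted above). The 3x "total system users" number for mixed workloads depends
# on workload mix and is not derived here.
relative_throughput_at_half_gpu = 0.77
fractions_per_gpu = 2

aggregate_per_gpu = fractions_per_gpu * relative_throughput_at_half_gpu
print(f"Aggregate throughput per GPU: {aggregate_per_gpu:.2f}x of one full GPU")
# -> 1.54x: the same card serves ~54% more load when shared by two instances.
```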

What performance benefits did the NVIDIA and Nebius joint benchmarking reveal about fractional GPU allocation?

The benchmarking demonstrated near-linear throughput scaling across 0.5, 0.25, and 0.125 GPU fractions with modest time to first token (TTFT) impact. The results showed up to 2x more concurrent inference users on smaller models using 0.25 GPU fractions, with time to first token consistently under one second.

Why do enterprise IT departments struggle with traditional GPU allocation for LLM inference?

[developer.nvidia.com](https://developer.nvidia.com/blog/unlock-massive-token-throughput-with-gpu-fractioning-in-nvidia-runai/) highlights that enterprises typically need to allocate a dedicated GPU to a single LLM instance, even during sporadic traffic. This approach leads to inefficient resource utilization, as GPUs remain largely idle during periods of low demand, making fractional GPU scheduling a critical optimization technique for production environments.