PrfaaS: AI Nodes Optimized for LLM Peak Performance
Moonshot AI, Tsinghua unveil PrfaaS KVCache that auto‑balances LLM nodes for throughput
Moonshot AI and researchers from Tsinghua University have proposed a new cross‑datacenter KVCache system they call PrfaaS (Prefill‑as‑a‑Service). The design promises to keep large language models humming efficiently even as request loads ebb and flow. By splitting work between a “prefill” stage that processes prompts and a “decode” stage that generates tokens, the architecture can shift resources on the fly.
That flexibility matters because LLM traffic isn’t steady: bursts of short queries can crowd decode nodes, while long prompts can starve prefill capacity. In their prototype, the team wired together a PrfaaS cluster of 32 H200 GPUs and a local PD (prefill‑decode) cluster of 64 H20 GPUs. The real test, however, is how the scheduler reacts over minutes and hours; at these longer timescales it rebalances prefill and decode node counts within the local PD cluster as traffic patterns shift, keeping the system near the throughput‑optimal operating point.
The Numbers
In the case study, the two clusters are connected by a VPC network providing approximately 100 Gbps of cross‑cluster bandwidth. Under the optimal configuration, the aggregate PrfaaS egress load is approximately 13 Gbps, just 13% of the available Ethernet capacity, and the paper notes that the PrfaaS cluster remains compute‑bound with substantial bandwidth headroom to spare.
The paper also projects to larger deployments: even at the scale of a 10,000‑GPU datacenter, the aggregate egress bandwidth required for KVCache transfer totals only about 1.8 Tbps, well within the capacity of modern inter‑datacenter links. On latency, mean time to first token (TTFT) drops by 50% and P90 TTFT drops by 64% compared to the homogeneous baseline. A naive heterogeneous configuration, with all prefill on H200 and all decode on H20 but no routing or scheduling logic, achieves only 1.16× throughput over the homogeneous baseline, compared to 1.54× for the full PrfaaS‑PD system.
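The bandwidth arithmetic behind these figures can be sanity-checked with a back-of-envelope estimate. The sketch below is a hypothetical illustration, not from the paper: the model shape, prompt length, and request rate are assumptions chosen for the example, and the result is not meant to reproduce the paper's 13 Gbps or 1.8 Tbps numbers.

```python
def kvcache_egress_gbps(num_layers: int, kv_heads: int, head_dim: int,
                        bytes_per_value: int, tokens_per_request: int,
                        requests_per_sec: float) -> float:
    """Estimate aggregate KVCache egress bandwidth in Gbps.

    Assumes a GQA-style cache: each token stores one key and one value
    vector (hence the factor of 2) per layer per KV head.
    """
    bytes_per_token = 2 * num_layers * kv_heads * head_dim * bytes_per_value
    bytes_per_request = bytes_per_token * tokens_per_request
    return bytes_per_request * requests_per_sec * 8 / 1e9  # bits/s -> Gbps

# Hypothetical workload: 32-layer model, 8 KV heads of dim 128, fp16 cache,
# 4096-token prompts arriving at 10 requests per second.
print(round(kvcache_egress_gbps(32, 8, 128, 2, 4096, 10.0), 2))
```

Plugging in a real model's cache layout and measured request rate in place of these guesses gives a quick check of whether a cross-cluster link would be egress-bound or, as the paper reports, compute-bound with headroom.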
The gap between 1.16× and 1.54× isolates the contribution of the scheduling layer: roughly (1.54 − 1.16) / (1.54 − 1.00) ≈ 70% of the practical gain comes from routing and scheduling rather than from the hardware split alone. The research team positions PrfaaS not as a near‑future concept but as a design that is viable today for hybrid‑architecture models. They further argue that as context windows grow, KVCache compression techniques mature, and phase‑specialized hardware such as NVIDIA's Rubin CPX for prefill and LPU‑style chips for decode becomes more widely available, the case for cross‑datacenter PD disaggregation will only strengthen.
The paper proposes a cross‑datacenter KVCache that lets prefill and decode run on separate clusters. By treating prefill as a service, the design hopes to free LLM inference from the “single‑box” limitation imposed by current RDMA‑centric deployments. A scheduler periodically shifts node counts between the local PD cluster and the PrfaaS cluster, aiming to keep throughput near an optimal point as traffic patterns evolve.
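The periodic rebalancing described above can be pictured as a simple control loop. Everything in this sketch is a hypothetical illustration rather than the authors' scheduler: the utilization signals, dead-band threshold, and one-node-at-a-time policy are assumptions made for the example.

```python
def rebalance(prefill_util: float, decode_util: float,
              prefill_nodes: int, decode_nodes: int,
              threshold: float = 0.15, min_nodes: int = 1) -> tuple[int, int]:
    """One step of a hypothetical prefill/decode rebalancer.

    If one phase is markedly more utilized than the other, shift a single
    node toward it; otherwise leave the current split unchanged.
    """
    gap = prefill_util - decode_util
    if gap > threshold and decode_nodes > min_nodes:
        return prefill_nodes + 1, decode_nodes - 1   # prefill is the bottleneck
    if gap < -threshold and prefill_nodes > min_nodes:
        return prefill_nodes - 1, decode_nodes + 1   # decode is the bottleneck
    return prefill_nodes, decode_nodes               # within the dead band

# Long prompts have saturated prefill: one decode node moves over.
print(rebalance(0.95, 0.60, prefill_nodes=24, decode_nodes=40))
```

Run every few minutes against smoothed utilization metrics, a loop of this shape would nudge the node split toward the traffic mix without thrashing, which is the behavior the case study attributes to the scheduler.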
In the reported case study, the researchers paired 32 H200 GPUs for prefill with 64 H20 GPUs handling decode, and observed the system staying close to its target operating region. Yet the evaluation is limited to a single configuration; it is unclear whether the approach scales to larger, more heterogeneous environments, or how it performs under real‑world load spikes. The authors argue the architecture is ready to exploit emerging network flexibility, but broader validation across varied workloads is still pending.
For now, the evidence suggests a modest throughput benefit in the tested setup, while the broader impact on LLM serving infrastructure is still uncertain.
Further Reading
- Prefill-as-a-Service: KVCache of Next-Generation Models Could Go Cross-Datacenter - arXiv
- Mooncake: Kimi's KVCache-centric Architecture for LLM Serving - AlphaXiv
- A KVCache-centric Architecture for Serving LLM Chatbot - USENIX
- PrfaaS: Cross-Datacenter LLM Serving via KVCache - YouTube (AI Research Roundup)
- KVCache.AI - MADSys Research Project - Tsinghua University MADSys
Common Questions Answered
How does PrfaaS improve large language model inference efficiency?
PrfaaS splits LLM processing into separate “prefill” and “decode” stages, allowing dynamic resource allocation across clusters. The system can automatically rebalance node counts as traffic patterns change, maintaining near-optimal throughput and addressing the limitations of traditional single-box RDMA deployments.
What hardware configuration did Moonshot AI and Tsinghua use in their PrfaaS case study?
The researchers configured a cluster with 32 H200 GPUs for prefill processing and a local PD cluster with 64 H20 GPUs, interconnected by a VPC network providing approximately 100 Gbps of cross-cluster bandwidth. This setup allows for flexible resource allocation and efficient large language model inference.
Why is the ability to dynamically shift resources between prefill and decode nodes important for LLM performance?
LLM traffic is inherently unpredictable, with varying loads of short and long queries that can crowd specific processing nodes. By periodically rebalancing node counts, PrfaaS can maintain system efficiency and keep throughput near the optimal operating point as traffic patterns evolve.