PrfaaS: AI Nodes Optimized for LLM Peak Performance
Moonshot AI, Tsinghua unveil PrfaaS KVCache that auto‑balances LLM nodes for throughput
Moonshot AI and researchers from Tsinghua University have proposed a new cross‑datacenter KVCache system they call PrfaaS (Prefill‑as‑a‑Service). The design promises to keep large language models humming efficiently even as request loads ebb and flow. By splitting work between a “prefill” stage that processes prompts and a “decode” stage that generates tokens, the architecture can shift resources on the fly.
That flexibility matters because LLM traffic isn’t steady: bursts of short queries can crowd decode nodes, while long prompts can starve prefill capacity. In their prototype, the team wired together a PrfaaS cluster of 32 H200 GPUs and a local PD (prefill‑decode) cluster of 64 H20 GPUs. The real test, however, is how the scheduler reacts over minutes and hours; at these longer timescales it rebalances prefill and decode node counts within the local PD cluster as traffic patterns shift, keeping the system near the throughput‑optimal operating point.
The Numbers
In the case study, the two clusters are connected by a VPC network providing approximately 100 Gbps of cross‑cluster bandwidth. Under the optimal configuration, the aggregate PrfaaS egress load is approximately 13 Gbps, just 13% of the available Ethernet capacity, and the paper notes that the PrfaaS cluster remains compute‑bound with substantial bandwidth headroom to spare.
The paper also projects to larger deployments: even at the scale of a 10,000‑GPU datacenter, the aggregate egress bandwidth required for KVCache transfer totals only about 1.8 Tbps, well within the capacity of modern inter‑datacenter links. On latency, mean time to first token (TTFT) drops by 50% and P90 TTFT drops by 64% compared to the homogeneous baseline. A naive heterogeneous configuration, with all prefill on H200 and all decode on H20 but no routing or scheduling logic, achieves only 1.16× throughput over the homogeneous baseline, compared to 1.54× for the full PrfaaS‑PD system.
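The bandwidth arithmetic behind these figures can be sanity-checked with a back-of-envelope estimate. The sketch below is a hypothetical illustration, not from the paper: the model shape, prompt length, and request rate are assumptions chosen for the example, and the result is not meant to reproduce the paper's 13 Gbps or 1.8 Tbps numbers.

```python
def kvcache_egress_gbps(num_layers: int, kv_heads: int, head_dim: int,
                        bytes_per_value: int, tokens_per_request: int,
                        requests_per_sec: float) -> float:
    """Estimate aggregate KVCache egress bandwidth in Gbps.

    Assumes a GQA-style cache: each token stores one key and one value
    vector (hence the factor of 2) per layer per KV head.
    """
    bytes_per_token = 2 * num_layers * kv_heads * head_dim * bytes_per_value
    bytes_per_request = bytes_per_token * tokens_per_request
    return bytes_per_request * requests_per_sec * 8 / 1e9  # bits/s -> Gbps

# Hypothetical workload: 32-layer model, 8 KV heads of dim 128, fp16 cache,
# 4096-token prompts arriving at 10 requests per second.
print(round(kvcache_egress_gbps(32, 8, 128, 2, 4096, 10.0), 2))
```

Plugging in a real model's cache layout and measured request rate in place of these guesses gives a quick check of whether a cross-cluster link would be egress-bound or, as the paper reports, compute-bound with headroom.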
The gap between 1.16× and 1.54× isolates the contribution of the scheduling layer: roughly (1.54 − 1.16) / (1.54 − 1.00) ≈ 70% of the practical gain comes from routing and scheduling rather than from the hardware split alone. The research team positions PrfaaS not as a near‑future concept but as a design that is viable today for hybrid‑architecture models. They further argue that as context windows grow, KVCache compression techniques mature, and phase‑specialized hardware such as NVIDIA's Rubin CPX for prefill and LPU‑style chips for decode becomes more widely available, the case for cross‑datacenter PD disaggregation will only strengthen.
The paper proposes a cross‑datacenter KVCache that lets prefill and decode run on separate clusters. By treating prefill as a service, the design hopes to free LLM inference from the “single‑box” limitation imposed by current RDMA‑centric deployments. A scheduler periodically shifts node counts between the local PD cluster and the PrfaaS cluster, aiming to keep throughput near an optimal point as traffic patterns evolve.
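The periodic rebalancing described above can be pictured as a simple control loop. Everything in this sketch is a hypothetical illustration rather than the authors' scheduler: the utilization signals, dead-band threshold, and one-node-at-a-time policy are assumptions made for the example.

```python
def rebalance(prefill_util: float, decode_util: float,
              prefill_nodes: int, decode_nodes: int,
              threshold: float = 0.15, min_nodes: int = 1) -> tuple[int, int]:
    """One step of a hypothetical prefill/decode rebalancer.

    If one phase is markedly more utilized than the other, shift a single
    node toward it; otherwise leave the current split unchanged.
    """
    gap = prefill_util - decode_util
    if gap > threshold and decode_nodes > min_nodes:
        return prefill_nodes + 1, decode_nodes - 1   # prefill is the bottleneck
    if gap < -threshold and prefill_nodes > min_nodes:
        return prefill_nodes - 1, decode_nodes + 1   # decode is the bottleneck
    return prefill_nodes, decode_nodes               # within the dead band

# Long prompts have saturated prefill: one decode node moves over.
print(rebalance(0.95, 0.60, prefill_nodes=24, decode_nodes=40))
```

Run every few minutes against smoothed utilization metrics, a loop of this shape would nudge the node split toward the traffic mix without thrashing, which is the behavior the case study attributes to the scheduler.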
In the reported case study, the researchers paired 32 H200 GPUs for prefill with 64 H20 GPUs handling decode, and observed the system staying close to its target operating region. Yet the evaluation is limited to a single configuration; it is unclear whether the approach scales to larger, more heterogeneous environments, or how it performs under real‑world load spikes. The authors argue the architecture is ready to exploit emerging network flexibility, but broader validation across varied workloads is still pending.
For now, the evidence suggests a modest throughput benefit in the tested setup, while the broader impact on LLM serving infrastructure is still uncertain.
Further Reading
- Prefill-as-a-Service: KVCache of Next-Generation Models Could Go Cross-Datacenter - arXiv
- Mooncake: Kimi's KVCache-centric Architecture for LLM Serving - AlphaXiv
- A KVCache-centric Architecture for Serving LLM Chatbot - USENIX
- PrfaaS: Cross-Datacenter LLM Serving via KVCache - YouTube (AI Research Roundup)
- KVCache.AI - MADSys Research Project - Tsinghua University MADSys
Common Questions Answered
How does PrfaaS improve large language model inference efficiency?
PrfaaS splits LLM processing into separate “prefill” and “decode” stages, allowing dynamic resource allocation across clusters. The system can automatically rebalance node counts as traffic patterns change, maintaining near-optimal throughput and addressing the limitations of traditional single-box RDMA deployments.
What hardware configuration did Moonshot AI and Tsinghua use in their PrfaaS case study?
The researchers configured a cluster with 32 H200 GPUs for prefill processing and a local PD cluster with 64 H20 GPUs, interconnected by a VPC network providing approximately 100 Gbps of cross-cluster bandwidth. This setup allows for flexible resource allocation and efficient large language model inference.
Why is the ability to dynamically shift resources between prefill and decode nodes important for LLM performance?
LLM traffic is inherently unpredictable, with varying loads of short and long queries that can crowd specific processing nodes. By periodically rebalancing node counts, PrfaaS can maintain system efficiency and keep throughput near the optimal operating point as traffic patterns evolve.