M* introduces overlapped scheduling to streamline multimodal model serving

Researchers at Stanford and the University of Washington have teamed up to tackle a fast-growing gap in AI infrastructure. Most LLM serving stacks, such as vLLM and SGLang, still assume a single autoregressive loop (prefill, then token-by-token decode), but the newest multimodal models break that mold. Systems like BAGEL, Orpheus, Qwen3-Omni, π0.5 and V-JEPA 2 stitch together vision encoders, transformer backbones, diffusion or flow heads, audio codecs and world-model predictors into dataflow graphs that change with each request.

The result is a mix of non‑AR loops, internal parallelism and input‑dependent paths that existing runtimes can’t schedule efficiently. M* is built for that reality. It treats every request as a walk on a composite graph, offering a single runtime that can handle vision‑language, speech‑language and world‑model workloads alike.
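To make the "walk on a composite graph" idea concrete, here is a minimal sketch in plain Python. It is not M*'s API: the node names, the `GRAPH` table and the `walk` function are all hypothetical, chosen only to show a request whose path depends on its input and which contains a non-autoregressive loop.

```python
from typing import Callable, Dict, List, Optional

# Hypothetical node type: mutates the request state and returns the name
# of the next node to visit, or None when the request is finished.
Node = Callable[[dict], Optional[str]]

def vision_encoder(state: dict) -> Optional[str]:
    state["trace"].append("vision_encoder")
    return "backbone"

def backbone(state: dict) -> Optional[str]:
    state["trace"].append("backbone")
    # Input-dependent routing: the requested output decides the path.
    return "diffusion_head" if state["want"] == "image" else "text_decode"

def diffusion_head(state: dict) -> Optional[str]:
    # A non-autoregressive loop: revisit this node a fixed number of times.
    state["steps"] = state.get("steps", 0) + 1
    state["trace"].append("diffusion_head")
    return "diffusion_head" if state["steps"] < 3 else None

def text_decode(state: dict) -> Optional[str]:
    state["trace"].append("text_decode")
    return None

GRAPH: Dict[str, Node] = {
    "vision_encoder": vision_encoder,
    "backbone": backbone,
    "diffusion_head": diffusion_head,
    "text_decode": text_decode,
}

def walk(graph: Dict[str, Node], start: str, state: dict) -> List[str]:
    """Follow the graph from `start` until a node returns None."""
    node: Optional[str] = start
    while node is not None:
        node = graph[node](state)
    return state["trace"]

image_path = walk(GRAPH, "vision_encoder", {"want": "image", "trace": []})
text_path = walk(GRAPH, "vision_encoder", {"want": "text", "trace": []})
```

Two requests entering at the same node end up on different paths, which is the property a single fixed prefill-then-decode loop cannot express.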

In early benchmarks on the Qwen3‑Omni TTS task, M* delivered almost 2.7× the throughput of vLLM‑Omni and four times that of SGLang‑Omni, while keeping real‑time factor lower than both. The code is on GitHub, the paper on arXiv, and the authors can be reached at [email protected].
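For readers unfamiliar with the metric, real-time factor is conventionally the time spent synthesizing audio divided by the duration of the audio produced, so lower is better and values below 1 mean faster than real time. A small sketch of that standard definition follows; the function name is ours and the numbers are made up for illustration, not taken from the benchmark.

```python
def real_time_factor(synthesis_seconds: float, audio_seconds: float) -> float:
    """Conventional RTF: processing time divided by output audio duration."""
    if audio_seconds <= 0:
        raise ValueError("audio duration must be positive")
    return synthesis_seconds / audio_seconds

# Illustrative values only: 2 s of compute for 8 s of speech.
example_rtf = real_time_factor(2.0, 8.0)
```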

- Overlapped scheduling: while the current step runs on the GPU, M* prepares the next batch and its attention plan on a separate stream, and keeps loops moving by deferring each stop check by one iteration. This is implemented generically over the Loop primitive (not just for text or speculative decoding), so the GPU rarely stalls on CPU scheduling.
- Sharding × disaggregation: tensor-parallel sharding (parallel linears, vocab-parallel embeddings, sharded MoE and KV cache, NCCL collectives) is built in and set with a tp_size in the placement file, so one large component doesn't have to fit on one GPU.
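The overlapped-scheduling idea can be sketched without a GPU. In the toy below, a single-worker thread pool stands in for the GPU stream, `prepare_batch` stands in for CPU-side scheduling, and the stop check for step N is only evaluated after step N+1 has already been launched. All names are hypothetical and this is not M*'s implementation; it only illustrates the one-iteration deferral and its cost of one extra launched step.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import List, Tuple

def prepare_batch(step: int) -> dict:
    """Stand-in for CPU work: batch assembly and attention planning."""
    return {"step": step}

def gpu_step(batch: dict) -> int:
    """Stand-in for one GPU step; the 'token' is just the step index."""
    return batch["step"]

def run_loop(stop_token: int, max_steps: int = 100) -> Tuple[List[int], int]:
    outputs: List[int] = []
    launched = 0
    # One worker models a single in-order GPU stream.
    with ThreadPoolExecutor(max_workers=1) as gpu:
        pending = gpu.submit(gpu_step, prepare_batch(0))
        launched += 1
        for step in range(1, max_steps + 1):
            # Prepare and enqueue step N+1 before step N's result is read,
            # so CPU scheduling overlaps with the running step.
            next_pending = gpu.submit(gpu_step, prepare_batch(step))
            launched += 1
            token = pending.result()
            outputs.append(token)
            # Deferred stop check: by the time we see the stop token,
            # one extra step is already in flight and is discarded.
            if token == stop_token:
                next_pending.result()
                break
            pending = next_pending
    return outputs, launched

tokens, launched_steps = run_loop(stop_token=3)
```

The trade-off is visible in the return value: the loop launches one more step than it keeps, in exchange for never waiting on the scheduler between steps.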

Matching or beating specialized systems

The authors instantiate M* on five real models and compare against the strongest specialized baseline for each. For image generation and editing (Figure 5), M* runs BAGEL's three-way classifier-free guidance as a Parallel block spread across three GPUs, and finishes faster than every vLLM-Omni configuration: about 1.3× lower end-to-end latency on text-to-image, and up to 2.6× on image editing versus vLLM-Omni's default pipeline.
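A rough sketch of why three-way guidance parallelizes well: the three conditioned forward passes are independent until they are combined. The example below uses three threads as stand-ins for three GPUs and an InstructPix2Pix-style two-scale combination. Both the combination formula and every name here are assumptions for illustration; the source does not specify BAGEL's exact guidance rule or M*'s Parallel block interface.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import List

def denoise(latent: List[float], condition: str) -> List[float]:
    """Toy stand-in for one forward pass under a given conditioning."""
    offset = {"none": 0.0, "image": 1.0, "image+text": 2.0}[condition]
    return [x + offset for x in latent]

def guided_step(latent: List[float], s_img: float = 1.5, s_txt: float = 4.0) -> List[float]:
    conditions = ["none", "image", "image+text"]
    # The three branches do not depend on each other, so they can run
    # concurrently (here: threads; in a real system: separate devices).
    with ThreadPoolExecutor(max_workers=3) as pool:
        e_none, e_img, e_full = pool.map(lambda c: denoise(latent, c), conditions)
    # Assumed two-scale guidance: start from the unconditional prediction,
    # then push along the image direction and the text direction.
    return [
        u + s_img * (i - u) + s_txt * (f - i)
        for u, i, f in zip(e_none, e_img, e_full)
    ]

guided = guided_step([0.0, 1.0])
```

Run sequentially, the same step would cost three forward passes back to back; spread over three workers, its latency approaches that of a single pass plus the combine.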

Why this matters

M* offers a new way to serve multimodal models that no longer rely on simple token streams. Can this approach keep pace with the rapid growth of composite AI pipelines? By treating each model as a dataflow graph, the system lets requests traverse arbitrary pipelines, which could simplify deployment of composite architectures.

Overlapped scheduling is the headline feature: while one step executes on the GPU, a separate stream builds the next batch and its attention plan, postponing stop checks by one iteration. This generic implementation over the Loop primitive suggests it could work beyond text or speculative decoding. For developers, the promise of tighter loops and fewer idle GPU cycles is attractive, yet the paper does not quantify latency reductions or resource trade‑offs.

Founders may see a path to more efficient inference services, but integration with existing stacks remains unclear. Researchers will likely probe whether the approach scales to larger graphs or heterogeneous hardware. In short, M* pushes serving toward the flexibility of modern models, though its real‑world impact will depend on empirical performance data and ecosystem support.
