DFlash drafts whole token blocks, achieving 15× throughput on NVIDIA Blackwell
Autoregressive large‑language models still write one token after another, forcing each new word to wait for its predecessor. That serial loop leaves modern GPUs half‑asleep and makes inference sluggish, especially when a model is asked to produce long chain‑of‑thought explanations. The latency of those lengthy outputs quickly becomes the bottleneck.
Enter speculative decoding, the go‑to workaround. A small draft model proposes candidate tokens, and the big target model checks them in parallel. Tokens that pass verification are kept and the first one that fails is replaced by the target’s own choice, so the result is lossless: the output distribution matches that of the target model alone. Yet even the newest contender, EAGLE‑3, drafts tokens one at a time, so real‑world speedups hover around two to three times.
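The draft-then-verify loop can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the implementation of EAGLE‑3 or any real library: `target_model` and `draft_model` are hypothetical toy functions, decoding is greedy, and the verification loop stands in for what a real system would do in one batched forward pass of the target.

```python
from typing import Callable, List

Model = Callable[[List[int]], int]  # context -> next token (greedy)

def target_model(ctx: List[int]) -> int:
    # Hypothetical stand-in for the large model: a deterministic toy rule.
    return (sum(ctx) * 31 + len(ctx)) % 50

def draft_model(ctx: List[int]) -> int:
    # Hypothetical cheap drafter: agrees with the target except when
    # the context length is a multiple of 4.
    guess = target_model(ctx)
    return guess if len(ctx) % 4 else (guess + 1) % 50

def speculative_step(ctx: List[int], draft: Model, target: Model, k: int) -> List[int]:
    # 1) Draft k tokens one at a time (the sequential style described above).
    proposal: List[int] = []
    for _ in range(k):
        proposal.append(draft(ctx + proposal))
    # 2) Verify: keep the longest prefix the target agrees with.
    accepted: List[int] = []
    for tok in proposal:
        expected = target(ctx + accepted)
        if tok != expected:
            accepted.append(expected)  # replace the first miss with the target's token
            return accepted
        accepted.append(tok)
    accepted.append(target(ctx + accepted))  # all drafts accepted: one bonus token
    return accepted

def generate(prompt: List[int], n_tokens: int, k: int = 4) -> List[int]:
    out = list(prompt)
    while len(out) - len(prompt) < n_tokens:
        out.extend(speculative_step(out, draft_model, target_model, k))
    return out[: len(prompt) + n_tokens]

def greedy(prompt: List[int], n_tokens: int) -> List[int]:
    # Plain autoregressive decoding with the target only, for comparison.
    out = list(prompt)
    for _ in range(n_tokens):
        out.append(target_model(out))
    return out
```

Because every kept token equals what the target would have chosen at that position, `generate` and `greedy` return the same sequence in this toy setup; the drafter only changes how many target passes are needed, not the output.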
DFlash, a project from UC San Diego’s z‑lab, takes a different tack. Built as a lightweight block‑diffusion model, it drafts an entire block of tokens in a single forward pass, after which the target model verifies the whole block concurrently. The team reports more than six‑fold lossless acceleration across various models and tasks, topping EAGLE‑3 by up to 2.5×. NVIDIA’s engineers add that on Blackwell hardware, gpt‑oss‑120b can reach up to 15× higher throughput while keeping the same interactive latency.
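To make the difference from token-by-token drafting concrete, here is a toy sketch of block drafting. Everything in it is an assumption for illustration: `draft_block` is a hypothetical function that returns a whole block from one call (and cheats by consulting the toy target rule, with one deliberate error), whereas DFlash’s actual drafter is a separate block‑diffusion network whose internals are not shown in this article.

```python
from typing import List, Tuple

def target_next(ctx: List[int]) -> int:
    # Hypothetical stand-in for the target model's greedy next token.
    return (sum(ctx) * 31 + len(ctx)) % 50

def draft_block(ctx: List[int], block_size: int) -> List[int]:
    # Hypothetical block drafter: ONE call yields block_size tokens.
    # Toy behaviour: correct for the first five positions, wrong at index 5.
    block: List[int] = []
    running = list(ctx)
    for i in range(block_size):
        tok = target_next(running)
        if i == 5:
            tok = (tok + 1) % 50  # injected mistake
        block.append(tok)
        running.append(tok)
    return block

def verify_block(ctx: List[int], block: List[int]) -> List[int]:
    # Conceptually one target pass over the whole block: keep the longest
    # agreeing prefix, then append the target's own token.
    accepted: List[int] = []
    for tok in block:
        expected = target_next(ctx + accepted)
        if tok != expected:
            return accepted + [expected]
        accepted.append(tok)
    return accepted + [target_next(ctx + accepted)]

def decode(prompt: List[int], n_tokens: int, block_size: int = 8) -> Tuple[List[int], int]:
    out, cycles = list(prompt), 0
    while len(out) - len(prompt) < n_tokens:
        block = draft_block(out, block_size)   # one draft pass per cycle
        out.extend(verify_block(out, block))   # one verification pass per cycle
        cycles += 1
    return out[: len(prompt) + n_tokens], cycles
```

In this toy each cycle keeps five drafted tokens plus one correction, so `decode([1, 2, 3], 24)` finishes in 4 draft-verify cycles where plain autoregressive decoding would need 24 target passes. A sequential drafter would also spend `block_size` draft calls per cycle instead of one.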
At the 500-600 tokens/sec per-user range, DFlash serves more than 15× the throughput of autoregressive decoding. That is about 1.5× more than EAGLE-3 at the same point.
The table below shows the paper’s per-task speedups on Qwen3-8B at temperature 0 (Transformers backend).
| Task (Qwen3-8B, temp=0) | Baseline | EAGLE-3 (16) | DFlash (16) | DFlash τ |
|---|---|---|---|---|
| GSM8K | 1.00× | 1.94× | 5.15× | 6.54 |
| MATH-500 | 1.00× | 1.81× | 6.08× | 7.87 |
| AIME25 | 1.00× | 1.79× | 5.62× | 7.08 |
| HumanEval | 1.00× | 1.89× | 5.14× | 6.50 |
| MBPP | 1.00× | 1.69× | 4.65× | 5.95 |
| LiveCodeBench | 1.00× | 1.57× | 5.51× | 7.27 |
| MT-Bench | 1.00× | 1.63× | 2.75× | 4.24 |
| Average | 1.00× | 1.76× | 4.86× | 6.49 |

A separate NVIDIA Speed-Bench comparison measures interactivity speedups at matched concurrency. On gpt-oss-120b, DFlash averages 2.3× versus EAGLE-3’s 1.7×. On Llama 3.1 8B Instruct, DFlash averages 2.8× versus EAGLE-3’s 2.2×.
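One way to read the speedup and τ columns of the Qwen3-8B table together is a back-of-the-envelope estimate. The article does not define τ, so this sketch assumes it is the average number of tokens produced per draft-verify cycle, and that the baseline spends one target forward pass per token. Under those assumptions, τ divided by the speedup approximates the cost of one cycle in units of a single target pass. The function name `cycle_cost` is made up for this example.

```python
# (speedup, tau) pairs copied from the DFlash columns of the Qwen3-8B table.
ROWS = {
    "GSM8K": (5.15, 6.54),
    "MATH-500": (6.08, 7.87),
    "AIME25": (5.62, 7.08),
    "HumanEval": (5.14, 6.50),
    "MBPP": (4.65, 5.95),
    "LiveCodeBench": (5.51, 7.27),
    "MT-Bench": (2.75, 4.24),
}

def cycle_cost(speedup: float, tau: float) -> float:
    # Assumed model: speedup = tau / (cost of one draft-verify cycle,
    # measured in baseline target forward passes).
    return tau / speedup

if __name__ == "__main__":
    for task, (speedup, tau) in ROWS.items():
        print(f"{task:14s} cycle cost ~ {cycle_cost(speedup, tau):.2f} target passes")
```

For GSM8K this gives 6.54 / 5.15 ≈ 1.27, meaning that under this simplified model a full draft-plus-verify cycle costs only a little more than one ordinary target pass. It is a rough reading of the published numbers, not a figure reported by the paper.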
Use cases with examples
DFlash targets latency-sensitive serving where token-by-token generation hurts.
Why this matters
We see DFlash turning the serial bottleneck of autoregressive generation into a parallel draft of whole token blocks, delivering up to 15× the throughput of traditional decoding on NVIDIA’s Blackwell GPUs. For developers wrestling with latency‑heavy chain‑of‑thought prompts, the key detail is that the gain is measured at matched interactivity: at 500‑600 tokens per second per user, the same hardware serves roughly 15× the aggregate throughput, rather than making any single stream 15× faster. Yet the per‑task table above covers only Qwen3‑8B at temperature 0, and the Speed-Bench comparison adds just two more models, so it's unclear how well the same gains carry over to other model sizes or sampling temperatures.
Founders eyeing cost reductions should note the 1.5× edge over EAGLE‑3 at the same per‑user interactivity point, but the hardware‑specific nature of the benchmark suggests the advantage may shrink on older or non‑Blackwell accelerators. Researchers might appreciate the speculative decoding twist of drafting whole blocks instead of single tokens, but the verification step still relies on the large target model, leaving open questions about overall energy efficiency. In short, DFlash offers a compelling speedup for a narrow slice of the current inference stack; broader impact will depend on reproducibility across architectures and model families.
Further Reading
- DFlash Speculative Decoding Drafts Whole Token Blocks in Parallel for up to 15x Higher Throughput on NVIDIA Blackwell - MarkTechPost
- Boost Inference Performance up to 15x on NVIDIA Blackwell Using DFlash Speculative Decoding - NVIDIA Developer Blog
- DFlash: Block Diffusion for Flash Speculative Decoding - arXiv
- The next generation of speculative decoding: DFlash and Spec V2 - LMSYS Org
- DFlash Speculative Decoding Accelerates NVIDIA Blackwell - Hyper.ai