Google AI's MTP Drafters for Gemma 4 cut inference time up to threefold

Updated: 3 min read

Google’s latest Gemma 4 update promises up to three-times faster inference, and the company says output quality stays intact. The new Multi-Token Prediction (MTP) drafters are pitched as a way to cut the time spent on each generation cycle, which matters when every request exercises billions of parameters. But faster isn’t just about adding more GPU cores or cranking up clock speeds.

In practice, the real drag comes from how quickly the model’s parameters can be fed into the compute pipeline. If the data path stalls, the whole process stalls, no matter how powerful the hardware behind it. That’s why the engineering team zeroed in on memory-to-compute bandwidth rather than raw compute horsepower.

The result, they say, is a noticeable cut in latency without sacrificing output fidelity.

Every single token generation requires loading billions of model parameters from VRAM (video RAM) into compute units. The bottleneck is not the raw computing power of the GPU or processor, but the speed at which data can be transferred from memory to the compute units.

The consequence is a significant latency bottleneck: compute sits underutilized while the system is busy just moving data around.
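A rough back-of-the-envelope calculation makes the bottleneck concrete. The sketch below assumes that each generated token streams every weight from memory once, and it uses illustrative numbers (16-bit weights, 1 TB/s of memory bandwidth) that are assumptions rather than figures from Google’s announcement.

```python
def min_seconds_per_token(num_params: float,
                          bytes_per_param: float,
                          bandwidth_bytes_per_s: float) -> float:
    """Lower bound on per-token latency when all weights are read once per token."""
    return (num_params * bytes_per_param) / bandwidth_bytes_per_s


# Illustrative inputs only; none of these hardware figures come from the article.
params = 31e9          # a 31B-parameter model
bytes_per_param = 2    # assumption: 16-bit weights
bandwidth = 1e12       # assumption: 1 TB/s of memory bandwidth

latency = min_seconds_per_token(params, bytes_per_param, bandwidth)
print(f"{latency * 1000:.0f} ms per token, at most {1 / latency:.1f} tokens/s")
```

Under these assumptions the weights alone cap decoding at roughly 16 tokens per second, regardless of how much arithmetic throughput the chip has to spare.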

What makes this especially inefficient is that the model spends the same amount of computation on a trivially predictable token, such as “words” after “Actions speak louder than…”, as it does on a token that requires complex logical inference. Standard autoregressive decoding has no mechanism to exploit how easy or hard the next token is to predict.

What is Speculative Decoding?

Speculative decoding is the foundational technique that Gemma 4’s MTP drafters are built on. The technique decouples token generation from verification by pairing two models: a lightweight drafter and a heavy target model.

Here’s how the pipeline works in practice. The small, fast drafter model proposes several future tokens in rapid succession (a “draft” sequence) in less time than the large target model (e.g., Gemma 4 31B) takes to process even a single token. The target model then verifies all of the proposed tokens in parallel in a single forward pass. If the target model agrees with the draft, it accepts the entire sequence and generates one additional token of its own in the same pass. If it disagrees at some position, standard speculative decoding keeps the tokens before that point, substitutes the target model’s own token there, and discards the rest of the draft.
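The draft-then-verify loop can be sketched in a few lines. This is a minimal illustration of generic speculative decoding with greedy (exact-match) verification, not Google’s implementation: the toy models, the function names, and the example sentence are all hypothetical, and production systems typically accept or reject drafts by comparing probability distributions rather than single tokens.

```python
from typing import Callable, List

# A "model" here is any function that maps a token sequence to its next token.
NextToken = Callable[[List[str]], str]


def speculative_step(target: NextToken, drafter: NextToken,
                     context: List[str], k: int) -> List[str]:
    """Run one draft-then-verify round and return the newly accepted tokens."""
    # 1. The drafter proposes k tokens, one after another.
    draft: List[str] = []
    for _ in range(k):
        draft.append(drafter(context + draft))

    # 2. The target checks each drafted position. A real system scores all
    #    positions in one batched forward pass; this loop only emulates the result.
    accepted: List[str] = []
    for token in draft:
        expected = target(context + accepted)
        if token != expected:
            accepted.append(expected)  # first mismatch: keep the target's token
            return accepted            # and discard the rest of the draft
        accepted.append(token)

    # 3. The whole draft matched, so the target adds one extra token of its own.
    accepted.append(target(context + accepted))
    return accepted


# Hypothetical toy models: the target recites a fixed sentence and the
# drafter agrees with it everywhere except at position 5.
SENTENCE = "actions speak louder than words in most cases".split()


def toy_target(tokens: List[str]) -> str:
    return SENTENCE[len(tokens)] if len(tokens) < len(SENTENCE) else "<eos>"


def toy_drafter(tokens: List[str]) -> str:
    return "every" if len(tokens) == 5 else toy_target(tokens)


print(speculative_step(toy_target, toy_drafter, ["actions"], k=3))
print(speculative_step(toy_target, toy_drafter, SENTENCE[:4], k=3))
```

In the first call all three drafted tokens match, so the round yields four tokens for one target pass. In the second call the draft goes wrong at its second token, so the round yields two tokens. Either way the output is exactly what the target would have produced on its own, which is why this form of verification preserves quality.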

Why this matters

Is faster inference enough? Google’s new Multi-Token Prediction drafters promise up to three-fold speed gains for the Gemma 4 family, while the team asserts output quality stays intact. The bottleneck, as Google’s announcement explains, lies not in GPU compute power but in shuttling billions of parameters from VRAM to the compute units for each token.

By drafting several tokens and verifying them in a single pass of the target model, the architecture spreads that transfer cost across multiple tokens, at least in the benchmarks presented. Yet the claim rests on controlled tests; it remains unclear whether diverse production workloads will see the same proportional improvement. Moreover, the announcement does not address how memory bandwidth constraints might evolve with larger models or different hardware configurations.
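One reason results may vary by workload is that the gain depends on how often the drafter’s guesses are accepted. The sketch below uses the standard simplified analysis of speculative decoding, which assumes each drafted token is accepted independently with probability alpha. The acceptance rates, draft length, and drafter cost ratio are hypothetical inputs, not measurements of Gemma 4.

```python
def expected_tokens_per_pass(alpha: float, k: int) -> float:
    """Expected tokens produced per target pass with k drafted tokens.

    Simplifying assumption: each drafted token is accepted independently
    with probability alpha.
    """
    if alpha >= 1.0:
        return float(k + 1)  # every draft accepted, plus the target's bonus token
    return (1 - alpha ** (k + 1)) / (1 - alpha)


def estimated_speedup(alpha: float, k: int, draft_cost_ratio: float) -> float:
    """Rough speedup over plain decoding.

    draft_cost_ratio is the cost of one drafter step relative to one
    target pass (a hypothetical input).
    """
    return expected_tokens_per_pass(alpha, k) / (k * draft_cost_ratio + 1)


# Hypothetical acceptance rates with 4 drafted tokens and a drafter
# that costs 5% of a target pass.
for alpha in (0.5, 0.7, 0.9):
    tokens = expected_tokens_per_pass(alpha, 4)
    speedup = estimated_speedup(alpha, 4, 0.05)
    print(f"alpha={alpha}: {tokens:.2f} tokens/pass, ~{speedup:.2f}x speedup")
```

Under this model, text that the drafter predicts well sees a much larger gain than text it predicts poorly, so a headline figure measured on one benchmark need not carry over to a different mix of requests.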

In practice, developers will need to weigh the reported speedups against any integration overhead the drafters introduce. For now, the evidence points to a notable reduction in inference latency without obvious quality loss, but broader validation will determine how impactful the approach proves across real‑world applications.

Further Reading