

LLM Inference Slashed: Mask Token Hack Boosts Speed 3x

Researchers embed mask token in LLM weights to achieve 3× faster inference


Three‑times faster inference sounds impressive, but the trick behind it isn’t a new hardware accelerator or a massive model prune. Instead, the researchers turned to the model’s own embedding table—a component that already stores token vectors—and repurposed a dormant entry. By assigning that empty slot a special role, they can signal the model to skip the usual step‑by‑step token generation and evaluate several positions at once.

The approach works with any standard next‑token predictor, regardless of whether the architecture relies on mixtures of experts, windowed attention, or other internal tricks. It sidesteps speculative decoding entirely, reshaping the computation from a strictly sequential pipeline into a parallelizable operation. That shift is what enables the reported threefold speed gain, and it raises questions about how much of today’s inference bottleneck is baked into the model’s design rather than its execution.
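The core idea can be sketched in a few lines. This is a minimal illustration, not the authors' code: the `MASK_ID` constant, the `model_logits_fn` callable, and the greedy argmax readout are all assumptions, and a real adaptation would fine-tune the model so the mask embedding elicits useful predictions at masked positions.

```python
import numpy as np

MASK_ID = 50256  # hypothetical: an otherwise-unused slot in the embedding table

def parallel_decode_step(model_logits_fn, prompt_ids, k=4):
    """Append k mask tokens and read predictions for all k future positions
    from a single forward pass, instead of k sequential passes."""
    padded = list(prompt_ids) + [MASK_ID] * k
    logits = model_logits_fn(padded)  # shape: (len(padded), vocab_size)
    mask_positions = range(len(prompt_ids), len(padded))
    # Greedy readout at each masked position; a real system would also
    # track per-token confidence for the acceptance step described below.
    return [int(np.argmax(logits[p])) for p in mask_positions]
```

The key point is that the loop over positions reads from one set of logits rather than triggering one forward pass per token.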

The authors sum up the method concisely. By co-opting an unused slot in a model's existing embedding matrix to act as a mask token, the technique converts sequential operations into parallel ones. "Any standard next token prediction language model can be adapted in this way... the internal implementation -- MoE, windowed attention, SSM layers, etc. -- are left untouched and present no barrier to adaptation." For engineering teams, this means the adaptation can be applied to models already in production without rebuilding pipelines. Generating multiple tokens at once can still hurt response accuracy at inference time, however, so to maximize generation speed without sacrificing output quality, the authors introduce an adaptive decoding strategy called ConfAdapt.

ConfAdapt applies a confidence threshold, such as 90%, at each step. The model generates a block of tokens but keeps only those that meet or exceed the threshold. When the upcoming text is highly predictable or structural, the model's confidence is very high.
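In code, the acceptance rule might look like the following. This is a minimal sketch under stated assumptions: the paper's exact rule isn't quoted in the article, so the prefix-acceptance logic, the single-token fallback, and the function name are illustrative.

```python
def confadapt_accept(block_tokens, block_confidences, threshold=0.9):
    """Keep the longest prefix of a speculated block whose per-token
    confidence stays at or above the threshold; always keep at least
    one token so decoding makes progress."""
    accepted = []
    for tok, conf in zip(block_tokens, block_confidences):
        if conf >= threshold:
            accepted.append(tok)
        else:
            break  # stop at the first low-confidence token
    if not accepted:
        # Fall back to a single standard autoregressive token.
        accepted = block_tokens[:1]
    return accepted
```

With the 90% threshold mentioned above, a highly predictable block is accepted wholesale, while an uncertain block degrades gracefully to one-token-at-a-time decoding.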

The model then accepts and outputs a large chunk of tokens at once, saving significant compute on easy tokens, and reserves its costly single-token passes for harder tokens that require more effort.

Putting multi-token prediction to the test

To see how the training paradigm performed in practice, the researchers applied their method to popular open-weight instruction-tuned models.

They tested the strong general-purpose model Llama-3.1-8B-Magpie and the smaller, efficient Qwen3-4B-Instruct-2507, which is often chosen for cost-sensitive enterprise deployments. Both models were tuned on MetaMathQA, a dataset of synthetic grade school math problems that rely heavily on reasoning traces. The experiments revealed a clear sweet spot between speed and accuracy.

Can a single token really cut latency? The University of Maryland team and partners claim a three‑fold boost by repurposing an unused embedding slot as a mask token. By turning sequential predictions into parallel steps, the method sidesteps the extra drafting model that speculative decoding demands.

No new hardware or separate model is required; the change lives inside the existing weight matrix. Any standard next‑token language model, regardless of MoE or windowed attention variants, could be retrofitted, according to the authors, without altering the core inference pipeline. Yet the report leaves open whether accuracy or downstream task performance suffers under the new regime.

No benchmarks beyond throughput are presented, and the impact on memory footprint remains unclear. The approach is elegant in its simplicity, but practical adoption will depend on how well it integrates with established pipelines. Until broader testing confirms stability across diverse workloads, the claimed speedup should be treated as a promising proof of concept rather than a definitive solution.

Further Reading

Common Questions Answered

How does FastMTP improve LLM inference performance?

FastMTP accelerates LLM inference by fine-tuning a multi-token prediction (MTP) head with position-shared weights, enabling it to capture dependencies among consecutive future tokens. The method achieves an average 2.03× speedup compared to standard next token prediction, outperforming vanilla MTP by 82% while maintaining output quality.
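The position-shared idea can be illustrated roughly as follows. This is a hypothetical sketch, not FastMTP's actual implementation: the single `shared_head` matrix reused at every future offset, the greedy argmax, and the embedding feedback loop are all assumptions made for illustration.

```python
import numpy as np

def mtp_draft(hidden, shared_head, embed, steps=4):
    """Recursively apply one position-shared MTP head to draft several
    future tokens from the current hidden state (illustrative sketch)."""
    drafts, h = [], hidden
    for _ in range(steps):
        logits = shared_head @ h   # the same weights are reused at every offset
        tok = int(np.argmax(logits))
        drafts.append(tok)
        h = embed[tok]             # feed the drafted token's embedding back in
    return drafts
```

Sharing one set of head weights across offsets keeps the parameter overhead small while still letting the head model dependencies among consecutive future tokens.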

What makes FastMTP different from existing speculative decoding approaches?

Unlike traditional speculative decoding methods, FastMTP integrates language-aware dynamic vocabulary compression into the MTP head to reduce computational overhead during the drafting process. The approach requires only lightweight training and can seamlessly integrate with existing inference frameworks, offering a practical solution for accelerating LLM inference.
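Vocabulary compression during drafting can be sketched like this. The function below is an illustrative assumption, not FastMTP's code: it simply restricts the draft head's scoring to a precomputed candidate subset of the vocabulary rather than the full softmax.

```python
import numpy as np

def draft_with_compressed_vocab(hidden, head_weight, candidate_ids):
    """Score only a compressed candidate subset of the vocabulary during
    drafting, instead of the full output matrix (illustrative sketch)."""
    sub_w = head_weight[candidate_ids]  # (|candidates|, d): rows for candidates
    scores = sub_w @ hidden             # (|candidates|,)
    return candidate_ids[int(np.argmax(scores))]
```

Because the matrix multiply runs over the candidate rows only, the drafting cost scales with the compressed vocabulary size rather than the full one; the trade-off is that tokens outside the candidate set can never be drafted.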

Why is autoregressive token generation a bottleneck for large language models?

Current LLMs generate text sequentially, producing only one token per forward pass, which means the overall generation time scales linearly with sequence length. This becomes particularly problematic for scenarios requiring extensive generation, such as complex reasoning tasks that involve generating long chain-of-thought explanations.
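The linear-scaling point can be made concrete with a little arithmetic. This back-of-the-envelope sketch is not from the paper; the average-acceptance parameter is an assumption used only to show how multi-token acceptance divides the pass count.

```python
import math

def forward_passes(seq_len, avg_accepted_per_pass=1.0):
    """Forward passes needed to emit seq_len tokens when each pass
    yields avg_accepted_per_pass tokens on average."""
    return math.ceil(seq_len / avg_accepted_per_pass)

# Standard decoding: one token per pass, so passes grow linearly with length.
baseline = forward_passes(1024)
# If multi-token acceptance averages three tokens per pass, the pass count
# drops by roughly the same threefold factor reported for the mask-token method.
accelerated = forward_passes(1024, 3.0)
```

Since wall-clock latency is dominated by the number of forward passes, cutting passes threefold is what translates directly into the reported speedup.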