DeepSeek-V4: 1M Token AI Model Redefines Context Limits
DeepSeek AI unveils DeepSeek‑V4 with compressed attention for 1M‑token contexts
DeepSeek AI’s latest release, DeepSeek‑V4, pushes the limits of open‑source language modeling by targeting a one‑million‑token context window. The model’s architecture hinges on two attention techniques: compressed sparse attention and a more radical variant dubbed heavily compressed attention (HCA). Both aim to tame the quadratic growth of the key‑value (KV) cache that traditionally caps context length.
While compressed sparse attention trims the cache by selecting a subset of tokens, the heavier approach promises even tighter storage without sacrificing the ability to attend densely across the remaining representations. This shift matters because the KV cache has been the primary bottleneck in scaling context windows, and any reduction directly translates into lower memory footprints and faster inference. Understanding how the newer method restructures KV entries is essential before judging its practical impact.
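To see why the KV cache is the bottleneck, a back-of-envelope estimate helps. The sketch below sizes the cache for a hypothetical dense-attention model at one million tokens; the layer count, KV-head count, head dimension, precision, and compression ratio are all illustrative assumptions, not published DeepSeek-V4 figures.

```python
# Back-of-envelope KV-cache size for a long context. All model dimensions
# below are illustrative assumptions, not published DeepSeek-V4 figures.
def kv_cache_bytes(num_tokens, num_layers, num_kv_heads, head_dim, bytes_per_elem=2):
    # Each token stores one key vector and one value vector per layer per KV head.
    return 2 * num_tokens * num_layers * num_kv_heads * head_dim * bytes_per_elem

# Hypothetical dense-attention model: 64 layers, 8 KV heads, head_dim 128, FP16.
dense = kv_cache_bytes(1_000_000, 64, 8, 128)
print(f"dense 1M-token cache: {dense / 2**30:.1f} GiB")       # 244.1 GiB

# Consolidating every m' tokens into one entry divides the token count by m'.
m_prime = 16
compressed = kv_cache_bytes(1_000_000 // m_prime, 64, 8, 128)
print(f"compressed (m'={m_prime}): {compressed / 2**30:.1f} GiB")  # 15.3 GiB
```

Even under these modest assumptions, a dense 1M-token cache runs to hundreds of gigabytes, which is why reducing per-token KV storage translates directly into feasibility.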
The following excerpt explains the mechanics behind that aggressive compression strategy.
HCA is more aggressive: it consolidates the KV entries of every m′ tokens (where m′ ≫ m) into a single compressed entry, then applies dense attention over those representations. No sparse selection step is needed; the compression ratio itself reduces KV cache size. In the one-million-token setting, DeepSeek-V4-Pro requires only 27% of the single-token inference FLOPs (in equivalent FP8 FLOPs) and 10% of the KV cache size of DeepSeek-V3.2.
DeepSeek-V4-Flash achieves 10% of single-token FLOPs and 7% of the KV cache relative to DeepSeek-V3.2.

Manifold-Constrained Hyper-Connections (mHC)

DeepSeek-V4 replaces conventional residual connections with Manifold-Constrained Hyper-Connections (mHC).
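The preview names mHC but does not describe its mechanics. As a heavily hedged sketch of the general hyper-connection idea it builds on (several parallel residual streams mixed by learned weights around each layer), with the unspecified "manifold constraint" stood in by a simple row normalization:

```python
import numpy as np

# Schematic sketch only. A plain residual computes x + f(x); a hyper-connection
# keeps n parallel residual streams and mixes them with learned weights around
# each layer. The actual mHC constraint is not described in the preview; the
# row normalization below is a stand-in assumption, not DeepSeek's design.
def hyper_connection(streams, f, beta, alpha):
    # streams: (n, d) residual streams; beta: (n,) read weights; alpha: (n, n) mix.
    alpha = alpha / alpha.sum(axis=1, keepdims=True)  # stand-in "constraint"
    layer_in = beta @ streams            # read one (d,) input for the layer
    layer_out = f(layer_in)              # the wrapped layer (attention/MLP/...)
    # Mix the streams and broadcast the layer output back into them.
    return alpha @ streams + np.outer(beta, layer_out)

rng = np.random.default_rng(1)
streams = rng.standard_normal((4, 8))    # n = 4 streams, width d = 8
beta = np.ones(4) / 4
alpha = np.eye(4) + 0.1
new_streams = hyper_connection(streams, np.tanh, beta, alpha)  # tanh as placeholder layer
print(new_streams.shape)  # (4, 8)
```

With n = 1, beta = [1], and alpha = [[1]], this reduces to the ordinary residual x + f(x), which is the sense in which hyper-connections generalize residual connections.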
Will a million‑token window become practical for everyday use? DeepSeek‑AI’s preview of the V4 series suggests it might, at least in controlled settings. Both V4‑Pro and V4‑Flash support a native one‑million‑token context, yet they differ sharply in scale: V4‑Pro carries 1.6 trillion total parameters and activates roughly 49 billion per token, while V4‑Flash trims total parameters to 284 billion with 13 billion activations per token.
The key appears to be Heavily Compressed Attention (HCA), which “consolidates KV entries of every m′ tokens—where m′ ≫ m—into a single compressed entry, then applies dense attention over those representations.” By skipping a sparse selection step, the compression ratio directly shrinks the KV cache, a notable efficiency gain for long contexts. However, the release is a preview; performance metrics beyond the compression claim are not disclosed, leaving it unclear whether latency or quality will suffer at such lengths. The models’ affordability at inference time remains to be verified in broader deployments.
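The quoted consolidation step can be sketched in a few lines of numpy. Mean-pooling each block of m′ tokens is an assumption here (the preview does not specify the consolidation function), and all shapes are illustrative:

```python
import numpy as np

# Minimal sketch of HCA-style attention as described in the excerpt:
# consolidate the KV entries of every m' tokens into a single entry, then
# run ordinary dense attention over the compressed sequence. Mean-pooling
# is an assumed stand-in for the unspecified consolidation function.
def hca_attention(q, k, v, m_prime):
    T, d = k.shape
    assert T % m_prime == 0, "sketch assumes T divisible by m'"
    # Consolidate: (T, d) -> (T // m', d), one entry per block of m' tokens.
    k_c = k.reshape(T // m_prime, m_prime, d).mean(axis=1)
    v_c = v.reshape(T // m_prime, m_prime, d).mean(axis=1)
    # Dense softmax attention over the compressed entries; no sparse selection.
    scores = q @ k_c.T / np.sqrt(d)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v_c

rng = np.random.default_rng(0)
q = rng.standard_normal((4, 64))          # 4 query tokens
k = rng.standard_normal((1024, 64))       # 1024-token KV cache
v = rng.standard_normal((1024, 64))
out = hca_attention(q, k, v, m_prime=16)  # attends over only 64 entries
print(out.shape)  # (4, 64)
```

The dense attention is now over T/m′ entries instead of T, so both the KV storage and the attention FLOPs shrink by the compression ratio, matching the mechanism the excerpt describes.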
For now, DeepSeek‑AI offers a concrete technical approach to the one‑million‑token challenge, though practical impact remains uncertain.
Common Questions Answered
How does DeepSeek-V4 achieve a one-million-token context window?
DeepSeek-V4 uses two innovative attention techniques: compressed sparse attention and heavily compressed attention (HCA). HCA consolidates the key-value entries of every m′ tokens into a single compressed entry, dramatically reducing the key-value cache size and enabling much longer context windows.
What are the performance differences between DeepSeek-V4-Pro and DeepSeek-V4-Flash?
DeepSeek-V4-Pro has 1.6 trillion total parameters with approximately 49 billion activated per token, while DeepSeek-V4-Flash has 284 billion total parameters with 13 billion activations per token. Both models support a native one-million-token context, but differ significantly in scale and computational requirements.
What problem does Heavily Compressed Attention (HCA) solve in language models?
Heavily Compressed Attention (HCA) addresses the quadratic growth of the key-value cache that traditionally limits context length in language models. By consolidating groups of token entries and shrinking the cache, HCA lets models like DeepSeek-V4 reach dramatically longer context windows at a fraction of the inference FLOPs and KV cache size of their predecessors.