DeepSeek-V4: 1M Token AI Model Redefines Context Limits
DeepSeek AI unveils DeepSeek‑V4 with compressed attention for 1M‑token contexts
DeepSeek AI’s latest release, DeepSeek‑V4, pushes the limits of open‑source language modeling by targeting a one‑million‑token context window. The model’s architecture hinges on two attention techniques: compressed sparse attention and a more radical variant dubbed heavily compressed attention (HCA). Both aim to tame the quadratic growth of the key‑value (KV) cache that traditionally caps context length.
While compressed sparse attention trims the cache by selecting a subset of tokens, the heavier approach promises even tighter storage without sacrificing the ability to attend densely across the remaining representations. This shift matters because the KV cache has been the primary bottleneck in scaling context windows, and any reduction directly translates into lower memory footprints and faster inference. Understanding how the newer method restructures KV entries is essential before judging its practical impact.
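To see why the KV cache is the bottleneck, a back-of-envelope estimate helps. The sketch below sizes the cache for a hypothetical dense-attention model at one million tokens; the layer count, KV-head count, head dimension, precision, and compression ratio are all illustrative assumptions, not published DeepSeek-V4 figures.

```python
# Back-of-envelope KV-cache size for a long context. All model dimensions
# below are illustrative assumptions, not published DeepSeek-V4 figures.
def kv_cache_bytes(num_tokens, num_layers, num_kv_heads, head_dim, bytes_per_elem=2):
    # Each token stores one key vector and one value vector per layer per KV head.
    return 2 * num_tokens * num_layers * num_kv_heads * head_dim * bytes_per_elem

# Hypothetical dense-attention model: 64 layers, 8 KV heads, head_dim 128, FP16.
dense = kv_cache_bytes(1_000_000, 64, 8, 128)
print(f"dense 1M-token cache: {dense / 2**30:.1f} GiB")       # 244.1 GiB

# Consolidating every m' tokens into one entry divides the token count by m'.
m_prime = 16
compressed = kv_cache_bytes(1_000_000 // m_prime, 64, 8, 128)
print(f"compressed (m'={m_prime}): {compressed / 2**30:.1f} GiB")  # 15.3 GiB
```

Even under these modest assumptions, a dense 1M-token cache runs to hundreds of gigabytes, which is why reducing per-token KV storage translates directly into feasibility.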
The following excerpt explains the mechanics behind that aggressive compression strategy.
HCA is more aggressive: it consolidates the KV entries of every m′ tokens (where m′ ≫ m) into a single compressed entry, then applies dense attention over those representations. No sparse selection step is needed; the compression ratio itself reduces KV cache size. In the one-million-token setting, DeepSeek-V4-Pro requires only 27% of the single-token inference FLOPs (in equivalent FP8 FLOPs) and 10% of the KV cache size of DeepSeek-V3.2.
DeepSeek-V4-Flash achieves 10% of single-token FLOPs and 7% of the KV cache relative to DeepSeek-V3.2.

Manifold-Constrained Hyper-Connections (mHC)

DeepSeek-V4 replaces conventional residual connections with Manifold-Constrained Hyper-Connections (mHC).
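The preview names mHC but does not describe its mechanics. As a heavily hedged sketch of the general hyper-connection idea it builds on (several parallel residual streams mixed by learned weights around each layer), with the unspecified "manifold constraint" stood in by a simple row normalization:

```python
import numpy as np

# Schematic sketch only. A plain residual computes x + f(x); a hyper-connection
# keeps n parallel residual streams and mixes them with learned weights around
# each layer. The actual mHC constraint is not described in the preview; the
# row normalization below is a stand-in assumption, not DeepSeek's design.
def hyper_connection(streams, f, beta, alpha):
    # streams: (n, d) residual streams; beta: (n,) read weights; alpha: (n, n) mix.
    alpha = alpha / alpha.sum(axis=1, keepdims=True)  # stand-in "constraint"
    layer_in = beta @ streams            # read one (d,) input for the layer
    layer_out = f(layer_in)              # the wrapped layer (attention/MLP/...)
    # Mix the streams and broadcast the layer output back into them.
    return alpha @ streams + np.outer(beta, layer_out)

rng = np.random.default_rng(1)
streams = rng.standard_normal((4, 8))    # n = 4 streams, width d = 8
beta = np.ones(4) / 4
alpha = np.eye(4) + 0.1
new_streams = hyper_connection(streams, np.tanh, beta, alpha)  # tanh as placeholder layer
print(new_streams.shape)  # (4, 8)
```

With n = 1, beta = [1], and alpha = [[1]], this reduces to the ordinary residual x + f(x), which is the sense in which hyper-connections generalize residual connections.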
Will a million‑token window become practical for everyday use? DeepSeek‑AI’s preview of the V4 series suggests it might, at least in controlled settings. Both V4‑Pro and V4‑Flash support a native one‑million‑token context, yet they differ sharply in scale: V4‑Pro carries 1.6 trillion total parameters and activates roughly 49 billion per token, while V4‑Flash trims total parameters to 284 billion with 13 billion activations per token.
The key appears to be Heavily Compressed Attention (HCA), which “consolidates KV entries of every m′ tokens—where m′ ≫ m—into a single compressed entry, then applies dense attention over those representations.” By skipping a sparse selection step, the compression ratio directly shrinks the KV cache, a notable efficiency gain for long contexts. However, the release is a preview; performance metrics beyond the compression claim are not disclosed, leaving it unclear whether latency or quality will suffer at such lengths. The models’ affordability at inference time remains to be verified in broader deployments.
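The quoted consolidation step can be sketched in a few lines of numpy. Mean-pooling each block of m′ tokens is an assumption here (the preview does not specify the consolidation function), and all shapes are illustrative:

```python
import numpy as np

# Minimal sketch of HCA-style attention as described in the excerpt:
# consolidate the KV entries of every m' tokens into a single entry, then
# run ordinary dense attention over the compressed sequence. Mean-pooling
# is an assumed stand-in for the unspecified consolidation function.
def hca_attention(q, k, v, m_prime):
    T, d = k.shape
    assert T % m_prime == 0, "sketch assumes T divisible by m'"
    # Consolidate: (T, d) -> (T // m', d), one entry per block of m' tokens.
    k_c = k.reshape(T // m_prime, m_prime, d).mean(axis=1)
    v_c = v.reshape(T // m_prime, m_prime, d).mean(axis=1)
    # Dense softmax attention over the compressed entries; no sparse selection.
    scores = q @ k_c.T / np.sqrt(d)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v_c

rng = np.random.default_rng(0)
q = rng.standard_normal((4, 64))          # 4 query tokens
k = rng.standard_normal((1024, 64))       # 1024-token KV cache
v = rng.standard_normal((1024, 64))
out = hca_attention(q, k, v, m_prime=16)  # attends over only 64 entries
print(out.shape)  # (4, 64)
```

The dense attention is now over T/m′ entries instead of T, so both the KV storage and the attention FLOPs shrink by the compression ratio, matching the mechanism the excerpt describes.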
For now, DeepSeek‑AI offers a concrete technical approach to the one‑million‑token challenge, though practical impact remains uncertain.
Common Questions Answered
How does DeepSeek-V4 achieve a one-million-token context window?
DeepSeek-V4 uses two innovative attention techniques: compressed sparse attention and heavily compressed attention (HCA). HCA consolidates the key-value entries of every m′ tokens into a single compressed entry, dramatically reducing the key-value cache size and enabling much longer context windows.
What are the performance differences between DeepSeek-V4-Pro and DeepSeek-V4-Flash?
DeepSeek-V4-Pro has 1.6 trillion total parameters with approximately 49 billion activated per token, while DeepSeek-V4-Flash has 284 billion total parameters with 13 billion activations per token. Both models support a native one-million-token context, but differ significantly in scale and computational requirements.
What problem does Heavily Compressed Attention (HCA) solve in language models?
Heavily Compressed Attention (HCA) addresses the quadratic growth of the key-value cache that traditionally limits context length in language models. By consolidating groups of token entries and shrinking the cache, HCA lets models like DeepSeek-V4 reach dramatically longer context windows at a fraction of the inference FLOPs and KV cache size of their predecessors.