TurboQuant and OSCAR vie in KV cache compression race at ICLR 2026
Why does this matter? Because long-context LLMs run into a memory bottleneck that has nothing to do with model weights. To avoid recomputing attention inputs for earlier tokens, a transformer caches a key and a value vector for every token at every layer. That KV cache grows linearly with sequence length and batch size, and at high concurrency it can outgrow the model itself.
Take Llama-3.1-70B in BF16. Its cache costs roughly 0.31 MiB per token: 80 layers × 8 KV heads × 128 dimensions per head × 2 tensors (key and value) × 2 bytes each = 327,680 bytes. At 128K tokens that comes to about 40 GiB; push to a million tokens and the cache swallows more than 300 GB, dwarfing the 140 GB of weights.
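The per-token figure is a plain product of the model's shape parameters, so it is easy to check. The sketch below reproduces the arithmetic; the function and parameter names are illustrative, and the defaults are the Llama-3.1-70B values quoted above.

```python
def kv_bytes_per_token(layers=80, kv_heads=8, head_dim=128, bytes_per_elem=2):
    """Bytes of KV cache added per token: keys and values for every layer."""
    tensors = 2  # one key tensor and one value tensor
    return layers * kv_heads * head_dim * tensors * bytes_per_elem


def kv_cache_bytes(tokens, batch=1, **shape):
    """Total cache size; grows linearly in both tokens and batch size."""
    return tokens * batch * kv_bytes_per_token(**shape)


GIB = 2 ** 30
per_token = kv_bytes_per_token()
print(per_token)                          # 327680 bytes, about 0.31 MiB
print(kv_cache_bytes(128 * 1024) / GIB)   # 40.0 (GiB)
print(kv_cache_bytes(1_000_000) / 1e9)    # 327.68 (GB)
```

Passing `batch=8` multiplies every figure by eight, which is how the cache comes to outsize the weights under concurrent load.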
Generating each new token requires reading the entire cache from high-bandwidth memory (HBM), which turns decoding into a memory-bandwidth problem rather than a compute one. Shrinking the KV cache is therefore the most direct lever for cutting cost and latency.
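A back-of-the-envelope bound makes the bandwidth argument concrete: if every decode step has to stream the cache once, the time per token cannot drop below cache size divided by memory bandwidth. The sketch below assumes a bandwidth of 3.35 TB/s purely for illustration; the constant and the function name are not from the article, and real systems shard the cache across devices.

```python
HBM_BYTES_PER_S = 3.35e12  # assumed memory bandwidth, for illustration only


def cache_read_seconds(cache_bytes, bandwidth=HBM_BYTES_PER_S):
    """Lower bound on per-token time spent streaming the KV cache once."""
    return cache_bytes / bandwidth


full = cache_read_seconds(40 * 2 ** 30)            # 40 GiB cache
compressed = cache_read_seconds(40 * 2 ** 30 / 4)  # same cache at 4x compression
print(f"{full * 1e3:.1f} ms vs {compressed * 1e3:.1f} ms per token")
```

Because the bound is linear in cache size, any compression ratio carries over directly to this floor on decode latency.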
Current work clusters into five families: token eviction (H2O, SnapKV), quantization (KIVI, GEAR), low-rank projection (Palu), merging (KVMerger) and architectural sharing (MLA). Within the quantization family, two 2026 entries push toward ultra-low bit widths.
Google and NYU’s TurboQuant (ICLR 2026) and Together AI’s OSCAR attack the same problem from opposite directions, while Apple’s EpiCache tackles a problem neither one addresses.
Most KV quantizers are fighting the same underlying enemy: outlier channels, a handful of channels with disproportionately large magnitudes that dominate the quantization range and squeeze the rest of the signal into just a few representable levels. This is why naive INT2 quantization, with only four levels, collapses to near-zero accuracy.
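A toy example shows the collapse. The sketch below uses plain min-max uniform quantization, which is an assumption about what "naive" means here, on four small values with and without a single large outlier; the helper name is hypothetical.

```python
def quantize_codes(values, bits):
    """Min-max uniform quantization: map each value to an integer code."""
    levels = 2 ** bits - 1
    lo, hi = min(values), max(values)
    scale = (hi - lo) / levels or 1.0  # fall back to 1.0 for constant input
    return [round((v - lo) / scale) for v in values]


signal = [-0.3, -0.1, 0.1, 0.3]
codes_clean = quantize_codes(signal, bits=2)
codes_outlier = quantize_codes(signal + [30.0], bits=2)

print(codes_clean)    # [0, 1, 2, 3]: every value gets its own level
print(codes_outlier)  # [0, 0, 0, 0, 3]: the outlier takes the top level, the rest collapse
```

With the outlier present, the step size jumps from 0.2 to 10.1, so the four small values become indistinguishable after quantization.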
KIVI established the standard baseline here. It showed that key vectors have fixed outlier channels across tokens while value vectors do not, so it quantizes keys per-channel and values per-token. That tuning-free 2-bit recipe cuts end-to-end peak memory (weights included) by about 2.6×, and it is the reference point the newer methods build on.
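The grouping choice is the whole trick, and a small sketch can show why it helps for keys. The code below is a toy illustration of per-channel versus per-token grouping only, not KIVI's implementation: the matrix, the helper names and the min-max quantizer are all assumptions made for the example.

```python
def quant_dequant(values, bits=2):
    """Min-max uniform quantization of one group, returned dequantized."""
    levels = 2 ** bits - 1
    lo, hi = min(values), max(values)
    scale = (hi - lo) / levels or 1.0  # fall back to 1.0 for constant input
    return [lo + round((v - lo) / scale) * scale for v in values]


def per_token(matrix, bits=2):
    """One quantization group per row (token)."""
    return [quant_dequant(row, bits) for row in matrix]


def per_channel(matrix, bits=2):
    """One quantization group per column (channel)."""
    cols = [quant_dequant(list(col), bits) for col in zip(*matrix)]
    return [list(row) for row in zip(*cols)]


def sq_error(a, b):
    return sum((x - y) ** 2 for ra, rb in zip(a, b) for x, y in zip(ra, rb))


# Toy keys: rows are tokens, columns are channels; channel 0 is a fixed outlier.
keys = [
    [20.0, 0.1, -0.2, 0.3],
    [21.0, -0.3, 0.2, 0.0],
    [19.0, 0.2, 0.1, -0.1],
    [22.0, 0.0, -0.1, 0.2],
]
err_token = sq_error(keys, per_token(keys))
err_channel = sq_error(keys, per_channel(keys))
print(err_channel < err_token)  # True: the outlier no longer sets every group's range
```

Grouping by channel confines the outlier to its own group, so the remaining channels keep a fine step size. When no channel is a consistent outlier, as the article says of values, that advantage disappears.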
TurboQuant: data-oblivious and theoretically optimal
TurboQuant handles outliers without ever looking at your data, in two stages:
- Stage one: each vector is randomly rotated so its coordinates become nearly independent and approximately Gaussian, which lets an optimal precomputed scalar (Lloyd-Max) quantizer be applied per coordinate.
- Stage two: a 1-bit Quantized Johnson-Lindenstrauss (QJL) transform is applied to the residual, giving a provably unbiased estimate of attention logits with no normalization-constant overhead.
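The two stages can be sketched in a few dozen lines of dependency-free Python. This is an illustration under stated assumptions, not the paper's implementation: the rotation is built by Gram-Schmidt on Gaussian vectors, the 2-bit codebook is the textbook Lloyd-Max table for a unit Gaussian, the bit split between the stages is arbitrary, and the projection count `m` and all function names are hypothetical.

```python
import math
import random


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def random_rotation(d, rng):
    """Random orthogonal matrix via Gram-Schmidt on Gaussian rows."""
    rows = []
    for _ in range(d):
        v = [rng.gauss(0.0, 1.0) for _ in range(d)]
        for u in rows:  # remove the components along earlier rows
            c = dot(v, u)
            v = [a - c * b for a, b in zip(v, u)]
        norm = math.sqrt(dot(v, v))
        rows.append([a / norm for a in v])
    return rows


# 2-bit Lloyd-Max centroids for a unit-variance Gaussian (textbook values).
LM2 = (-1.510, -0.4528, 0.4528, 1.510)


def stage_one(x, rot):
    """Rotate, snap each coordinate to the nearest centroid, rotate back.
    A real cache would store the 2-bit codes and the norm; assumes x != 0."""
    sigma = math.sqrt(dot(x, x) / len(x))  # coordinate scale after rotation
    y = [dot(row, x) for row in rot]
    y_hat = [sigma * min(LM2, key=lambda c: abs(v / sigma - c)) for v in y]
    return [dot(col, y_hat) for col in zip(*rot)]  # apply the transpose


def qjl_inner(q, r, m, rng):
    """Estimate <q, r> from one sign bit per Gaussian projection of r."""
    acc = 0.0
    for _ in range(m):
        s = [rng.gauss(0.0, 1.0) for _ in range(len(r))]
        bit = 1.0 if dot(s, r) >= 0 else -1.0  # the only thing stored per row
        acc += dot(s, q) * bit
    return math.sqrt(math.pi / 2) * math.sqrt(dot(r, r)) * acc / m


rng = random.Random(0)
d = 64
x = [rng.gauss(0.0, 0.1) for _ in range(d)]
x[3] = 10.0  # one outlier coordinate
q = [rng.gauss(0.0, 1.0) for _ in range(d)]

x_hat = stage_one(x, random_rotation(d, rng))
residual = [a - b for a, b in zip(x, x_hat)]
rel_mse = dot(residual, residual) / dot(x, x)
true_logit = dot(q, x)
estimate = dot(q, x_hat) + qjl_inner(q, residual, m=2048, rng=rng)
print(rel_mse, true_logit, estimate)
```

Nothing in the sketch inspects the data beyond the vector's norm, which is the sense in which the method is data-oblivious. The rotation spreads the outlier's energy across all coordinates so a fixed codebook fits, and the sign-bit stage corrects the inner product that the first stage leaves slightly off.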
The selling point is theoretical: TurboQuant’s distortion is provably within a small constant factor (≈ 2.7×) of the information-theoretic lower bound. In practice it reaches essentially full-precision recall on Needle-in-a-Haystack at 4× compression, and the paper reports absolute quality neutrality at 3.5 bits and only marginal degradation at 2.5 bits per channel.
Why this matters
Can we finally run LLMs with truly long contexts without exhausting GPU memory? The KV-cache bottleneck, separate from model weights, still limits practical deployment, especially as batch sizes grow. TurboQuant from Google and NYU and OSCAR from Together AI approach ultra-low-bit quantization from opposite directions, while Apple's EpiCache targets a problem neither one addresses.
For developers, this means an emerging choice set: quantized caches, algorithmic pruning, or hardware‑aware tricks. Founders will need to weigh integration effort against potential cost savings, because the cache can dwarf the model’s own footprint in high‑concurrency scenarios. Researchers gain fresh data points on how compression impacts decoding latency and accuracy, yet the article offers no clear benchmark hierarchy.
It remains unclear whether one approach will dominate or whether hybrid solutions will emerge. We should watch early adopters for real‑world performance numbers before committing large‑scale resources. In short, the race introduces viable paths forward, but practical superiority is still uncertain.