OmniMem adds modality-aware memory allocation for audio‑visual LLMs
Audio‑visual large language models promise to decode hours‑long video, but their inference cost climbs with every extra frame and sound snippet. The culprit? A linear swell of token counts and the accompanying key‑value cache that quickly outgrows available memory.
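The linear growth can be made concrete with a back-of-the-envelope estimate. The sketch below uses the standard key-value cache size formula for a transformer decoder; the layer count, head count, head dimension, and token rate are hypothetical placeholders chosen for illustration, not figures from the article or from any particular model.

```python
def kv_cache_bytes(num_tokens, num_layers, num_kv_heads, head_dim, bytes_per_value=2):
    """Bytes needed to cache keys and values for num_tokens tokens."""
    # The leading 2 accounts for one key tensor and one value tensor per layer.
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_value * num_tokens


# Hypothetical configuration (assumption, not taken from the article).
LAYERS, KV_HEADS, HEAD_DIM = 28, 4, 128
TOKENS_PER_SECOND = 100  # assumed combined audio + visual token rate

for minutes in (1, 10, 60):
    tokens = TOKENS_PER_SECOND * 60 * minutes
    gib = kv_cache_bytes(tokens, LAYERS, KV_HEADS, HEAD_DIM) / 2**30
    print(f"{minutes:>3} min -> {tokens:>7} tokens, {gib:.2f} GiB of KV cache")
```

Because every term except `num_tokens` is fixed by the model, doubling the video length doubles the cache, which is why a fixed memory budget eventually forces some form of compression.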
OmniMem steps in as a streaming‑oriented framework built for these multimodal systems. Instead of lumping every token together, it splits storage between visual and auditory streams, directly confronting the disproportionate token loads each modality brings. The system also watches how sensitive the cache is to small changes, pruning entries that add little new information while keeping the rest intact.
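To illustrate why separate budgets matter, here is a minimal sketch that compares a single global top-k selection with a per-modality split. The `Entry` type, the `visual_share` parameter, and the scores are all hypothetical; the article does not say how OmniMem sizes each budget or scores entries, so this shows only the general idea of reserving slots per modality.

```python
from dataclasses import dataclass


@dataclass
class Entry:
    modality: str   # "visual" or "audio"
    position: int   # token index in the stream
    score: float    # importance score; how it is computed is left open here


def split_budget(total_slots, visual_share):
    """Reserve a fixed share of slots for visual entries, the rest for audio."""
    visual = round(total_slots * visual_share)
    return {"visual": visual, "audio": total_slots - visual}


def select(entries, budgets):
    """Keep the top-scoring entries within each modality's own budget."""
    kept = []
    for modality, slots in budgets.items():
        pool = [e for e in entries if e.modality == modality]
        pool.sort(key=lambda e: e.score, reverse=True)
        kept.extend(pool[:slots])
    return sorted(kept, key=lambda e: e.position)  # restore stream order


# Eight visual entries with high scores, two audio entries with lower scores.
entries = [Entry("visual", i, 0.9 - 0.05 * i) for i in range(8)]
entries += [Entry("audio", 8, 0.3), Entry("audio", 9, 0.2)]

uniform = sorted(entries, key=lambda e: e.score, reverse=True)[:4]
aware = select(entries, split_budget(4, visual_share=0.5))

print([e.modality for e in uniform])  # global top-k: audio is crowded out
print([e.modality for e in aware])    # per-modality budgets keep both streams
```

With a single shared budget, the more numerous visual entries take every slot in this toy example; reserving slots per modality guarantees the audio stream keeps some representation.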
To make the approach viable in real‑world settings, the authors add a fine‑tuning phase that nudges the model to pack useful cues into the limited slots it retains. On VideoMME Long, LVBench and LVOmniBench, using video‑SALMONN 2+ and Qwen‑2.5‑Omni, OmniMem raises accuracy by 2–4 percentage points over strong, training‑free compression baselines at the same memory budget, with another 1–2 points after the fine‑tuning step. The result is a leaner memory footprint that preserves long‑form video understanding.
In the authors' words: "Unlike existing compression methods that treat all tokens uniformly, OmniMem introduces a modality-aware memory allocation strategy that separately manages visual and audio contexts, addressing the severe token imbalance between the two modalities. OmniMem further preserves informative and non-redundant KV states through perturbation-aware memory selection, enabling compact memory without sacrificing long-range understanding. To strengthen compression under realistic deployment constraints, we also explore budget-aware fine-tuning, which encourages the model to consolidate useful information into retained memory. Experiments on VideoMME Long, LVBench, and LVOmniBench with video-SALMONN 2+ and Qwen-2.5-Omni show that OmniMem consistently improves over strong training-free compression baselines by 2-4% absolute accuracy under the same memory budgets, with an additional 1-2% gain after fine-tuning."
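The text does not spell out how perturbation-aware selection scores a cached entry. One way to picture the idea is a leave-one-out test: drop an entry, recompute the attention output, and measure how far it moves. The sketch below is an illustrative stand-in under that assumption, not OmniMem's published procedure; the function names and the toy cache are hypothetical.

```python
import math


def attend(query, keys, values):
    """Single-query softmax attention over a small, non-empty cache."""
    scale = math.sqrt(len(query))
    logits = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    peak = max(logits)  # subtract the max for numerical stability
    weights = [math.exp(l - peak) for l in logits]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * v[d] for w, v in zip(weights, values)) / total for d in range(dim)]


def removal_sensitivity(query, keys, values):
    """Leave-one-out score per entry; needs at least two cached entries."""
    full = attend(query, keys, values)
    scores = []
    for i in range(len(keys)):
        reduced = attend(query, keys[:i] + keys[i + 1:], values[:i] + values[i + 1:])
        scores.append(math.dist(full, reduced))  # how far the output moved
    return scores


def keep_most_sensitive(query, keys, values, budget):
    """Indices of the entries whose removal perturbs the output the most."""
    scores = removal_sensitivity(query, keys, values)
    ranked = sorted(range(len(keys)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:budget])


# Toy cache: entries 0 and 1 are exact duplicates, entry 2 is distinct.
keys = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
values = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
query = [1.0, 1.0]

scores = removal_sensitivity(query, keys, values)
kept = keep_most_sensitive(query, keys, values, budget=2)
print(scores, kept)
```

In this toy cache, removing either duplicate changes the output less than removing the distinct entry, so a budget of two retains the distinct entry plus one copy of the redundant pair. A practical system would need a far cheaper score than recomputing attention once per entry.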
Why this matters
Can audio‑visual LLMs finally handle hour‑long streams? By allocating memory separately for visual and audio tokens, the framework directly tackles the token imbalance that has hampered existing compression schemes. A step forward.
We appreciate that the approach preserves informative and non‑redundant content, which suggests fewer dropped details during streaming. However, the paper does not disclose real‑world latency or hardware costs, leaving developers to wonder about deployment feasibility. Founders may see a path to more scalable video products, yet with only accuracy comparisons reported, the deployment advantage is unclear.
Researchers will likely probe whether modality‑aware allocation scales across diverse datasets, or if it merely shifts complexity elsewhere. The modality‑aware strategy could inspire similar designs in multimodal transformers, though we have yet to see whether it integrates smoothly with existing pipelines. In short, OmniMem offers a targeted solution to a known bottleneck, but its practical impact remains to be validated through broader testing.
For now, teams interested in long‑form video should evaluate OmniMem alongside baseline compression to gauge trade‑offs.
Further Reading
- OmniMem: Perturbation-aware Memory Compression for Streaming Audio-Visual LLMs - arXiv
- Audio-Visual LLMs: Fusion, Tuning & Efficiency - Emergent Mind
- LongVALE: Vision-Audio-Language-Event Benchmark Towards Time-Aware Omni-Modal Perception of Long Videos - CVPR 2025
- OmniVideoBench: Towards Audio-Visual Understanding Evaluation for Omni-Modal Models - TL;DR Takara