Qwen3‑VL scans two‑hour videos, hits 96.5% on DocVQA and 875 on OCRBench
Alibaba’s newest multimodal model, Qwen3-VL, appears to push the limits of what open-source vision-language systems can handle. It can ingest a two-hour video and extract frame-level details, a task that would likely overwhelm many research prototypes.