Two Decades of Failed Video Pixel Prediction Reveal World’s Messy Reality
Two decades of research have chased a simple promise: if language models can guess the next word, why not make them guess the next video frame? The idea has haunted labs since the early 2000s, spawning countless papers, benchmarks and hefty compute budgets. Yet every experiment seems to hit the same wall—pixel‑level forecasts crumble when confronted with the chaotic swirl of real‑world motion.
Researchers have watched models churn out blurry blobs where a ball should bounce, or freeze when lighting shifts, exposing a gap between statistical prediction and genuine physical insight. The disappointment isn’t just academic; it calls into question a broader ambition to stitch language and vision together into a single reasoning engine. As the field pivots, the conversation is shifting from “more data, bigger models” to “different architectures altogether.” The following statement captures why the community is reconsidering its core assumptions.
Attempts to transfer the principle of text prediction to the pixel level of video have failed over the last 20 years. The world is too "messy" and noisy for exact pixel prediction to lead to an understanding of physics or causality.

New architectures needed for physical understanding

To support his thesis, LeCun points to the massive inefficiency of current AI systems compared to biological brains. An LLM might be trained on roughly 30 trillion words, a volume of text that would take a human half a million years to read.
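The "half a million years" figure is easy to sanity-check. A rough back-of-the-envelope calculation, assuming a reading speed of about 250 words per minute and eight hours of reading per day (both assumptions, not figures from the article), lands in the same ballpark:

```python
# Rough sanity check of the "half a million years" figure.
words = 30e12          # ~30 trillion training words (from the article)
wpm = 250              # assumed adult reading speed, words per minute
hours_per_day = 8      # assumed daily reading budget

minutes = words / wpm                       # 1.2e11 minutes of reading
years = minutes / 60 / hours_per_day / 365  # convert to reading-years
print(f"~{years:,.0f} years")               # ~684,932 years
```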
Is predicting the next token enough to reach human‑like reasoning? LeCun says no. For two decades researchers have tried to apply the text‑prediction recipe to video pixels, and the results have been disappointing.
The world, he notes, is too noisy for exact pixel forecasts to capture physics or causality. Consequently, the promise that large language models will naturally evolve into artificial general intelligence looks shaky. While ChatGPT and Gemini dominate headlines, their underlying mechanism—next‑token prediction—has not demonstrated the ability to build a coherent model of the physical world.
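One concrete way to see why pixel‑level prediction tends to produce blur rather than physics: a regression model trained with mean‑squared error on an uncertain future converges to the average of the possible outcomes, not to any outcome that could actually occur. The toy sketch below is illustrative only (the one‑dimensional "frames" and the names `left`, `right`, and `mse` are made up for this example, not taken from any cited work):

```python
import numpy as np

# Toy setup: a ball at the center can bounce left or right with equal
# probability. Each "frame" is a 1-D strip of 7 pixels; the ball is a
# single bright pixel.
left = np.zeros(7);  left[1] = 1.0   # future A: ball ends up on the left
right = np.zeros(7); right[5] = 1.0  # future B: ball ends up on the right

# A pixel-level model trained with mean-squared error converges to the
# per-pixel mean of the possible futures, not to either real outcome.
mse_optimal = 0.5 * left + 0.5 * right
print(mse_optimal)  # [0.  0.5 0.  0.  0.  0.5 0. ] -> a faint "blur"

# The blurred average scores better under MSE than either sharp outcome.
def mse(pred):
    return np.mean([(pred - left) ** 2, (pred - right) ** 2])

print(mse(mse_optimal), mse(left))  # ~0.071 < ~0.143
```

Because averaging two sharp futures beats committing to either one under this loss, deterministic pixel predictors smear moving objects into exactly the "blurry blobs" the article describes.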
New architectures, perhaps grounded in causal inference, are being called for, but no concrete design has yet emerged. It is unclear whether abandoning the prediction‑centric paradigm will yield the sought‑after understanding. The debate, moderated by Janna, underscores a split among top scientists about the path forward.
Until evidence shows otherwise, the case against relying solely on token or pixel prediction remains compelling, and skepticism persists across the research community.
Further Reading
- Video Prediction of Dynamic Physical Simulations with Pixel-Space Transformers - arXiv
- The case against predicting tokens to build AGI - The Decoder
- Video Models of People and Pixels - UC Berkeley EECS
Common Questions Answered
Why have video pixel prediction attempts failed over the past two decades?
Researchers found that pixel‑level forecasts crumble when faced with the chaotic swirl of real‑world motion, often producing blurry blobs or freezing frames. The article explains that the world is too "messy" and noisy for exact pixel prediction to capture physics or causality, indicating a need for new architectures.
What does LeCun argue about the efficiency of current AI systems compared to biological brains?
LeCun points out that large language models are trained on roughly 30 trillion words, far more text than any human could read in a lifetime. This enormous data requirement highlights the inefficiency of current AI systems relative to the brain's far more economical learning mechanisms.
How does the article describe the relationship between predicting the next token and achieving human‑like reasoning?
The article states that simply predicting the next token is insufficient for human‑like reasoning, as demonstrated by two decades of disappointing video‑pixel experiments. LeCun concludes that the promise of large language models naturally evolving into artificial general intelligence is shaky.
What evidence does the article provide that models like ChatGPT and Gemini dominate headlines but may not lead to AGI?
While ChatGPT and Gemini attract media attention, the article notes that their underlying mechanism is still next‑token prediction, which has not translated into physical understanding in video tasks. The continued failure of pixel‑level prediction underscores doubts that these models will reach true artificial general intelligence.