
Researchers say OpenAI's Sora and Google's Veo aren't true world models


The buzz around text‑to‑video AI has been louder than the tech itself. OpenAI’s Sora, launched and later discontinued, was instantly tagged a “world simulator” by many observers. Google’s Veo drew similar hype, with DeepMind CEO Demis Hassabis framing it as a stride toward genuine world modeling.

Yet the academic community has begun to draw a line between flashy demos and the formal criteria of a world model. In a recent benchmark paper, the authors lay out a precise definition—one that hinges on a system’s ability to predict and reason about arbitrary future states, not just stitch together plausible video clips. Their analysis puts Sora and Veo squarely outside that boundary, aligning with critiques from figures like Yann LeCun.

The stakes are practical as well as theoretical: mislabeling these generators could steer research funding and expectations down a misleading path.

When OpenAI rolled out its now-discontinued Sora video model, plenty of people called it a "world simulator." DeepMind CEO Demis Hassabis made similar claims about Google's Veo video model, positioning it as a step toward world models. The authors flat-out disagree, landing on the same side as Yann LeCun: while video generation shows some grasp of physical relationships, it's missing the crucial feedback loop with the real world. A model that only generates videos from text doesn't perceive its environment and doesn't interact with it. Text-to-video therefore falls "outside the core tasks of world models," the paper states.

While the new framework offers a clear checklist—perception, interaction, memory—its strict criteria immediately disqualify current text‑to‑video systems. Sora, now discontinued, and Google's Veo generate footage from prompts, yet they receive no sensory feedback from an environment. Consequently, they cannot satisfy the interaction clause, nor can they retain episodic memory beyond the generated clip.

The authors therefore push back against the label of “world simulator” applied by OpenAI and DeepMind executives. Their stance aligns with earlier criticism from Yann LeCun, suggesting a broader unease about conflating generative video with genuine world modeling. However, the paper does not address whether future iterations might incorporate real‑time sensors or closed‑loop training, leaving that possibility uncertain.

The community now has a concrete definition to test against, but adoption will depend on how quickly developers can meet the three-pronged requirement. Until then, the claim that Sora or Veo represents a true world model remains unsubstantiated. The debate continues.

Common Questions Answered

Why do researchers argue that Sora and Veo are not true world models?

Researchers argue that Sora and Veo lack a crucial feedback loop with the real world, which is essential for true world modeling. These AI systems generate videos from text prompts but cannot interact with or learn from an actual environment, failing to meet the key criteria of perception, interaction, and memory.

What specific criteria do researchers use to define a world model?

The researchers outline a framework that requires three key elements: perception, interaction, and memory. Current text-to-video AI systems like Sora and Veo can only generate videos from prompts, but cannot receive sensory feedback, interact with an environment, or retain episodic memory beyond the generated clip.

How do prominent AI researchers like Yann LeCun view current text-to-video AI models?

Yann LeCun and other researchers are skeptical of claims that current text-to-video AI models are true world simulators. While these models show some understanding of physical relationships, they fundamentally lack the ability to learn and interact with the real world in a meaningful way.