
AI vision pioneer aims to extend models from data to space understanding


The researcher who first gave machines the ability to recognize objects is now looking up, at space. After years of building vision models that could label cats, read street signs and sort millions of pictures, she says the next frontier isn't more data but the world itself. She points to the gap between what AI can do on a screen and what it can actually do in a room, on a street, or even in orbit.

That matters because the same algorithms that run chatbots and recommendation engines are being pushed into robotics, self-driving cars and satellite navigation. Each of those deployments, however, hits a wall when the system has to move from pixels to physics. In her latest essay, she lays out the issue plainly: today's top models dominate reading, writing and pattern-finding, but they stumble when they need to represent or interact with the physical world.

"While current state‑of‑the‑art AI can excel at reading, writing, research, and pattern recognition in data, these same models bear fundamental limitations when representing or interacting with the physical world," Li writes. Humans, on the other hand, seem to integrate perception and meaning seamlessly. We don't just recognize a coffee mug, we instantly grasp its size, its weight, and where it sits in space.

That implicit spatial reasoning, says Li, is something AI still lacks entirely.

The cognitive scaffold that made us intelligent

Li traces the roots of intelligence back to the simplest perceptual loops. Long before animals could nurture offspring or communicate, they sensed and moved, starting a feedback cycle that eventually gave rise to thought itself.

That same ability underpins everything from driving a car to sketching a building or catching a ball. Words can describe these acts, but they can't reproduce the intuition behind them. "Thus, many scientists have conjectured that perception and action became the core loop driving the evolution of intelligence," Li notes.

How spatial insight powered human discovery

Throughout history, Li writes, breakthroughs have often come from seeing the world (literally) differently: Eratosthenes calculated Earth's circumference using shadows cast in two Egyptian cities at the same moment.
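As a quick refresher on the arithmetic (the figures here are the commonly cited historical values, not from Li's essay): at the summer solstice the Sun shone straight down a well in Syene, while in Alexandria, roughly 5,000 stadia to the north, a vertical rod cast a shadow at about a 7.2° angle. Because 7.2° is 1/50 of a full circle, the distance between the two cities must be 1/50 of Earth's circumference:

C ≈ (360° / 7.2°) × 5,000 stadia = 50 × 5,000 stadia = 250,000 stadia

The insight is spatial rather than computational: two local observations of angle and distance become a measurement of the whole planet.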


Can machines ever move through space with the same ease animals do? Li thinks the missing link isn’t more data but a real grasp of motion, distance and how things relate physically. Today’s models are great at reading, writing and spotting patterns, yet they trip up the moment you ask them to act in the real world.

Humans, on the other hand, blend perception and meaning almost automatically. The Stanford veteran who helped build ImageNet and founded the startup World Labs calls this the next big hurdle. Evolution gave creatures a sense of place roughly half a billion years ago; silicon might be starting to catch up.

Still, the route from algorithmic sight to true embodied spatial reasoning is anything but clear. Li’s essays point out deep limits but stop short of laying out a step-by-step plan. Whether future AI will become genuine “creative partners” by mastering space remains an open question.

I’m excited by the goal, aware of the obstacles, and honestly not sure how it will end up.

Common Questions Answered

What limitation does Li identify in current state‑of‑the‑art AI models regarding interaction with the physical world?

Li points out that while modern AI excels at reading, writing, and pattern recognition in data, it fundamentally struggles to represent or act within the physical world. This limitation prevents models from understanding motion, distance, and spatial relationships needed for real‑world tasks.

How does Li compare human spatial reasoning to AI's capabilities in the article?

Li notes that humans seamlessly integrate perception with meaning, instantly grasping an object's size, weight, and position in space without conscious effort. In contrast, AI lacks this implicit spatial reasoning and cannot intuitively infer physical properties from visual inputs.

Why does Li argue that extending AI models from data to space understanding is more important than simply adding more data?

Li believes the next frontier is teaching machines to comprehend motion, distance, and physical relationships, which data alone cannot provide. By focusing on spatial understanding, AI can move beyond screen‑based tasks to operate effectively in rooms, streets, and orbit.

What is the significance of Li's role as co‑creator of ImageNet and founder of the startup World Labs to her new focus on space understanding?

Li's work on ImageNet produced the large-scale labeled dataset that enabled modern vision models to recognize millions of images, demonstrating her deep expertise in visual perception. Her startup World Labs builds on that legacy, aiming to bridge the gap between image recognition and real‑world spatial reasoning.