Stanford University professor Fei-Fei Li, known as the "Godmother of AI," and her team recently published a study on the spatial intelligence of multimodal large models. The study finds that these models have developed preliminary abilities to remember and recall spatial information and show potential for forming local world models.

The research team developed a tool for assessing visual spatial intelligence called VSI-Bench, which includes over 5,000 high-quality question-and-answer pairs based on 288 real videos. The test videos cover living spaces, professional settings, and industrial scenes across various geographic regions.


The results indicate that although the overall performance of multimodal models is still below that of humans, the models have reached or approached human-level performance on certain tasks. For instance, Gemini-1.5 Pro excelled at tasks such as absolute distance and room size estimation, and some open-source models, such as the LLaVA series, also achieved competitive results.

The study also found that using cognitive maps to assist spatial reasoning can significantly improve the models' performance on spatial tasks, raising accuracy by 10 percentage points. This suggests that explicitly generating a cognitive map can help the models overcome their limitations in spatial understanding.
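
To make the idea concrete, a cognitive map can be pictured as a coarse grid on which each object is assigned a cell, so that a spatial question becomes a simple lookup and distance comparison. The sketch below is only an illustration of that idea, not the study's method: the grid size, the object names and positions, and the helper functions `cell_distance` and `closest_to` are all assumptions made for this example.

```python
import math

# Hypothetical cognitive map: object name -> (row, col) cell on a coarse grid.
# The grid size and the object positions are illustrative, not from the study.
GRID_SIZE = 10
cognitive_map = {
    "sofa": (2, 3),
    "table": (2, 6),
    "lamp": (8, 8),
}

def cell_distance(a, b):
    """Euclidean distance between two mapped objects, measured in grid cells."""
    (r1, c1), (r2, c2) = cognitive_map[a], cognitive_map[b]
    return math.hypot(r1 - r2, c1 - c2)

def closest_to(target, candidates):
    """Return the candidate whose map position is nearest to the target."""
    return min(candidates, key=lambda name: cell_distance(target, name))

# Answer a relative-distance question from the map alone.
print(closest_to("sofa", ["table", "lamp"]))
```

Once the positions are written down explicitly, a question such as "which object is closest to the sofa?" no longer depends on recalling the whole video, which is one plausible reading of why an explicit map helps.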

Fei-Fei Li stated that spatial intelligence is a key capability for AI to understand the physical world and is crucial for achieving Artificial General Intelligence (AGI). She believes that spatial intelligence will become the next frontier in the field of AI, with significant breakthroughs expected by 2025.

In September 2024, World Labs, the company founded by Fei-Fei Li, announced its official launch, with a focus on developing AI models with spatial intelligence. The company has secured investments from well-known institutions, including Nvidia, a16z, and Adobe, and is currently valued at over $1 billion.

This research marks a critical step for AI technology from two-dimensional information processing toward three-dimensional spatial perception. It has potential applications in navigation, robotic interaction, augmented reality, and other fields.