In the field of artificial intelligence, 3D vision and spatial understanding technologies are becoming key drivers of embodied intelligence, autonomous navigation, and virtual reality applications. In March 2025, Hangzhou-based Qunhe Technology announced the official open-sourcing of its self-developed 3D vision large language model, SpatialLM, at the GTC 2025 global conference, attracting widespread attention across the industry.

With its powerful spatial cognition capabilities and low-cost data processing, the model brings a revolutionary breakthrough to robot training, architectural design, and AR/VR applications. Based on the latest available information, AIbase compiles and analyzes SpatialLM's technical highlights and industry impact below.

SpatialLM: From Phone Videos to Physically Accurate 3D Scenes

SpatialLM is a large language model designed specifically for 3D spatial understanding. It can quickly generate physically accurate 3D scene layouts from videos captured with ordinary phones or cameras. Whereas traditional methods rely on expensive LiDAR or other specialized equipment, SpatialLM significantly lowers the data acquisition threshold by processing point cloud data from multiple sources (such as monocular video sequences, RGBD images, or LiDAR sensors). The model accurately identifies architectural elements in the scene (such as walls and windows) as well as semantic bounding boxes of objects (for example, "sofa, 1.8 meters long, 0.5 meters from the wall"), and outputs them in a structured scripting language, giving machines spatial cognition similar to that of humans.
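
To make the idea of a structured layout script concrete, the sketch below mimics that style of output in plain Python. The entity types, field names, and units here are illustrative assumptions, not SpatialLM's actual output schema:

```python
from dataclasses import dataclass

# Hypothetical entity types mirroring the kind of structured layout script
# described above; field names and units (meters) are illustrative assumptions.

@dataclass
class Wall:
    x0: float
    y0: float
    x1: float
    y1: float
    height: float
    thickness: float

@dataclass
class Bbox:
    cls: str          # semantic class, e.g. "sofa"
    cx: float         # box center x (m)
    cy: float         # box center y (m)
    cz: float         # box center z (m)
    length: float
    width: float
    height: float
    rotation: float   # yaw in radians

# One wall along y = 0, and a sofa 1.8 m long whose near edge
# sits 0.5 m from that wall (center y = 0.5 + 0.9 / 2 = 0.95).
wall_0 = Wall(x0=0.0, y0=0.0, x1=5.2, y1=0.0, height=2.8, thickness=0.2)
sofa_0 = Bbox(cls="sofa", cx=2.6, cy=0.95, cz=0.4,
              length=1.8, width=0.9, height=0.8, rotation=0.0)
```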

Its core technology builds on MASt3R-SLAM, which splits video into frames, extracts spatial details, and reconstructs a high-density 3D point cloud. A point cloud encoder then compresses the data into compact feature vectors, from which the large language model (LLM) generates scene code, ensuring that the output 3D layout obeys physical rules (such as "furniture cannot be suspended" and "passage width ≥ 0.8 meters"). This multi-modal architecture effectively bridges the gap between unstructured 3D geometric data and structured representations, providing high-level semantic understanding for complex scene analysis.
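
As a rough illustration of what enforcing such physical rules can look like, here is a minimal post-hoc validation sketch. The box representation and checks are assumptions for illustration only; in SpatialLM these constraints are part of generation itself rather than a separate pass:

```python
# Minimal sketch of validating a generated layout against the two physical
# rules quoted above. Box format (center + size dicts, in meters) is assumed.

def validate_layout(boxes, min_passage=0.8, eps=0.05):
    issues = []
    # Rule 1: "furniture cannot be suspended" -- box bottoms must touch the floor.
    for i, b in enumerate(boxes):
        bottom = b["cz"] - b["h"] / 2
        if bottom > eps:
            issues.append(f"box {i} ({b['cls']}) floats {bottom:.2f} m above the floor")
    # Rule 2: "passage width >= 0.8 m" -- axis-aligned clearance between boxes.
    for i in range(len(boxes)):
        for j in range(i + 1, len(boxes)):
            a, c = boxes[i], boxes[j]
            gap_x = abs(a["cx"] - c["cx"]) - (a["l"] + c["l"]) / 2
            gap_y = abs(a["cy"] - c["cy"]) - (a["w"] + c["w"]) / 2
            gap = max(gap_x, gap_y)
            if 0 < gap < min_passage:
                issues.append(f"passage between box {i} and box {j} is only {gap:.2f} m")
    return issues

boxes = [
    {"cls": "sofa",  "cx": 1.0, "cy": 1.0, "cz": 0.4,  "l": 1.8, "w": 0.9, "h": 0.8},
    {"cls": "table", "cx": 1.0, "cy": 2.3, "cz": 0.35, "l": 1.2, "w": 0.7, "h": 0.7},
]
print(validate_layout(boxes))  # flags the 0.5 m gap between sofa and table
```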

Open-Source Empowerment: Lowering the Barrier to Embodied Intelligence Development

The open-sourced SpatialLM comes in two versions: SpatialLM-Llama-1B, based on Llama, and SpatialLM-Qwen-0.5B, based on Qwen, with roughly 1 billion and 0.5 billion parameters respectively. Compared with mainstream LLMs, which often run to tens of billions of parameters, both are lightweight and efficient. The models have been made available to developers worldwide on platforms such as Hugging Face, GitHub, and the ModelScope community, along with detailed tutorials and test data (such as SpatialLM-Testset, which contains 107 point clouds reconstructed from monocular RGB videos). Developers can run inference with simple Python scripts and inspect the resulting 3D layouts in visualization tools such as Rerun.
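
As a taste of that workflow's visualization step, the minimal sketch below logs two hand-written layout boxes to the Rerun viewer. The box values are invented, and the official repository ships its own inference and visualization scripts:

```python
import rerun as rr  # pip install rerun-sdk

# Minimal sketch: view a predicted 3D layout in the Rerun viewer. Assumes the
# model's structured output has already been parsed into centers and sizes;
# the values below are made up for illustration.
rr.init("spatiallm_layout", spawn=True)

rr.log(
    "layout/boxes",
    rr.Boxes3D(
        centers=[(2.6, 0.95, 0.4), (1.0, 2.3, 0.35)],      # box centers (m)
        half_sizes=[(0.9, 0.45, 0.4), (0.6, 0.35, 0.35)],  # half extents (m)
        labels=["sofa", "table"],
    ),
)
```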

The significance of this open-source release lies in providing a foundational training framework for the embodied intelligence field. Zhou Zihang, chief scientist at Qunhe Technology, stated: "SpatialLM aims to help robotics companies without model development capabilities quickly improve their machines' spatial understanding through fine-tuning." Combined with SpatialVerse, the spatial intelligence platform Qunhe previously open-sourced, SpatialLM can turn real-world scenes into virtual training environments, generating billions of simulated scenes and significantly reducing the cost and risk of robot training.

Wide Applications: From Robotics to Architectural Design

SpatialLM has a wide range of applications. In the field of embodied intelligence, it supports robots in navigating, avoiding obstacles, and performing tasks in complex environments, providing core technical support for smart homes and service robots. In architectural design and planning, the model can analyze building point cloud data, automatically identify structures such as walls and windows, and assist in efficient design. Furthermore, in education and training, SpatialLM can be used to develop 3D modeling teaching software to help students intuitively understand spatial relationships. In AR/VR and game development, its virtual scene generation capabilities provide a low-cost solution for immersive experiences.
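
For the navigation use case in particular, a detected layout translates naturally into planner-ready input. The sketch below rasterizes object boxes into a 2D occupancy grid; the box format is an assumption carried over from the earlier illustration:

```python
import numpy as np

def layout_to_occupancy(boxes, room_m=(6.0, 6.0), res=0.05):
    """Rasterize detected boxes into a 2D occupancy grid (5 cm cells)."""
    grid = np.zeros((int(room_m[1] / res), int(room_m[0] / res)), dtype=bool)
    for b in boxes:
        x0 = max(int((b["cx"] - b["l"] / 2) / res), 0)
        x1 = int((b["cx"] + b["l"] / 2) / res)
        y0 = max(int((b["cy"] - b["w"] / 2) / res), 0)
        y1 = int((b["cy"] + b["w"] / 2) / res)
        grid[y0:y1, x0:x1] = True  # mark the box footprint as occupied
    return grid

grid = layout_to_occupancy([{"cx": 2.6, "cy": 0.95, "l": 1.8, "w": 0.9}])
print(grid.sum(), "occupied cells")  # 648 cells = 1.8 m x 0.9 m footprint
```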

The open-sourcing of SpatialLM not only demonstrates Qunhe Technology's accumulated expertise in spatial intelligence but also promotes the popularization of 3D vision technology and innovation around it. Compared with models such as Meta's SceneScript, SpatialLM is more versatile because it takes ordinary video as input, and planned iterations on natural language interaction and scene interaction features are intended to further improve its practicality.

Project: https://huggingface.co/manycore-research/SpatialLM-Llama-1B