Recent research from Google proposes SpatialVLM to address the lack of spatial reasoning capabilities in visual language models. Drawing inspiration from human spatial reasoning, the researchers designed SpatialVLM to perform both direct spatial reasoning and chain-of-thought reasoning about space. Using off-the-shelf models for open-vocabulary detection, depth estimation, and semantic segmentation, they built a comprehensive data generation framework that extracts entity information from images and produces a large-scale spatial VQA dataset; training on this data improves the model's performance on spatial questions and quantitative estimation. This work opens up new possibilities for the development of visual language models and brings fresh advances to the field of artificial intelligence.
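To make the data generation idea concrete, here is a minimal, hypothetical sketch of how spatial VQA pairs could be synthesized once detection and depth estimation have lifted scene entities into 3D. The `Entity` record, the question template, and the example scene are all illustrative assumptions, not the paper's actual pipeline:

```python
import math
from dataclasses import dataclass

@dataclass
class Entity:
    # Hypothetical entity record: an object name plus a 3D position
    # (in meters) obtained from detection + depth estimation.
    name: str
    position: tuple

def distance_question(a: Entity, b: Entity) -> tuple[str, str]:
    """Emit one quantitative spatial VQA pair from two entities."""
    dist = math.dist(a.position, b.position)
    question = f"How far is the {a.name} from the {b.name}?"
    answer = f"Approximately {dist:.1f} meters."
    return question, answer

def generate_vqa(entities: list[Entity]) -> list[tuple[str, str]]:
    """Pair up every two entities and generate a distance question each."""
    pairs = []
    for i, a in enumerate(entities):
        for b in entities[i + 1:]:
            pairs.append(distance_question(a, b))
    return pairs

# Toy scene with two detected objects.
scene = [Entity("mug", (0.2, 0.0, 1.0)), Entity("laptop", (0.8, 0.0, 1.0))]
for q, a in generate_vqa(scene):
    print(q, a)  # How far is the mug from the laptop? Approximately 0.6 meters.
```

Generating millions of such templated question-answer pairs from automatically extracted 3D entity information is what lets the model learn quantitative spatial estimation without any human annotation.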