VITA-1.5 is an open-source multimodal large language model designed for near real-time visual and speech interaction. By handling speech understanding and generation end-to-end inside the model, rather than chaining separate ASR and TTS modules, it substantially reduces end-to-end interaction latency and delivers a smoother conversational experience. The model supports both English and Chinese and covers a range of tasks, including image recognition, speech recognition, and natural language processing. Its key strengths are efficient speech processing and robust multimodal understanding.
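Since the model is open source, a natural first step is loading it for inference. Below is a minimal sketch assuming a Hugging Face `transformers`-style interface with `trust_remote_code`; the repo id `VITA-MLLM/VITA-1.5`, the processor behavior, and the generation arguments are illustrative assumptions rather than the project's documented API, so consult the official repository for the actual entry points.

```python
# Hypothetical quick-start for VITA-1.5 via Hugging Face transformers.
# The repo id and processor/generate interface below are assumptions
# for illustration; the official repo may expose different entry points.
from transformers import AutoModelForCausalLM, AutoProcessor
from PIL import Image

model_id = "VITA-MLLM/VITA-1.5"  # assumed Hugging Face repo id

# trust_remote_code lets transformers load the model's custom multimodal classes.
processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(model_id, trust_remote_code=True)

image = Image.open("example.jpg")
prompt = "Describe this image."  # English and Chinese prompts are both supported

# Pack the text and image into model inputs, then generate a response.
inputs = processor(text=prompt, images=image, return_tensors="pt")
output_ids = model.generate(**inputs, max_new_tokens=128)
print(processor.batch_decode(output_ids, skip_special_tokens=True)[0])
```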