Phi-3.5-vision is a lightweight multimodal model developed by Microsoft. It was trained on datasets that include synthetic data and filtered publicly available websites, with a focus on high-quality, reasoning-dense data for both text and visual inputs. The model belongs to the Phi-3 family and was post-trained with supervised fine-tuning followed by direct preference optimization to improve instruction following and safety behavior.