Phi-3 Vision is a lightweight, state-of-the-art open multimodal model trained on datasets that include synthetic data and curated, publicly available web data, with an emphasis on very high-quality, reasoning-dense content for both text and vision. The model belongs to the Phi-3 family, and its multimodal version supports a 128K-token context length. It has undergone a rigorous enhancement process that combines supervised fine-tuning and direct preference optimization to ensure precise instruction following and robust safety measures.
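As a small illustrative sketch of what multimodal instruction following looks like in practice, the snippet below assembles a chat-style prompt that interleaves an image placeholder with a user question. The `<|image_1|>`, `<|user|>`, `<|end|>`, and `<|assistant|>` tags follow the chat format published for Phi-3 instruct models; they are an assumption here, not stated in this document, and the helper function is hypothetical.

```python
# Hypothetical sketch: building a Phi-3 Vision style chat prompt.
# The special tags below are assumed from the Phi-3 chat format and
# are not taken from this document.

def build_prompt(question: str, num_images: int = 1) -> str:
    """Interleave numbered image placeholders with a user question,
    then open an assistant turn for the model to complete."""
    placeholders = "".join(f"<|image_{i}|>\n" for i in range(1, num_images + 1))
    return f"<|user|>\n{placeholders}{question}<|end|>\n<|assistant|>\n"

prompt = build_prompt("Describe this chart.")
print(prompt)
```

In a real pipeline, a string like this would be tokenized together with the referenced image(s) by the model's processor before generation; the placeholder tells the model where each image sits in the conversation.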