Microsoft has recently released Phi-3.5-vision, a lightweight, multimodal open-source AI model, which is the newest member of the Phi-3 model family designed for applications that require simultaneous processing of text and visual inputs. The Phi-3.5-vision model performs exceptionally well in environments with limited memory or computational resources, supporting a context length of 128K, making it an ideal choice for both commercial and research sectors.

image.png

The Phi-3.5-vision model offers a wide range of functionalities including extensive image understanding, optical character recognition (OCR), chart and table parsing, and summarization of multiple images or video clips. It has demonstrated significant performance improvements in benchmark tests related to image and video processing.

Comprising a system with 4.2 billion parameters, the Phi-3.5-vision model includes an image encoder, connector, projector, and the Phi-3Mini language model. It is trained using high-quality educational data, synthetic data, and rigorously screened public documents to ensure data quality and privacy.

Phi-3.5-vision includes three models:

Phi-3.5Mini Instruct: A lightweight AI model suitable for environments with limited memory or computational resources.

Phi-3.5MoE (Mixture of Experts): Microsoft's first "mixture of experts" model, adept at handling complex tasks.

Phi-3.5Vision Instruct: A multimodal model that integrates text and image processing capabilities.

Key Features

The main features of the Phi-3.5-vision model include image understanding, OCR, chart and table comprehension, multi-image comparison, summarization of multiple images or video clips, efficient inference capabilities, and low latency with memory optimization.

Phi-3.5-vision has performed excellently in multiple benchmark tests such as MMMU, MMBench, TextVQA, video processing capability tests, and the BLINK benchmark, showcasing its robust performance in multimodal and visual tasks.

The release of Microsoft's Phi-3.5-vision model brings new options to the AI field, particularly in edge-side operations and complex visual reasoning. Its open-source nature and optimized design allow it to perform exceptionally well in resource-constrained environments, providing strong support for a variety of AI-driven applications.

Model download link: https://huggingface.co/microsoft/Phi-3.5-vision-instruct