Apple has recently released MM1.5, a significant update to its multimodal AI model MM1. The update is more than a version bump: it is a comprehensive enhancement that improves the model's capabilities across a range of multimodal tasks.

The core upgrade in MM1.5 lies in how its training data is processed. The model adopts a data-centric training approach, meticulously selecting and optimizing the training mixture. Specifically, MM1.5 incorporates high-quality OCR data and synthetic image captions for continual pre-training, along with an optimized visual instruction-tuning data mixture for supervised fine-tuning. Together, these data types notably improve the model's performance in text recognition, image understanding, and following visual instructions.
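To make the data-centric idea concrete, here is a minimal sketch of weighted sampling across several data categories during training. The category names mirror the data types named above; the weights, records, and file names are illustrative placeholders, not the actual ratios or data used for MM1.5.

```python
import random

# Illustrative data-mixture sampler in the spirit of a data-centric
# training recipe. Weights and records are made-up placeholders,
# not the mixture Apple actually used.
MIXTURE = {
    "ocr": 0.3,               # text-rich / OCR data
    "synthetic_caption": 0.3, # synthetic image descriptions
    "visual_instruct": 0.4,   # visual instruction-tuning data
}

DATASETS = {
    "ocr": [{"image": "receipt_001.png", "text": "TOTAL $12.40"}],
    "synthetic_caption": [{"image": "park_001.png", "text": "A dog chasing a ball."}],
    "visual_instruct": [{"image": "chart_001.png", "text": "Q: Which bar is tallest? A: 2023."}],
}

def sample_batch(batch_size: int) -> list[dict]:
    """Draw a training batch whose composition follows the mixture weights."""
    categories = random.choices(
        population=list(MIXTURE), weights=list(MIXTURE.values()), k=batch_size
    )
    return [random.choice(DATASETS[c]) | {"source": c} for c in categories]

if __name__ == "__main__":
    for example in sample_batch(4):
        print(example["source"], "->", example["image"])
```

Adjusting the category weights is the knob such a recipe turns: the paper's central claim is that tuning what the model sees, not just how big it is, drives much of the quality gain.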

In terms of model scale, MM1.5 comes in multiple sizes ranging from 1 billion to 30 billion parameters, including both dense and mixture-of-experts (MoE) variants. Notably, even the smaller 1-billion- and 3-billion-parameter models achieve impressive performance through carefully designed data and training strategies.
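For readers unfamiliar with the MoE design, the sketch below shows the general top-k routing mechanism such variants rely on: a router scores a small set of expert feed-forward layers per token and mixes the outputs of the best-scoring ones. All dimensions, the expert count, and the top-k value are arbitrary toy choices, not MM1.5's actual configuration.

```python
import numpy as np

# Toy top-2 mixture-of-experts routing. Each "expert" is a tiny linear
# layer; the router picks which experts process each token.
rng = np.random.default_rng(0)
d_model, n_experts, top_k = 16, 4, 2

experts = [rng.normal(size=(d_model, d_model)) * 0.02 for _ in range(n_experts)]
router = rng.normal(size=(d_model, n_experts)) * 0.02

def moe_forward(x: np.ndarray) -> np.ndarray:
    """Route each token to its top-k experts and mix their outputs."""
    logits = x @ router                           # (tokens, n_experts)
    top = np.argsort(logits, axis=-1)[:, -top_k:] # indices of best experts
    out = np.zeros_like(x)
    for t, token in enumerate(x):
        scores = logits[t, top[t]]
        weights = np.exp(scores) / np.exp(scores).sum()  # softmax over chosen experts
        for w, e in zip(weights, top[t]):
            out[t] += w * (token @ experts[e])
    return out

tokens = rng.normal(size=(3, d_model))
print(moe_forward(tokens).shape)  # (3, 16)
```

The appeal of this design is that only the selected experts run per token, so parameter count grows faster than per-token compute does.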

The enhanced capabilities of MM1.5 fall primarily into the following areas: text-rich image understanding, visual referring and grounding, multi-image reasoning, video understanding, and mobile UI comprehension. These capabilities let MM1.5 cover a broader range of scenarios, such as identifying the performers and instruments in a concert photo, reading chart data and answering related questions, or locating specific objects in complex scenes.
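As a concrete, purely hypothetical illustration of the referring-and-grounding capability, the snippet below sketches an exchange in which a grounding-capable model answers a localization query with a normalized bounding box. The message schema, field names, and coordinate convention are assumptions for illustration; the paper defines MM1.5's actual prompt and coordinate format.

```python
# Hypothetical grounding exchange: the model is asked to localize an object
# and answers with a box normalized to [0, 1]. Schema and values are
# illustrative, not MM1.5's real interface.
query = {
    "image": "concert_photo.jpg",
    "prompt": "Locate the guitarist in the image.",
}

response = {
    "answer": "The guitarist is on the left side of the stage.",
    "box": [0.08, 0.35, 0.31, 0.92],  # [x_min, y_min, x_max, y_max]
}

def to_pixels(box, width, height):
    """Convert a normalized box to pixel coordinates for drawing or cropping."""
    x0, y0, x1, y1 = box
    return (int(x0 * width), int(y0 * height), int(x1 * width), int(y1 * height))

print(to_pixels(response["box"], width=1920, height=1080))
```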

To evaluate MM1.5, the researchers compared it against other leading multimodal models. The results show that MM1.5-1B leads among models at the 1-billion-parameter scale, outperforming its counterparts. MM1.5-3B surpasses MiniCPM-V 2.0 and is on par with InternVL2 and Phi-3-Vision. The paper's experiments also show that both the dense and MoE variants improve significantly as model scale increases.

The success of MM1.5 not only demonstrates Apple's research strength in artificial intelligence but also points a way forward for multimodal models: with optimized data curation and model architecture, even smaller models can achieve strong performance, which is crucial for deploying capable AI on resource-constrained devices.

Paper link: https://arxiv.org/pdf/2409.20566