Apple's AI research team recently introduced MM1.5, its new generation of multi-modal large language models (MLLMs). The family integrates data types such as text and images, showcasing new AI capabilities in understanding complex tasks, and helps address tasks like visual question answering, image captioning, and multi-modal data interpretation.

One major challenge for multi-modal models is achieving effective interaction between different data types, and previous models often struggled with text-rich images and fine-grained visual tasks. To address this, Apple's research team adopted a data-centric approach in MM1.5, using high-resolution OCR data and synthetic image descriptions to strengthen the model's understanding capabilities.
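As a rough illustration of what such data might look like, the sketch below pairs OCR-extracted text with a synthetic caption to form a single text-rich training sample. This is not Apple's actual pipeline; the OCR backend (`pytesseract`) and all field names are illustrative assumptions.

```python
# Hypothetical sketch of assembling a text-rich training sample from
# high-resolution OCR output plus a synthetic image description.
# This is NOT Apple's actual pipeline; the OCR backend (pytesseract)
# and all field names are illustrative assumptions.
from dataclasses import dataclass

from PIL import Image
import pytesseract  # assumed OCR backend, used only for illustration


@dataclass
class TextRichSample:
    image_path: str
    ocr_text: str           # text transcribed from the image
    synthetic_caption: str  # model-generated description of the image


def build_text_rich_sample(image_path: str, synthetic_caption: str) -> TextRichSample:
    """Pair OCR-extracted text with a synthetic caption for one image."""
    image = Image.open(image_path)
    ocr_text = pytesseract.image_to_string(image)
    return TextRichSample(image_path, ocr_text.strip(), synthetic_caption)
```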

With this approach, MM1.5 not only surpasses previous models on visual understanding and grounding tasks; the family also includes two specialized variants, MM1.5-Video and MM1.5-UI, dedicated to video understanding and mobile interface analysis, respectively.

The training of the MM1.5 model is divided into three main stages.

The first stage is large-scale pre-training, using 2 billion image-text pairs, 600 million interleaved image-text documents, and 2 trillion text-only tokens.

The second stage is continual pre-training on 45 million high-quality OCR examples and 7 million synthetic image descriptions, which further improves performance on text-rich image tasks.

Finally, in the supervised fine-tuning stage, the model is optimized using carefully selected single-image, multi-image, and text-only data, making it more adept at fine-grained visual referencing and multi-image reasoning.
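Taken together, the three stages form a staged data recipe. The snippet below is a minimal sketch of that recipe written as a plain configuration; the structure and field names are assumptions rather than Apple's actual training code, and only the data volumes come from the figures reported above.

```python
# Minimal sketch of MM1.5's three-stage data recipe written as a plain
# Python config. The structure and field names are assumptions, not
# Apple's training code; the data volumes are those reported above.
TRAINING_STAGES = [
    {
        "name": "large_scale_pretraining",
        "data": {
            "image_text_pairs": 2_000_000_000,
            "interleaved_image_text_documents": 600_000_000,
            "text_only_tokens": 2_000_000_000_000,
        },
    },
    {
        "name": "continual_pretraining",
        "data": {
            "high_quality_ocr_examples": 45_000_000,
            "synthetic_image_descriptions": 7_000_000,
        },
        "goal": "improve performance on text-rich image tasks",
    },
    {
        "name": "supervised_fine_tuning",
        "data": {
            "mixture": ["single_image", "multi_image", "text_only"],
        },
        "goal": "fine-grained visual referring and multi-image reasoning",
    },
]

# Print a one-line summary per stage.
for stage in TRAINING_STAGES:
    print(f"{stage['name']}: {stage['data']}")
```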

In evaluations across multiple benchmarks, MM1.5 performs strongly, particularly on text-rich image understanding, where it shows a 1.4-point improvement over previous models. The MM1.5-Video variant, dedicated to video understanding, also reaches a leading level on related tasks thanks to the family's strong multi-modal capabilities.

The MM1.5 family not only sets new benchmarks for multi-modal large language models but also demonstrates strong performance across applications, from general image-text understanding to video and user-interface analysis.

Key Highlights:

🌟 **Model Variants**: Includes dense and MoE models ranging from 1 billion to 30 billion parameters, ensuring scalability and flexible deployment.

📊 **Training Data**: Utilizes 2 billion image-text pairs, 600 million interleaved image-text documents, and 2 trillion text-only tokens.

🚀 **Performance Improvement**: Achieved a 1.4-point improvement in benchmark tests focused on text-rich image understanding compared to previous models.