Yesterday, Shanghai AI Lab delivered a major surprise: it open-sourced a multimodal large language model named InternLM-XComposer-2.5 (IXC-2.5 for short). This is no ordinary model; it shows extraordinary ability across the board, particularly in ultra-high-resolution image understanding, fine-grained video understanding, and multi-turn multi-image dialogue, and it leaves a lasting impression.
What is even more impressive is that IXC-2.5 has been specially optimized for web page creation and mixed text-image layout, which is undoubtedly a huge boon for creators who need to display rich content on the web. Moreover, the open-source nature of IXC-2.5 fills a gap in the domestic multimodal LLM field.
Characteristics of the IXC-2.5 Model:
Long Context Handling: IXC-2.5 is natively trained with 24K-token interleaved image-text contexts and can be extended to 96K tokens, meaning it can handle extremely long text and image inputs and gives users far more creative room.
Diverse Visual Capabilities: Beyond ultra-high-resolution image understanding, it also handles fine-grained video understanding and multi-turn multi-image dialogue, a combination rarely found together in earlier open-source models.
Generation Ability: IXC-2.5 can generate web pages and high-quality interleaved text-image articles, setting a new bar for text-image composition.
Model Architecture: It consists of a lightweight visual encoder, a large language model, and a LoRA-based alignment technique (Partial LoRA, which adapts only the visual tokens). Together these components account for much of IXC-2.5's performance.
Test Results: Evaluated on 28 benchmarks, IXC-2.5 surpassed existing open-source state-of-the-art models on 16 of them, and matched or surpassed GPT-4V and Gemini Pro on 16 key tasks, which is enough to prove its formidable strength.
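The LoRA alignment mentioned in the architecture above takes the form, in the IXC series, of Partial LoRA: a low-rank update that is applied only to image tokens, while text tokens pass through the frozen base weights untouched. The following is an illustrative sketch with made-up shapes, not the model's actual implementation:

```python
import numpy as np

def partial_lora_linear(x, W, A, B, is_image_token):
    """Illustrative Partial-LoRA linear layer.

    x:  (n_tokens, d_in) hidden states
    W:  (d_out, d_in)    frozen base weight
    A:  (r, d_in), B: (d_out, r)  low-rank adapter, r << d
    is_image_token: (n_tokens,) bool mask

    Text tokens see only the frozen weight; image tokens
    additionally receive the low-rank LoRA update.
    """
    base = x @ W.T
    lora = (x @ A.T) @ B.T  # rank-r update
    return base + lora * is_image_token[:, None]

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))            # 4 tokens, hidden dim 8
W = rng.normal(size=(8, 8))
A = rng.normal(size=(2, 8))            # rank-2 adapter
B = rng.normal(size=(8, 2))
mask = np.array([False, True, True, False])  # tokens 1, 2 are image tokens

y = partial_lora_linear(x, W, A, B, mask)
# Text tokens are identical to the plain linear output:
assert np.allclose(y[0], x[0] @ W.T)
# Image tokens carry the extra low-rank update:
assert not np.allclose(y[1], x[1] @ W.T)
```

The appeal of this design is that only the small A and B matrices are trained for vision alignment, so the language model's original text ability is preserved.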
Multi-turn Dialogue Demonstration
IXC-2.5 is the work of a joint team from Shanghai AI Lab, The Chinese University of Hong Kong, SenseTime Group, and Tsinghua University. The model was designed from the outset to support long-context input and output so as to handle increasingly complex text-image understanding and creation tasks.
For images, IXC-2.5 adopts a unified dynamic image segmentation strategy that adapts to any resolution and aspect ratio. For video, it concatenates sampled frames along the short edge into a single high-resolution composite image, keeping each frame's index so that temporal order is preserved.
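As a rough illustration of the two ideas above, splitting an arbitrary-resolution image into fixed-size tiles and stitching video frames into one indexed composite, here is a hedged sketch; the tile size, padding, and labeling scheme are assumptions, not the paper's exact strategy:

```python
import math
import numpy as np

TILE = 336  # assumed ViT input size; the real model's tile size may differ

def tile_image(img):
    """Pad an (H, W, C) image up to a multiple of TILE, then split it
    into a grid of TILE x TILE patches (simplified dynamic segmentation)."""
    h, w, c = img.shape
    rows, cols = math.ceil(h / TILE), math.ceil(w / TILE)
    padded = np.zeros((rows * TILE, cols * TILE, c), dtype=img.dtype)
    padded[:h, :w] = img
    return [padded[r * TILE:(r + 1) * TILE, t * TILE:(t + 1) * TILE]
            for r in range(rows) for t in range(cols)]

def concat_frames(frames):
    """Stack sampled video frames along one edge into a single tall
    composite image; the returned index list stands in for temporal order."""
    return np.concatenate(frames, axis=0), list(range(len(frames)))

img = np.ones((700, 1000, 3), dtype=np.uint8)
tiles = tile_image(img)
assert len(tiles) == 3 * 3                 # ceil(700/336) x ceil(1000/336)
assert tiles[0].shape == (TILE, TILE, 3)

frames = [np.ones((336, 336, 3), dtype=np.uint8) for _ in range(4)]
composite, idx = concat_frames(frames)
assert composite.shape == (4 * 336, 336, 3)
assert idx == [0, 1, 2, 3]
```

In effect, both arbitrary-resolution images and videos are reduced to the same thing the encoder already understands: a grid of fixed-size patches.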
Text and Image Mixed Layout Demonstration
In the pre-training phase, IXC-2.5 extends its context window to 96K tokens through positional encoding extrapolation (RoPE extrapolation), which underpins its strong long-form interaction and content-creation abilities. In the supervised fine-tuning phase, it is trained on dedicated datasets for handling large images and videos.
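The extrapolation step rests on rotary position embeddings (RoPE): positions are encoded by fixed sinusoidal frequencies rather than learned per-position slots, so the same formula can be evaluated at positions beyond the 24K training window. A minimal sketch follows; the dimension and base are illustrative, and the paper's exact extrapolation recipe may add frequency scaling on top of this:

```python
import numpy as np

def rope_angles(positions, dim=64, base=10000.0):
    """RoPE rotation angles: theta(p, i) = p * base^(-2i/dim).
    The formula is defined for any position p, so a model trained on
    positions below 24K can still be queried at positions up to 96K."""
    inv_freq = base ** (-np.arange(0, dim, 2) / dim)
    return np.outer(positions, inv_freq)

train = rope_angles(np.arange(0, 24_000, 1000))    # inside training window
extra = rope_angles(np.arange(24_000, 96_000, 1000))  # extrapolated range
assert train.shape == (24, 32) and extra.shape == (72, 32)
# Angles keep growing smoothly past the training window:
assert np.all(extra[0] >= train[-1])
```

The caveat, which motivates the various extrapolation and scaling tricks in the literature, is that attention quality at unseen angle ranges is not guaranteed, which is why long-context behavior still has to be validated empirically.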
Additionally, IXC-2.5 extends into web page generation: it can automatically construct web pages from visual screenshots, free-form instructions, or resume documents. For text-image article creation, IXC-2.5 proposes an extensible pipeline that combines several techniques to generate high-quality, stable interleaved articles.
After a series of comprehensive experiments, IXC-2.5 has performed exceptionally well in multiple benchmark tests, showcasing strong competitiveness in tasks such as video understanding, structured high-resolution image understanding, multi-turn multi-image dialogue, and general visual question answering.
The open-sourcing of IXC-2.5 is not only a technological leap but also a significant contribution to the AI field as a whole. It shows the vast possibilities of multimodal LLMs and opens new paths for future AI applications.
Project Address: https://top.aibase.com/tool/internlm-xcomposer-2-5
Paper Address: https://arxiv.org/pdf/2407.03320