The OpenBMB team recently launched MiniCPM-o 2.6, the latest and most capable multimodal large language model (MLLM) in the series. The standout feature of MiniCPM-o 2.6 is that, with only 8 billion parameters, it delivers vision, speech, and multimodal live-streaming performance approaching that of GPT-4o-202405, making it a versatile and efficient choice in the open-source community.


MiniCPM-o 2.6 has powerful input processing capabilities: it accepts images, video, text, and audio as input and produces high-quality text and speech output.
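To make that input/output contract concrete, here is a minimal sketch of single-image question answering through the Hugging Face `transformers` interface the project documents. The checkpoint name `openbmb/MiniCPM-o-2_6` matches the official release; treat the exact `chat()` argument names as assumptions to verify against the repository README.

```python
# Minimal sketch: image + text in, text out, via the repository's chat() API.
import torch
from PIL import Image
from transformers import AutoModel, AutoTokenizer

model = AutoModel.from_pretrained(
    "openbmb/MiniCPM-o-2_6",
    trust_remote_code=True,      # the model ships custom modeling code
    attn_implementation="sdpa",
    torch_dtype=torch.bfloat16,
).eval().cuda()
tokenizer = AutoTokenizer.from_pretrained(
    "openbmb/MiniCPM-o-2_6", trust_remote_code=True
)

# Images and text are interleaved inside a single user turn.
image = Image.open("receipt.jpg").convert("RGB")
msgs = [{"role": "user", "content": [image, "Transcribe all text in this image."]}]

answer = model.chat(msgs=msgs, tokenizer=tokenizer)
print(answer)
```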

In speech mode, the model introduces bilingual real-time conversation, letting users configure different voices with control over emotion, speaking rate, and style, and even enabling playful applications such as role-playing and voice cloning. These innovations enrich the interactive experience, making communication feel more natural and fluid.
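Continuing from the snippet above (reusing `model` and `tokenizer`), the following is a hedged sketch of one speech turn that returns both text and synthesized audio. The TTS-related calls and arguments (`init_tts`, `use_tts_template`, `generate_audio`, `output_audio_path`) follow the usage shown in the project's model card; treat their exact names as assumptions to verify.

```python
# Hedged sketch: audio in, text + synthesized speech out.
import librosa

model.init_tts()  # load the TTS head before requesting audio output

# The chat() interface expects 16 kHz mono audio as a raw waveform.
audio, _ = librosa.load("question.wav", sr=16000, mono=True)
msgs = [{"role": "user", "content": [audio]}]

text = model.chat(
    msgs=msgs,
    tokenizer=tokenizer,
    sampling=True,
    max_new_tokens=128,
    use_tts_template=True,            # format the prompt for spoken output
    generate_audio=True,              # return speech alongside the text reply
    output_audio_path="answer.wav",   # where the synthesized speech is saved
)
print(text)
```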

Beyond breakthroughs in speech conversation, MiniCPM-o 2.6 makes significant progress in visual processing. Strong OCR (Optical Character Recognition) and multilingual support combine with efficient real-time video understanding, enabling multimodal live streaming on end-side devices for the first time: users can stream live content from devices like the iPad for a more interactive and engaging content-sharing experience.
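For video, the documented recipe is to sample frames and pass them to `chat()` as an interleaved image sequence. The sketch below assumes `decord` for decoding and borrows the `use_image_id` / `max_slice_nums` options from the MiniCPM-V 2.6 video example; confirm they carry over to MiniCPM-o 2.6.

```python
# Sketch: video question answering via uniform frame sampling.
from decord import VideoReader, cpu
from PIL import Image

def sample_frames(path, max_frames=32):
    """Uniformly sample up to max_frames RGB frames from a video file."""
    vr = VideoReader(path, ctx=cpu(0))
    step = max(1, len(vr) // max_frames)
    idx = list(range(0, len(vr), step))[:max_frames]
    return [Image.fromarray(f) for f in vr.get_batch(idx).asnumpy()]

frames = sample_frames("stream.mp4")
msgs = [{"role": "user", "content": frames + ["Describe what happens in this clip."]}]

answer = model.chat(
    msgs=msgs,
    tokenizer=tokenizer,
    use_image_id=False,  # treat frames as one sequence, per the video recipe
    max_slice_nums=2,    # fewer slices per frame keeps memory use in check
)
print(answer)
```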

Since February 2024, the MiniCPM series has shipped six releases, with the team continuously improving both model performance and deployment efficiency. MiniCPM-o 2.6 is not just a technical step forward; it marks real progress in multimodal interactive experience. Whether for professional applications or everyday entertainment, it is poised to become an indispensable smart assistant.

Project address: https://github.com/OpenBMB/MiniCPM-o