Recently, the VITA-MLLM team announced the launch of VITA-1.5, an upgraded version of VITA-1.0 aimed at improving the real-time performance and accuracy of multimodal interaction. VITA-1.5 supports both English and Chinese and delivers significant improvements across a range of performance metrics, giving users a smoother interactive experience.


In VITA-1.5, interaction latency has been cut from the original 4 seconds to just 1.5 seconds, making the delay nearly imperceptible during voice interactions. Multimodal performance has also improved markedly: evaluations show that the average score across several benchmarks, including MME, MMBench, and MathVista, rises from 59.8 in the previous version to 70.8, demonstrating its strong capabilities.

VITA-1.5 has also undergone extensive optimization of its speech processing. The error rate of its Automatic Speech Recognition (ASR) system has been reduced significantly, from 18.4 to 7.5, enabling it to understand and respond to voice commands more accurately. In addition, VITA-1.5 introduces an end-to-end Text-to-Speech (TTS) module that can directly accept embeddings from the large language model (LLM) as input, improving the naturalness and coherence of the synthesized speech.
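To make the end-to-end design concrete, the sketch below shows one way a TTS head could consume LLM hidden-state embeddings directly instead of decoded text. This is a minimal illustration, not VITA-1.5's actual implementation: the module name `EmbeddingTTSHead`, the dimensions, and the speech-token output are all assumptions.

```python
# Minimal sketch (not the VITA-1.5 implementation): a TTS head that consumes
# LLM hidden-state embeddings instead of decoded text.
# All module names and dimensions are illustrative assumptions.
import torch
import torch.nn as nn

class EmbeddingTTSHead(nn.Module):
    def __init__(self, llm_dim=4096, tts_dim=1024, n_speech_tokens=1024, n_layers=4):
        super().__init__()
        # Project LLM embeddings into the TTS decoder's space.
        self.proj = nn.Linear(llm_dim, tts_dim)
        layer = nn.TransformerEncoderLayer(d_model=tts_dim, nhead=8, batch_first=True)
        self.decoder = nn.TransformerEncoder(layer, num_layers=n_layers)
        # Predict discrete speech tokens (e.g., codec codes) for a vocoder.
        self.to_speech_tokens = nn.Linear(tts_dim, n_speech_tokens)

    def forward(self, llm_hidden_states):
        # llm_hidden_states: (batch, seq_len, llm_dim) taken from the language model.
        x = self.proj(llm_hidden_states)
        x = self.decoder(x)
        return self.to_speech_tokens(x)  # logits over a speech-token vocabulary

# Usage: feed the LLM's last hidden states straight into the TTS head,
# skipping a text -> external-TTS round trip.
llm_hidden = torch.randn(1, 32, 4096)           # placeholder LLM output
speech_logits = EmbeddingTTSHead()(llm_hidden)  # shape (1, 32, 1024)
```

Feeding embeddings rather than text lets the speech synthesizer be conditioned directly on the language model's internal representation, which is one way to pursue the naturalness and latency gains the release describes.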

To keep its multimodal capabilities balanced, VITA-1.5 adopts a progressive training strategy that minimizes the impact of the newly added speech modules on vision-language performance; image understanding dips only slightly, from 71.3 to 70.8. Through these technical innovations, the team has further pushed the boundaries of real-time visual and voice interaction, laying the groundwork for future intelligent interactive applications.
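A progressive strategy of this kind can be pictured as staged training in which only some components are trainable at a time, so that adding speech does not erase vision-language skills. The following is a rough sketch under assumed module names (`vision_encoder`, `audio_encoder`, `llm`) and an invented `train_one_stage` helper; the actual stage order and training data are described in the VITA-1.5 paper and repository.

```python
# Rough sketch of progressive (staged) training. Module names, stage contents,
# and the `train_one_stage` helper are hypothetical illustrations.
def set_trainable(module, flag):
    for p in module.parameters():
        p.requires_grad = flag

def progressive_training(model, stages):
    for stage in stages:
        # Unfreeze only this stage's target modules, so newly added speech
        # components don't disturb already-learned vision-language behavior.
        for name in ("vision_encoder", "audio_encoder", "llm"):
            set_trainable(getattr(model, name), name in stage["train"])
        train_one_stage(model, data=stage["data"])  # hypothetical training helper

stages = [
    {"train": {"vision_encoder", "llm"}, "data": "vision_language"},    # stage 1
    {"train": {"audio_encoder"},         "data": "speech_asr"},         # stage 2
    {"train": {"audio_encoder", "llm"},  "data": "speech_interaction"}, # stage 3
]
```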


On the usage side, developers can get started with VITA-1.5 through a few simple command-line operations, and the repository provides both a basic demo and a real-time interactive demo. For real-time interaction, users need to prepare a few supporting modules, such as a Voice Activity Detection (VAD) module. The VITA-1.5 code is open source, making it easy for a wide range of developers to participate and contribute.
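As an example of what "preparing a VAD module" can look like, the snippet below gates a recorded audio clip with the open-source Silero VAD (loaded via torch.hub) before handing it on to the model. This is a hedged illustration of the idea, not VITA-1.5's prescribed setup; the exact VAD component and demo commands are documented in the project's README, and the audio file name here is hypothetical.

```python
# Hedged sketch: use a VAD to check for speech before sending audio to the model.
# Silero VAD is used here as a commonly available example; consult the VITA-1.5
# README for the component its real-time demo actually expects.
import torch

model, utils = torch.hub.load("snakers4/silero-vad", "silero_vad")
(get_speech_timestamps, _, read_audio, _, _) = utils

wav = read_audio("question.wav", sampling_rate=16000)          # hypothetical file
speech = get_speech_timestamps(wav, model, sampling_rate=16000)

if speech:
    print("Speech detected, forwarding audio to VITA-1.5 for a response...")
else:
    print("No speech detected, skipping the request.")
```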

The launch of VITA-1.5 marks another significant advancement in the field of interactive multimodal large language models, reflecting the team's relentless pursuit of technological innovation and user experience.

Project link: https://github.com/VITA-MLLM/VITA?tab=readme-ov-file

Key Highlights:

🌟 VITA-1.5 significantly reduces interaction delay from 4 seconds to 1.5 seconds, greatly enhancing user experience.

📈 Improved multimodal performance, with the average score across multiple benchmarks rising from 59.8 to 70.8.

🔊 Enhanced speech processing capabilities, with ASR error rate decreasing from 18.4 to 7.5, resulting in more accurate speech recognition.