Chinese Open-Source Vision and Speech Model VITA-1.5 Released with GPT-4o-Level Speech and Visual Capabilities
Recently, multimodal large language models (MLLMs) have made significant progress, particularly in integrating the visual and text modalities. However, as human-computer interaction becomes more prevalent, the speech modality has grown in importance, especially for multimodal dialogue systems. Speech is not only a key medium for transmitting information; it also greatly enhances the naturalness and convenience of interaction. Nevertheless, integrating visual and speech data into MLLMs is not a trivial task because of their inherent differences: visual data conveys spatial information, while speech data unfolds as a temporal sequence.