Peking University has released Video-LLaVA, a large visual-language model that handles both image and video inputs. By using the LanguageBind encoder to pre-align visual features before they reach the language model, it addresses the misalignment between image and video representations. On video understanding, Video-LLaVA outperforms Video-ChatGPT across multiple datasets, and the pre-aligned visual representations not only improve image question-answering performance but also bring notable gains on other tasks.