The ModelScope community has open-sourced OneLLM, a unified multi-modal alignment framework. OneLLM uses a universal encoder and a unified projection module to align multi-modal inputs with the LLM. It supports understanding across multiple data modalities, including images, audio, and video, and shows strong zero-shot capability on tasks such as video-to-text and audio-video-to-text. The OneLLM code has been released on GitHub, where the related model weights and the model demo space can also be obtained.
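
To make the architecture concrete, here is a minimal PyTorch sketch of the pipeline described above: per-modality tokenizers feed a single shared ("universal") encoder, and a unified projection module maps the encoded tokens into the LLM's embedding space. All class names (`OneLLMSketch`, `UnifiedProjection`), dimensions, and the mixture-of-experts router are illustrative assumptions, not OneLLM's actual implementation or API.

```python
import torch
import torch.nn as nn

class UnifiedProjection(nn.Module):
    """Hypothetical unified projection: a small mixture of linear experts
    with a soft per-token router, mapping encoder features to LLM width."""
    def __init__(self, enc_dim: int, llm_dim: int, num_experts: int = 3):
        super().__init__()
        self.experts = nn.ModuleList(
            [nn.Linear(enc_dim, llm_dim) for _ in range(num_experts)]
        )
        self.router = nn.Linear(enc_dim, num_experts)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq, enc_dim) -> per-token weights over experts
        w = self.router(x).softmax(dim=-1)                        # (B, S, E)
        outs = torch.stack([e(x) for e in self.experts], dim=-1)  # (B, S, llm_dim, E)
        return (outs * w.unsqueeze(2)).sum(dim=-1)                # (B, S, llm_dim)

class OneLLMSketch(nn.Module):
    """Illustrative pipeline: modality tokenizer -> shared encoder -> projection."""
    def __init__(self, enc_dim: int = 768, llm_dim: int = 4096):
        super().__init__()
        # One lightweight tokenizer per modality; a single shared encoder.
        self.tokenizers = nn.ModuleDict({
            "image": nn.LazyLinear(enc_dim),
            "audio": nn.LazyLinear(enc_dim),
            "video": nn.LazyLinear(enc_dim),
        })
        layer = nn.TransformerEncoderLayer(enc_dim, nhead=8, batch_first=True)
        self.universal_encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.projection = UnifiedProjection(enc_dim, llm_dim)

    def forward(self, feats: torch.Tensor, modality: str) -> torch.Tensor:
        tokens = self.tokenizers[modality](feats)  # (B, S, enc_dim)
        encoded = self.universal_encoder(tokens)   # shared across all modalities
        return self.projection(encoded)            # aligned to the LLM embedding space

model = OneLLMSketch()
video_feats = torch.randn(1, 16, 512)  # e.g. 16 frame patches, 512-dim raw features
llm_inputs = model(video_feats, "video")
print(llm_inputs.shape)  # torch.Size([1, 16, 4096])
```

The key design point this sketch illustrates is weight sharing: because the encoder and projection are shared across modalities rather than duplicated per input type, adding a new modality only requires a new lightweight tokenizer, which is what enables the cross-modal zero-shot behavior mentioned above.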