mPLUG-Owl3, the latest release from the Alibaba team, is a general-purpose multi-modal large model whose core capability is understanding long image sequences. Its hyper attention module lets the model process visual and language information efficiently, supporting in-depth understanding of multi-modal data such as images and videos. The model is reported to deliver significant gains in inference efficiency, image processing, and the application of multi-modal knowledge. The gains are most visible in video understanding, where it can reportedly 'watch' a 2-hour movie in 4 seconds and accurately answer questions about it.