The Beijing Academy of Artificial Intelligence (BAAI) has announced the launch of Emu3, a native multimodal world model. Built solely on next-token prediction, the model can understand and generate text, images, and video without relying on diffusion or compositional approaches. Emu3 surpasses well-known open-source models such as SDXL, LLaVA, and OpenSora on image generation, video generation, and vision-language understanding tasks.

At the core of Emu3 is a powerful visual tokenizer that converts images and videos into discrete tokens, which are fed into the model alongside the discrete tokens produced by a text tokenizer. The model's output tokens can in turn be decoded back into text, images, and videos, providing a unified paradigm for Any-to-Any tasks. The flexibility of the next-token prediction framework also allows Direct Preference Optimization (DPO) to be applied seamlessly to autoregressive visual generation, aligning the model with human preferences.
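
To make the idea concrete, here is a minimal conceptual sketch (not the Emu3 implementation) of how discrete visual codes and text token ids can share one vocabulary and be trained with a single next-token prediction loss; the vocabulary sizes, the offset scheme, and the toy model are all illustrative assumptions.

```python
# Conceptual sketch: interleaving text tokens and discrete visual codes into one
# sequence and training with ordinary next-token prediction. Not Emu3's code.
import torch
import torch.nn as nn

TEXT_VOCAB = 32000            # assumed text-tokenizer vocabulary size
VISION_VOCAB = 32768          # assumed visual-tokenizer codebook size
VOCAB = TEXT_VOCAB + VISION_VOCAB  # shared id space: vision codes offset past text ids

def to_shared_ids(text_ids, vision_codes):
    """Map text ids and visual codebook indices into one shared vocabulary."""
    return torch.cat([text_ids, vision_codes + TEXT_VOCAB], dim=-1)

# Toy stand-in for the autoregressive transformer backbone.
model = nn.Sequential(nn.Embedding(VOCAB, 256), nn.Linear(256, VOCAB))

# Fake data: a short caption followed by a 4x4 grid of visual codes.
text_ids = torch.randint(0, TEXT_VOCAB, (1, 8))
vision_codes = torch.randint(0, VISION_VOCAB, (1, 16))
tokens = to_shared_ids(text_ids, vision_codes)

# Standard next-token prediction loss over the mixed-modality sequence.
logits = model(tokens[:, :-1])
loss = nn.functional.cross_entropy(logits.reshape(-1, VOCAB), tokens[:, 1:].reshape(-1))
print(loss.item())
```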

The Emu3 results demonstrate that next-token prediction can serve as a powerful paradigm for multimodal models, enabling large-scale multimodal learning beyond language and achieving strong performance across multimodal tasks. By collapsing complex multimodal designs into a single token space, Emu3 unlocks significant potential for large-scale training and inference and points to a promising path toward multimodal AGI.
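
Because generation reduces to next-token prediction, preference alignment can reuse the standard DPO objective over sequence log-probabilities, whether the sequences are text or image tokens. The sketch below is a generic DPO loss, not the Emu3 training code; the beta value and the toy log-probabilities are illustrative.

```python
# Sketch of the DPO objective applied to autoregressive (token-level) generation.
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """DPO loss over per-sequence log-probabilities.

    Each argument is the summed log-probability the policy (or frozen reference)
    model assigns to the preferred / dispreferred token sequence.
    """
    chosen_margin = policy_chosen_logp - ref_chosen_logp
    rejected_margin = policy_rejected_logp - ref_rejected_logp
    return -F.logsigmoid(beta * (chosen_margin - rejected_margin)).mean()

# Toy values: the policy slightly prefers the human-preferred generation.
loss = dpo_loss(torch.tensor([-120.0]), torch.tensor([-150.0]),
                torch.tensor([-130.0]), torch.tensor([-150.0]))
print(loss.item())
```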

The key technologies and models of Emu3 have now been open-sourced, including the SFT-trained chat and generation models and the corresponding SFT training code, to support further research and community development.

Code: https://github.com/baaivision/Emu3

Project Page: https://emu.baai.ac.cn/

Models: https://huggingface.co/collections/BAAI/emu3-66f4e64f70850ff358a2e60f
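
For orientation, the sketch below shows one way the released chat checkpoint might be loaded through Hugging Face transformers. The model id `BAAI/Emu3-Chat` and the use of `trust_remote_code=True` are assumptions based on the linked collection; consult the repository for the exact processor setup, since multimodal prompts additionally require the released vision tokenizer.

```python
# Minimal loading sketch (assumed model id and remote-code loading; see the
# GitHub repo for the full image / vision-tokenizer pipeline).
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "BAAI/Emu3-Chat"  # assumed id from the linked collection
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id, trust_remote_code=True, device_map="auto"
)

# Text-only prompt for brevity; image inputs are first converted to discrete
# visual tokens by the released vision tokenizer.
inputs = tokenizer("Describe what a unified multimodal tokenizer does.",
                   return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```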