The Emu3 team at the Beijing Academy of Artificial Intelligence (BAAI) has released Emu3, a new multimodal model trained solely with next-token prediction. It departs from the dominant diffusion and compositional architectures, yet achieves state-of-the-art performance on both generation and perception tasks.

Next-token prediction has long been regarded as a promising path toward Artificial General Intelligence (AGI), but it has lagged behind in multimodal tasks, a field currently dominated by diffusion models (such as Stable Diffusion) and compositional models (such as a CLIP vision encoder paired with an LLM). The Emu3 team instead tokenizes images, text, and video into a shared discrete space and trains a single Transformer from scratch on mixed multimodal sequences, unifying multimodal tasks without relying on diffusion or compositional architectures.
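As a rough illustration of this objective (a sketch, not BAAI's actual code), the snippet below trains a tiny decoder-only Transformer on an interleaved text-plus-vision token sequence with a single cross-entropy loss. All sizes, names, and the vocabulary split are hypothetical:

```python
import torch
import torch.nn as nn

# Hypothetical sizes: one shared vocabulary holding text tokens plus
# discrete vision-tokenizer codes (all numbers here are illustrative).
TEXT_VOCAB, VISION_VOCAB, D_MODEL, SEQ_LEN = 32000, 32768, 512, 256
VOCAB = TEXT_VOCAB + VISION_VOCAB  # unified token space

class TinyDecoder(nn.Module):
    """Minimal decoder-only Transformer; a stand-in for Emu3's backbone."""
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(VOCAB, D_MODEL)
        layer = nn.TransformerEncoderLayer(D_MODEL, nhead=8, batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, num_layers=2)
        self.head = nn.Linear(D_MODEL, VOCAB)

    def forward(self, tokens):
        x = self.embed(tokens)
        # Causal mask: each position attends only to earlier positions.
        mask = nn.Transformer.generate_square_subsequent_mask(tokens.size(1))
        return self.head(self.blocks(x, mask=mask))

model = TinyDecoder()
opt = torch.optim.AdamW(model.parameters(), lr=3e-4)

# A mixed sequence: text tokens followed by vision-tokenizer codes
# (offset into the shared vocabulary); one prediction target per step.
seq = torch.cat([
    torch.randint(0, TEXT_VOCAB, (1, SEQ_LEN // 2)),
    torch.randint(TEXT_VOCAB, VOCAB, (1, SEQ_LEN // 2)),
], dim=1)

logits = model(seq[:, :-1])  # predict token t+1 from the prefix up to t
loss = nn.functional.cross_entropy(
    logits.reshape(-1, VOCAB), seq[:, 1:].reshape(-1))
loss.backward()
opt.step()
```

The key point is that text and vision tokens flow through the same model and the same loss; no modality-specific objective is needed.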

Emu3 outperforms established task-specific models on both generation and perception tasks, surpassing flagship models such as SDXL in image generation and LLaVA-1.6 in vision-language understanding. Emu3 can also generate high-fidelity video. Unlike Sora, which uses a video diffusion model to generate video from noise, Emu3 generates video causally, predicting one token at a time; given a video as context, the model can simulate certain aspects of the real world, such as environments, people, and animals, and predict what will happen next.
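To make the contrast with diffusion concrete, here is a minimal autoregressive decoding loop, again a sketch under assumed interfaces rather than Emu3's implementation. Video tokens are sampled one at a time and appended to the context, with no iterative denoising:

```python
import torch

@torch.no_grad()
def generate_video_tokens(model, prompt_tokens, n_new, temperature=1.0):
    """Causal generation: repeatedly sample the next discrete video token
    and extend the sequence. `model` is any decoder returning per-position
    next-token logits (e.g. the TinyDecoder sketch above)."""
    seq = prompt_tokens
    for _ in range(n_new):
        logits = model(seq)[:, -1, :] / temperature  # logits for next position
        probs = torch.softmax(logits, dim=-1)
        nxt = torch.multinomial(probs, num_samples=1)  # sample one token
        seq = torch.cat([seq, nxt], dim=1)             # grow the context
    return seq

# The new token ids would then go through the visual tokenizer's decoder
# (hypothetical call) to render pixels:
# frames = vision_tokenizer.decode(seq[:, prompt_tokens.size(1):])
```

Because each step conditions on everything generated so far, extending a video is the same operation as generating one: keep predicting the next token.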

Emu3 simplifies complex multimodal model design by reducing everything to a single stream of tokens, which unlocks significant scaling potential during both training and inference. The results indicate that next-token prediction is an effective way to build general multimodal intelligence beyond language. To support further research in this field, the Emu3 team has open-sourced key technologies and models, including a powerful visual tokenizer, previously unavailable publicly, that converts videos and images into discrete tokens.
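The actual tokenizer classes and weights live in the repository linked below; the stand-in here only illustrates the interface such a discrete visual tokenizer exposes, encoding pixels to a grid of codebook indices and decoding indices back to pixels. All names, sizes, and internals are invented for illustration:

```python
import torch

class DiscreteVisualTokenizer:
    """Hypothetical stand-in: pixels -> codebook indices -> pixels."""
    def __init__(self, codebook_size=32768, patch=8):
        self.codebook_size, self.patch = codebook_size, patch

    def encode(self, images):
        # Real tokenizers use a learned encoder plus vector quantization
        # against a codebook; here we just return random indices.
        b, _, h, w = images.shape
        return torch.randint(0, self.codebook_size,
                             (b, h // self.patch, w // self.patch))

    def decode(self, codes):
        # Real tokenizers look up codebook embeddings and run a learned
        # decoder back to pixels; here we return random pixels.
        b, gh, gw = codes.shape
        return torch.rand(b, 3, gh * self.patch, gw * self.patch)

tok = DiscreteVisualTokenizer()
imgs = torch.rand(1, 3, 256, 256)   # one RGB image in [0, 1]
codes = tok.encode(imgs)            # -> (1, 32, 32) grid of discrete tokens
recon = tok.decode(codes)           # -> (1, 3, 256, 256) reconstruction
print(codes.shape, recon.shape)
```

Once images and video frames are grids of discrete indices like this, they can be flattened into the same sequence as text tokens and consumed by a standard language-model Transformer.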

Emu3's success points to a promising direction for multimodal model development and brings new hope for achieving AGI.

Project link: https://github.com/baaivision/Emu3