The Beijing Academy of Artificial Intelligence (BAAI, Zhiyuan) has announced Emu3, a native multimodal world model. Emu3 is built entirely on next-token prediction: it understands and generates text, images, and video within a single model, without relying on diffusion models or compositional pipelines. On tasks including image generation, video generation, and vision-language understanding, Emu3 outperforms well-known open-source models such as SDXL, LLaVA, and OpenSora.