The Beijing Academy of Artificial Intelligence (BAAI) has announced the release of Emu3, a native multimodal world model. Built solely on next-token prediction, it can understand and generate text, images, and videos without relying on diffusion models or compositional approaches. Emu3 surpasses well-known open-source models such as SDXL, LLaVA, and OpenSora on tasks including image generation, video generation, and vision-language understanding.
At the core of Emu3 is a powerful visual tokenizer that converts images and videos into discrete tokens, which are fed into the model alongside the discrete tokens produced by a text tokenizer. The model's output tokens can then be decoded back into text, images, and videos, providing a unified research paradigm for Any-to-Any tasks. The flexibility of Emu3's next-token prediction framework also allows Direct Preference Optimization (DPO) to be applied seamlessly to autoregressive visual generation, aligning the model with human preferences.
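To make the idea concrete, below is a minimal, illustrative PyTorch sketch of this unified next-token objective: discrete visual codes and text tokens share one vocabulary, are concatenated into a single sequence, and a decoder-only model is trained with an ordinary cross-entropy loss on the shifted sequence. All class names, vocabulary sizes, and shapes here are hypothetical and are not taken from the Emu3 codebase.

```python
# Illustrative sketch only: hypothetical names and sizes, not the Emu3 implementation.
import torch
import torch.nn as nn

TEXT_VOCAB = 32_000                 # assumed text vocabulary size
VISION_VOCAB = 8_192                # assumed visual codebook size
VOCAB = TEXT_VOCAB + VISION_VOCAB   # one shared token space for both modalities

class ToyDecoder(nn.Module):
    """Stand-in for a decoder-only transformer over the shared vocabulary."""
    def __init__(self, vocab_size: int, dim: int = 512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim)
        layer = nn.TransformerEncoderLayer(dim, nhead=8, batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, num_layers=4)
        self.head = nn.Linear(dim, vocab_size)

    def forward(self, ids: torch.Tensor) -> torch.Tensor:
        # Causal mask so each position only attends to earlier tokens.
        mask = nn.Transformer.generate_square_subsequent_mask(ids.size(1))
        hidden = self.blocks(self.embed(ids), mask=mask)
        return self.head(hidden)

# Pretend outputs of a text tokenizer and a visual tokenizer (discrete codes).
text_ids = torch.randint(0, TEXT_VOCAB, (1, 16))
image_ids = torch.randint(0, VISION_VOCAB, (1, 64)) + TEXT_VOCAB  # offset into shared space

# One interleaved sequence: the model predicts the next token regardless of modality.
ids = torch.cat([text_ids, image_ids], dim=1)
model = ToyDecoder(VOCAB)
logits = model(ids[:, :-1])
loss = nn.functional.cross_entropy(logits.reshape(-1, VOCAB), ids[:, 1:].reshape(-1))
loss.backward()
```

Because generation reduces to sampling from this single token stream, preference-based fine-tuning methods such as DPO can be applied to visual outputs in the same way they are applied to text.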
The Emu3 results demonstrate that next-token prediction can serve as a powerful paradigm for multimodal models, enabling large-scale multimodal learning beyond language and achieving strong performance on multimodal tasks. By collapsing complex multimodal designs down to a single token-level framework, Emu3 unlocks significant potential for large-scale training and inference. This achievement paves a promising path toward building multimodal AGI.
The key technologies and models of Emu3 have now been open-sourced, including the SFT-trained chat and generation models along with the corresponding SFT training code, to support further research and community development.
Code: https://github.com/baaivision/Emu3
Project Page: https://emu.baai.ac.cn/
Models: https://huggingface.co/collections/BAAI/emu3-66f4e64f70850ff358a2e60f
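For readers who want to try the released checkpoints, the sketch below assumes they can be loaded through the standard Hugging Face transformers interface with custom code enabled; the repository ID shown is a placeholder, so check the collection linked above for the exact model names.

```python
# Hedged sketch: assumes the released checkpoints expose a standard
# transformers interface when trust_remote_code is enabled.
# The model ID below is a placeholder; verify it against the collection above.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "BAAI/Emu3-Chat"  # placeholder repository name
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(model_id, trust_remote_code=True)

# Quick sanity check that the weights loaded.
num_params = sum(p.numel() for p in model.parameters())
print(f"Loaded {model_id}: {num_params / 1e9:.1f}B parameters")
```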