The Beijing Academy of Artificial Intelligence (BAAI, Zhiyuan) has announced the launch of Emu3, a native multimodal world model. Built entirely on next-token prediction, the model achieves both understanding and generation across text, image, and video without relying on diffusion models or compositional pipelines. Emu3 outperforms well-known open-source models such as SDXL, LLaVA, and OpenSora on image generation, video generation, and visual-language understanding tasks.
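The core idea behind this design is straightforward to sketch: quantize images or video frames into discrete tokens, merge them with the text vocabulary, and train a single decoder-only transformer with ordinary next-token cross-entropy. The toy model below illustrates the pattern; the vocabulary sizes, model dimensions, and random placeholder tokens are invented for illustration and are not Emu3's actual tokenizer or architecture.

```python
import torch
import torch.nn as nn

# Hypothetical sizes for illustration only; Emu3's real vocabulary is far
# larger (discrete image/video tokens merged with the text vocabulary).
TEXT_VOCAB = 1000      # assumed text-token ids: 0 .. 999
IMAGE_VOCAB = 4000     # assumed image-token ids: 1000 .. 4999
VOCAB_SIZE = TEXT_VOCAB + IMAGE_VOCAB

class TinyAutoregressiveLM(nn.Module):
    """Minimal decoder-only model over a shared text+image vocabulary."""
    def __init__(self, d_model=128, n_heads=4, n_layers=2, max_len=256):
        super().__init__()
        self.embed = nn.Embedding(VOCAB_SIZE, d_model)
        self.pos = nn.Embedding(max_len, d_model)
        layer = nn.TransformerEncoderLayer(
            d_model, n_heads, dim_feedforward=4 * d_model, batch_first=True
        )
        self.blocks = nn.TransformerEncoder(layer, n_layers)
        self.head = nn.Linear(d_model, VOCAB_SIZE)

    def forward(self, tokens):
        T = tokens.size(1)
        x = self.embed(tokens) + self.pos(torch.arange(T, device=tokens.device))
        # Causal mask: each position attends only to earlier tokens.
        mask = nn.Transformer.generate_square_subsequent_mask(T).to(tokens.device)
        x = self.blocks(x, mask=mask)
        return self.head(x)  # next-token logits over the unified vocabulary

# One training step: a caption followed by its image tokens, optimized with
# plain next-token cross-entropy -- no diffusion head anywhere.
model = TinyAutoregressiveLM()
text = torch.randint(0, TEXT_VOCAB, (1, 16))            # placeholder caption tokens
image = torch.randint(TEXT_VOCAB, VOCAB_SIZE, (1, 64))  # placeholder image tokens
seq = torch.cat([text, image], dim=1)
logits = model(seq[:, :-1])
loss = nn.functional.cross_entropy(
    logits.reshape(-1, VOCAB_SIZE), seq[:, 1:].reshape(-1)
)
loss.backward()
print(f"next-token loss: {loss.item():.3f}")
```

Because a single objective covers every modality, generating an image reduces to sampling image tokens autoregressively and decoding them back to pixels, which is what lets one model handle understanding and generation alike.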
Apple Inc. recently released a major update to its multimodal AI model MM1, upgrading it to MM1.5. The update is more than a version-number bump: it delivers a broad improvement in capability across a range of tasks. At the core of the upgrade is a data-centric training approach, in which the training dataset is carefully selected and optimized. Specifically, MM1.5 incorporates high-resolution OCR data and synthetic image captions during continued pre-training, along with an optimized visual instruction-tuning data mixture.
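To make the idea of a data-centric mixture concrete, the sketch below draws training batches from weighted data sources. The source names, weights, and toy corpora are hypothetical stand-ins, not Apple's published recipe; they only show how mixture ratios steer what the model sees during training.

```python
import random

# Hypothetical source names and mixture weights, purely for illustration.
DATA_MIXTURE = {
    "ocr_documents": 0.45,       # text-rich, high-resolution images
    "synthetic_captions": 0.35,  # model-generated image descriptions
    "interleaved_web": 0.20,     # generic image-text pairs
}

def sample_batch(sources, mixture, batch_size=8, seed=0):
    """Draw a batch whose composition follows the mixture weights."""
    rng = random.Random(seed)
    names = list(mixture)
    weights = [mixture[n] for n in names]
    batch = []
    for _ in range(batch_size):
        name = rng.choices(names, weights=weights, k=1)[0]
        batch.append(rng.choice(sources[name]))
    return batch

# Toy stand-in corpora keyed by source name.
sources = {name: [f"{name}-example-{i}" for i in range(100)]
           for name in DATA_MIXTURE}
print(sample_batch(sources, DATA_MIXTURE))
```

Tuning weights like these, rather than the model architecture, is the essence of a data-centric approach: the same network can gain markedly on OCR-heavy or caption-style benchmarks simply because it trains on more of the relevant data.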