Large Language Models (LLMs) have made significant strides in artificial intelligence, and multimodal fusion is one of their most active frontiers. A collaborative team from Huazhong University of Science and Technology, ByteDance, and the University of Hong Kong recently introduced Liquid, a novel multimodal generation framework designed to address the limitations of current mainstream multimodal models in visual processing.
Mainstream multimodal LLMs typically rely on external visual modules such as pretrained vision encoders or separate image generators, which adds system complexity and limits scalability. Liquid's key innovation is to use a VQGAN as its image tokenizer, eliminating the need for such external components: images are encoded into discrete visual tokens that share a single vocabulary with text tokens, giving the LLM "native" visual understanding and generation capabilities.
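To make the shared-vocabulary idea concrete, here is a minimal, self-contained sketch. All sizes, token IDs, and the begin/end-of-image markers are illustrative assumptions for demonstration, not values or code from the paper.

```python
# Illustrative sketch (not the authors' code): VQGAN codebook indices are mapped
# into the same ID space as text tokens, so one decoder-only LLM can model both.

TEXT_VOCAB_SIZE = 32000       # assumed size of the base LLM's text vocabulary
IMAGE_CODEBOOK_SIZE = 8192    # assumed size of the VQGAN codebook

# Hypothetical begin/end-of-image marker tokens placed around the image codes.
BOI_ID = TEXT_VOCAB_SIZE + IMAGE_CODEBOOK_SIZE
EOI_ID = BOI_ID + 1

def image_code_to_token_id(code: int) -> int:
    """Offset a VQGAN codebook index past the text vocabulary."""
    assert 0 <= code < IMAGE_CODEBOOK_SIZE
    return TEXT_VOCAB_SIZE + code

def build_sequence(text_ids, image_codes):
    """Interleave text token IDs and offset image codes into one
    autoregressive sequence, the way a unified LLM would consume it."""
    return (
        list(text_ids)
        + [BOI_ID]
        + [image_code_to_token_id(c) for c in image_codes]
        + [EOI_ID]
    )

# Toy example: a short caption followed by a tiny grid of VQGAN codes.
caption_ids = [101, 2057, 318, 257]   # placeholder text token IDs
vqgan_codes = [5, 4090, 17, 777]      # placeholder codebook indices
print(build_sequence(caption_ids, vqgan_codes))
```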
The research shows that Liquid not only lowers training costs but also reveals that multimodal capabilities follow the same kind of scaling laws as language modeling. Experiments on LLMs ranging from 0.5B to 32B parameters demonstrate that, as model size grows, performance and generation quality on visual generation tasks improve along scaling curves consistent with those of language tasks. More strikingly, visual understanding and generation reinforce each other: the two tasks can be jointly optimized in a shared representation space.
Liquid's design is deliberately minimalist, treating images and text on equal footing within a unified processing framework. For training, the team used 30 million text samples and 30 million image-text pairs as the foundation of the model's multimodal training mix. The final experiments show that Liquid performs strongly on multimodal understanding, image generation, and pure-text tasks, and that the semantic consistency between its generated images and the input text is significantly higher than that of other autoregressive models.
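As a rough illustration of what "treating images and text equally" means in practice, the sketch below applies a single next-token-prediction loss to a mixed sequence of text and image tokens. The vocabulary split, sequence length, and random stand-in logits are assumptions made for demonstration; a real setup would use the model's actual outputs rather than random tensors.

```python
import torch
import torch.nn.functional as F

# Assumed vocabulary layout: text tokens + VQGAN codes + two image markers.
VOCAB_SIZE = 32000 + 8192 + 2

# Stand-in mixed sequence of text and image token IDs (batch of 1, length 16).
sequence = torch.randint(0, VOCAB_SIZE, (1, 16))

# Stand-in logits; a real decoder-only transformer would produce these.
logits = torch.randn(1, 16, VOCAB_SIZE)

# One shifted cross-entropy loss covers both modalities: every position,
# whether it holds a text token or an image code, predicts the next token.
loss = F.cross_entropy(
    logits[:, :-1].reshape(-1, VOCAB_SIZE),
    sequence[:, 1:].reshape(-1),
)
print(f"unified next-token loss: {loss.item():.3f}")
```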
Liquid offers a new architectural blueprint for general-purpose multimodal intelligence, pointing toward a more efficient and flexible path for multimodal fusion in AI.
Paper link: https://arxiv.org/pdf/2412.04332