The Zero-One-Everything Yi-VL multimodal language model is the latest addition to the Yi family of models, designed for both visual comprehension and conversational generation. Yi-VL achieved leading results on the English benchmark MMMU and the Chinese benchmark CMMMU; in particular, Yi-VL-34B reached 41.6% accuracy on MMMU, surpassing other large multimodal models and demonstrating strong interdisciplinary knowledge comprehension and application.

Yi-VL is built on the open-source LLaVA architecture and consists of three components: a Vision Transformer (ViT), a projection module, and a large language model (Yi-34B-Chat or Yi-6B-Chat). The ViT encodes the input image, the projection module aligns the image features with the text feature space, and the language model provides language comprehension and generation.
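To make the three-stage design concrete, the sketch below shows how a LLaVA-style pipeline wires the pieces together: ViT patch features are passed through a small projection MLP into the language model's embedding space, then concatenated with the text token embeddings. This is a minimal illustration, not the actual Yi-VL code; the class name, dimensions, and two-layer MLP shape are assumptions chosen for clarity.

```python
import torch
import torch.nn as nn

class VisionLanguageConnector(nn.Module):
    """Illustrative LLaVA-style projection: ViT features -> LLM embedding space.

    Dimensions are placeholders, not the real Yi-VL configuration.
    """

    def __init__(self, vit_dim: int = 1024, llm_dim: int = 4096):
        super().__init__()
        # A small MLP maps each image patch feature into the same
        # vector space as the language model's text token embeddings.
        self.projection = nn.Sequential(
            nn.Linear(vit_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, image_patch_features: torch.Tensor) -> torch.Tensor:
        # image_patch_features: (batch, num_patches, vit_dim) from the ViT encoder
        return self.projection(image_patch_features)


# Usage: project dummy ViT outputs, then concatenate them with text embeddings
# so the combined sequence can be fed to the language model.
vit_features = torch.randn(1, 576, 1024)       # e.g. 24x24 patches from a ViT
connector = VisionLanguageConnector()
visual_tokens = connector(vit_features)        # (1, 576, 4096), aligned with text space
text_embeddings = torch.randn(1, 32, 4096)     # embeddings of the text prompt
llm_input = torch.cat([visual_tokens, text_embeddings], dim=1)
print(llm_input.shape)                         # torch.Size([1, 608, 4096])
```

The key design point this illustrates is that the language model itself is unchanged: only the projection module has to learn how to translate visual features into "tokens" the LLM already knows how to read.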