Recently, the field of artificial intelligence has focused on the seamless integration of vision and language, and the advent of large language models (LLMs) has driven significant advances. However, for multi-modal AGI systems, the development of vision and vision-language foundation models still lags behind. To bridge this gap, researchers from Nanjing University, OpenGVLab, Shanghai Artificial Intelligence Laboratory, the University of Hong Kong, the Chinese University of Hong Kong, Tsinghua University, the University of Science and Technology of China, and SenseTime Research have proposed InternVL, a model that scales up visual foundation models and adapts them to general vision-language tasks. InternVL outperforms existing methods on 32 general vision-language benchmarks, demonstrating superior performance across a variety of tasks, including image and video classification, image-text and video-text retrieval, image captioning, visual question answering, and multi-modal dialogue.