Tsinghua KEG and Zhipu AI have released CogVLM, a new-generation multi-modal large model that achieves deep fusion of visual and linguistic features. CogVLM-17B reaches SOTA or second-place results on multiple datasets, demonstrating strong performance. The architecture consists of a ViT encoder, an MLP adapter, a pre-trained large language model, and a visual expert module. CogVLM was pre-trained on 1.5 billion image-text pairs and shows solid results on multi-modal benchmarks; compared with other models, it performs particularly well in image understanding, hallucination suppression, and text recognition. The model has also been open-sourced to promote further development of multi-modal models in research and application. This release aims to advance research on multi-modal foundation models, achieve multi-modal understanding, and lay a solid foundation for intelligent applications.
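The four-stage data flow described above (ViT encoder, MLP adapter, language model, visual expert) can be sketched in miniature. This is a hypothetical NumPy illustration with toy dimensions and random weights, not CogVLM's actual implementation; the key idea it shows is that the visual expert gives image tokens their own projection matrices while text tokens keep the frozen language-model weights.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy dimensions (illustrative only, far smaller than the real model)
d_vit, d_llm = 32, 64
n_img, n_txt = 4, 6              # image tokens and text tokens

# 1) ViT encoder output: one feature vector per image patch
vit_feats = rng.standard_normal((n_img, d_vit))

# 2) MLP adapter: project visual features into the LLM embedding space
W1 = rng.standard_normal((d_vit, d_llm))
W2 = rng.standard_normal((d_llm, d_llm))
img_emb = np.maximum(vit_feats @ W1, 0) @ W2     # simple ReLU MLP

# 3) Text embeddings from the language model's embedding table
txt_emb = rng.standard_normal((n_txt, d_llm))

# The input sequence: image tokens followed by text tokens
seq = np.concatenate([img_emb, txt_emb], axis=0)
is_image = np.array([True] * n_img + [False] * n_txt)

# 4) Visual expert: image-token positions use their own trainable
#    query projection, while text tokens reuse the original LLM weights
Wq_text = rng.standard_normal((d_llm, d_llm))    # frozen LLM projection
Wq_img = rng.standard_normal((d_llm, d_llm))     # visual-expert projection
q = np.where(is_image[:, None], seq @ Wq_img, seq @ Wq_text)

print(q.shape)  # (10, 64)
```

In the real model this per-token routing is applied to the full QKV and FFN weights in every transformer layer, which is what lets CogVLM fuse modalities deeply without degrading the frozen language model's text ability.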