Aya Vision 32B is a vision-language model from Cohere For AI with 32 billion parameters and support for 23 languages, including English, Chinese, and Arabic. It pairs the multilingual language model Aya Expanse 32B with the SigLIP2 vision encoder, connecting the two through a multimodal adapter to integrate visual and linguistic understanding. The model handles complex image-and-text tasks such as OCR, image captioning, and visual reasoning. Its open-weight release is intended to broaden access to multimodal research, giving researchers worldwide a capable starting point. The weights are licensed under CC-BY-NC and are subject to Cohere For AI's acceptable use policy.