Cohere, an AI startup, released Aya Vision, a multimodal "open" AI model, this week through its non-profit research lab, which claims the model is industry-leading.

Aya Vision performs multiple tasks, including image captioning, answering questions about photos, translating text, and generating summaries in 23 major languages. Cohere is also making Aya Vision freely available via WhatsApp, aiming to make the technology more accessible to researchers worldwide.

Cohere notes in its blog that while AI has made significant progress, a large gap remains in model performance across different languages, especially in multimodal tasks involving text and images. "Aya Vision aims to help bridge this gap."

Aya Vision comes in two versions: Aya Vision 32B and Aya Vision 8B. The more capable Aya Vision 32B, which Cohere dubs a "new frontier," outperforms models twice its size on some visual-understanding benchmarks, including Meta's Llama-3.2 90B Vision. Meanwhile, Aya Vision 8B surpasses some models ten times its size in certain evaluations.

Both models are available on the AI development platform Hugging Face under a Creative Commons Attribution-NonCommercial 4.0 (CC BY-NC 4.0) license, subject to Cohere's acceptable use addendum, and cannot be used for commercial applications.
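
For readers who want to try the weights, here is a minimal sketch of what loading the model through Hugging Face's transformers library might look like. The repository ID, pipeline task name, and message format below are assumptions based on the release described above, so check the official model card for exact usage:

```python
# Hedged sketch: running Aya Vision 8B via Hugging Face transformers.
# The repo ID "CohereForAI/aya-vision-8b" and image-text-to-text support
# are assumptions, not confirmed details from this article.
from transformers import pipeline

pipe = pipeline("image-text-to-text", model="CohereForAI/aya-vision-8b")

messages = [
    {
        "role": "user",
        "content": [
            # Placeholder image URL for illustration only.
            {"type": "image", "url": "https://example.com/street-sign.jpg"},
            {"type": "text", "text": "What does this sign say? Answer in Hindi."},
        ],
    }
]

print(pipe(text=messages, max_new_tokens=128, return_full_text=False))
```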

Cohere states that Aya Vision was trained using a "diverse" pool of English datasets, which the lab translated and then used to create synthetic annotations for training. Synthetic annotations are AI-generated labels that help a model understand and interpret data during training. While synthetic data has potential drawbacks, competitors like OpenAI are increasingly using it to train models.

Cohere points out that training Aya Vision with synthetic annotations allowed them to reduce resource usage while still achieving competitive performance. "This showcases our commitment to efficiency, achieving more with fewer computational resources."
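
To make that recipe concrete, the sketch below illustrates the general translate-then-annotate pattern the post describes. The helper functions are stubs standing in for a teacher vision-language model and a translation step; this is not Cohere's actual pipeline:

```python
# Minimal sketch of a synthetic-annotation recipe: caption images in
# English with a teacher model, translate the captions, and keep the
# (image, language, caption) triples as training data. All model calls
# are stubs for illustration.
from dataclasses import dataclass

@dataclass
class Example:
    image_path: str
    language: str
    caption: str

def caption_in_english(image_path: str) -> str:
    # Stand-in for a teacher vision-language model.
    return f"A synthetic English caption for {image_path}"

def translate(text: str, language: str) -> str:
    # Stand-in for a machine-translation step.
    return f"[{language}] {text}"

def build_synthetic_dataset(images, languages):
    dataset = []
    for image_path in images:
        english = caption_in_english(image_path)
        for lang in languages:
            dataset.append(Example(image_path, lang, translate(english, lang)))
    return dataset

if __name__ == "__main__":
    for ex in build_synthetic_dataset(["cat.jpg"], ["fr", "hi", "ar"]):
        print(ex)
```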

To further support the research community, Cohere also released a new benchmark suite, AyaVisionBench, designed to test models' capabilities in combined vision and language tasks, such as identifying differences between two images and converting screenshots to code.
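
A hypothetical harness for collecting model outputs on the benchmark might look like the following. The dataset ID, split, and column names are assumptions; the real schema is documented on the benchmark's Hugging Face page:

```python
# Hedged sketch: iterating over AyaVisionBench to collect model outputs.
# Dataset ID, split, and columns below are assumptions, not confirmed.
from datasets import load_dataset

def run_benchmark(generate, limit=10):
    """`generate` maps (image, prompt) -> model response string."""
    bench = load_dataset("CohereForAI/AyaVisionBench", split="test")  # assumed ID/split
    results = []
    for example in bench.select(range(min(limit, len(bench)))):
        results.append({
            "language": example.get("language"),  # assumed column
            "prompt": example.get("prompt"),      # assumed column
            "response": generate(example.get("image"), example.get("prompt")),
        })
    return results

if __name__ == "__main__":
    # Stub model for demonstration; swap in a real vision-language model.
    print(run_benchmark(lambda image, prompt: "stub response"))
```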

The AI industry currently faces a so-called "evaluation crisis," largely stemming from the widespread use of popular benchmarks whose aggregate scores correlate poorly with proficiency on the tasks most AI users actually care about. Cohere claims AyaVisionBench offers a "broad and challenging" framework for evaluating models' cross-lingual and multimodal understanding.

Official blog: https://cohere.com/blog/aya-vision

Key Highlights:

🌟 Aya Vision, touted by Cohere as industry-best, performs various language and vision tasks.

💡 Aya Vision comes in 32B and 8B versions, outperforming larger competitor models.

🔍 Cohere also released a new benchmark, AyaVisionBench, to address AI model evaluation challenges.