Tokyo-based startup Rhymes AI has launched its first artificial intelligence model, Aria. The company claims that Aria is the world's first open-source multi-modal Mixture of Experts (MoE) model. Aria not only handles multiple input modalities but also delivers performance on par with, and in some cases superior to, well-known commercial models.

Aria is designed to provide strong understanding and processing capabilities across various input forms such as text, code, images, and videos. Unlike a standard Transformer, an MoE model replaces each feedforward layer with a set of specialized experts. For every input token, a routing module selects a small subset of experts to activate, which reduces the number of parameters activated per token and improves computational efficiency.

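To make the routing idea concrete, here is a minimal sketch of a top-k MoE feedforward layer in PyTorch. The dimensions, expert count, and value of k are illustrative placeholders, not Aria's actual configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MoEFeedForward(nn.Module):
    """Top-k mixture-of-experts feedforward layer (illustrative sizes only)."""

    def __init__(self, d_model=512, d_ff=2048, n_experts=8, k=2):
        super().__init__()
        self.k = k
        # Each expert is an ordinary feedforward block.
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
            for _ in range(n_experts)
        )
        # The router produces one score per expert for every token.
        self.router = nn.Linear(d_model, n_experts)

    def forward(self, x):  # x: (n_tokens, d_model)
        weights, idx = self.router(x).topk(self.k, dim=-1)  # keep the k best experts per token
        weights = F.softmax(weights, dim=-1)
        out = torch.zeros_like(x)
        # Only the selected experts run, so most parameters stay inactive per token.
        for slot in range(self.k):
            for e, expert in enumerate(self.experts):
                mask = idx[:, slot] == e
                if mask.any():
                    out[mask] += weights[mask, slot, None] * expert(x[mask])
        return out

print(MoEFeedForward()(torch.randn(4, 512)).shape)  # torch.Size([4, 512])
```

The efficiency gain comes from the masked dispatch: each token's hidden state passes through only k of the experts, so compute per token scales with the activated subset rather than the full parameter count.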
Aria's decoder activates up to 3.5 billion parameters per text token, out of 24.9 billion parameters in the full model. To handle visual inputs, Rhymes AI also designed a lightweight visual encoder with 438 million parameters, which converts visual inputs of varying lengths, sizes, and aspect ratios into visual tokens. Additionally, Aria's multi-modal context window spans 64,000 tokens, allowing it to process much longer inputs.

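As a rough illustration of how variable-size images can become variable-length token sequences, the sketch below embeds non-overlapping patches with a strided convolution. The patch size and dimensions are assumptions for illustration; Aria's actual encoder may work differently.

```python
import torch
import torch.nn as nn

class PatchTokenizer(nn.Module):
    """Illustrative sketch (not Aria's actual encoder): larger images simply yield more tokens."""

    def __init__(self, patch=14, d_model=512):
        super().__init__()
        # A strided convolution embeds each non-overlapping patch in one shot.
        self.proj = nn.Conv2d(3, d_model, kernel_size=patch, stride=patch)

    def forward(self, image):  # image: (3, H, W), H and W multiples of patch
        tokens = self.proj(image.unsqueeze(0))   # (1, d_model, H/14, W/14)
        return tokens.flatten(2).transpose(1, 2) # (1, n_tokens, d_model)

# A 448x448 image yields (448/14)^2 = 1024 visual tokens; a 224x672 crop
# yields 16*48 = 768, so aspect ratio is reflected in the token count.
tok = PatchTokenizer()
print(tok(torch.randn(3, 448, 448)).shape)  # torch.Size([1, 1024, 512])
```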
Aria's training was divided into four stages: pre-training on text data, introducing multi-modal data, training on long sequences, and finally fine-tuning. Across these stages, Aria was pre-trained on a total of 6.4 trillion text tokens and 400 billion multi-modal tokens, with data sourced from well-known datasets like Common Crawl and LAION and augmented with synthetic data.

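The staged curriculum can be pictured as a simple configuration, as in the sketch below. The stage names and token totals follow the article; the per-stage context lengths before the long-sequence stage are assumptions for illustration.

```python
# Four-stage curriculum as described above. The 6.4T text / 400B multi-modal
# token budgets are totals across pre-training; context lengths for the first
# two stages are assumed for illustration, only the 64K window is from the article.
stages = [
    {"name": "text pre-training",        "modalities": ["text"],                   "context": 8_000},
    {"name": "multi-modal pre-training", "modalities": ["text", "image", "video"], "context": 8_000},
    {"name": "long-sequence training",   "modalities": ["text", "image", "video"], "context": 64_000},
    {"name": "fine-tuning",              "modalities": ["instruction data"],       "context": 64_000},
]
for s in stages:
    print(f"{s['name']}: modalities={s['modalities']}, context window={s['context']:,} tokens")
```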
On multi-modal, language, and coding benchmarks, Aria outperforms models such as Pixtral-12B and Llama-3.2-11B. Because fewer parameters are activated per token, its inference costs are also lower.

Moreover, Aria performs well when handling videos with captions or multi-page documents. Its ability to understand long videos and documents surpasses proprietary models like GPT-4o mini and Gemini 1.5 Flash.

To facilitate adoption, Rhymes AI has released Aria's source code on GitHub under the Apache 2.0 license, allowing both academic and commercial use. The company also provides a training framework that enables fine-tuning of Aria on a single GPU with various data sources and formats. Notably, Rhymes AI has partnered with AMD to optimize model performance, demonstrating a search application named BeaGo that runs on AMD hardware and offers comprehensive text and image AI search results.

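Since the weights are open, a quick way to try Aria is through Hugging Face Transformers. The sketch below assumes the checkpoint is published under the rhymes-ai/Aria identifier and ships custom modeling code (hence trust_remote_code=True); consult the GitHub README for the authoritative instructions.

```python
import torch
from transformers import AutoModelForCausalLM, AutoProcessor

model_id = "rhymes-ai/Aria"  # assumed Hugging Face checkpoint name
processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,  # the full 24.9B-parameter MoE is large; bf16 halves memory
    device_map="auto",           # spread layers across available GPUs
    trust_remote_code=True,
)
```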
Key Points:

🌟 Aria is billed as the world's first open-source multi-modal Mixture of Experts (MoE) AI model.

💡 Aria excels in handling various inputs such as text, images, and videos, outperforming many peer models.

🤝 Rhymes AI collaborates with AMD to optimize model performance and launches the feature-rich BeaGo search application.