Tokyo-based startup Rhymes AI has launched its first artificial intelligence model, Aria. The company claims that Aria is the world's first open-source multi-modal Mixture of Experts (MoE) model. Aria not only handles multiple input modalities but also delivers performance on par with, and in some cases superior to, well-known commercial models.

Aria is designed to provide strong understanding and processing capabilities across various input forms such as text, code, images, and videos. Unlike a standard Transformer, an MoE model replaces each feedforward layer with a set of specialized experts. For every input token, a routing module selects a small subset of experts to activate, which reduces the number of parameters activated per token and improves computational efficiency.

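To make the routing idea concrete, here is a minimal sketch of a top-k MoE feedforward layer in PyTorch. The dimensions, expert count, and value of k are illustrative placeholders, not Aria's actual configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MoEFeedForward(nn.Module):
    """Top-k mixture-of-experts feedforward layer (illustrative sizes only)."""

    def __init__(self, d_model=512, d_ff=2048, n_experts=8, k=2):
        super().__init__()
        self.k = k
        # Each expert is an ordinary feedforward block.
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
            for _ in range(n_experts)
        )
        # The router produces one score per expert for every token.
        self.router = nn.Linear(d_model, n_experts)

    def forward(self, x):  # x: (n_tokens, d_model)
        weights, idx = self.router(x).topk(self.k, dim=-1)  # keep the k best experts per token
        weights = F.softmax(weights, dim=-1)
        out = torch.zeros_like(x)
        # Only the selected experts run, so most parameters stay inactive per token.
        for slot in range(self.k):
            for e, expert in enumerate(self.experts):
                mask = idx[:, slot] == e
                if mask.any():
                    out[mask] += weights[mask, slot, None] * expert(x[mask])
        return out

print(MoEFeedForward()(torch.randn(4, 512)).shape)  # torch.Size([4, 512])
```

The efficiency gain comes from the masked dispatch: each token's hidden state passes through only k of the experts, so compute per token scales with the activated subset rather than the full parameter count.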
Aria's decoder activates up to 3.5 billion parameters per text token, out of 24.9 billion parameters in the full model. To handle visual inputs, Rhymes AI also designed a lightweight visual encoder with 438 million parameters, which converts visual inputs of varying lengths, sizes, and aspect ratios into visual tokens. Additionally, Aria's multi-modal context window spans 64,000 tokens, allowing it to process much longer inputs.

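As a rough illustration of how variable-size images can become variable-length token sequences, the sketch below embeds non-overlapping patches with a strided convolution. The patch size and dimensions are assumptions for illustration; Aria's actual encoder may work differently.

```python
import torch
import torch.nn as nn

class PatchTokenizer(nn.Module):
    """Illustrative sketch (not Aria's actual encoder): larger images simply yield more tokens."""

    def __init__(self, patch=14, d_model=512):
        super().__init__()
        # A strided convolution embeds each non-overlapping patch in one shot.
        self.proj = nn.Conv2d(3, d_model, kernel_size=patch, stride=patch)

    def forward(self, image):  # image: (3, H, W), H and W multiples of patch
        tokens = self.proj(image.unsqueeze(0))   # (1, d_model, H/14, W/14)
        return tokens.flatten(2).transpose(1, 2) # (1, n_tokens, d_model)

# A 448x448 image yields (448/14)^2 = 1024 visual tokens; a 224x672 crop
# yields 16*48 = 768, so aspect ratio is reflected in the token count.
tok = PatchTokenizer()
print(tok(torch.randn(3, 448, 448)).shape)  # torch.Size([1, 1024, 512])
```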
Aria's training was divided into four stages: pre-training on text data, introducing multi-modal data, training on long sequences, and finally fine-tuning. Across these stages, Aria was pre-trained on a total of 6.4 trillion text tokens and 400 billion multi-modal tokens, with data sourced from well-known datasets like Common Crawl and LAION and augmented with synthetic data.

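The staged curriculum can be pictured as a simple configuration, as in the sketch below. The stage names and token totals follow the article; the per-stage context lengths before the long-sequence stage are assumptions for illustration.

```python
# Four-stage curriculum as described above. The 6.4T text / 400B multi-modal
# token budgets are totals across pre-training; context lengths for the first
# two stages are assumed for illustration, only the 64K window is from the article.
stages = [
    {"name": "text pre-training",        "modalities": ["text"],                   "context": 8_000},
    {"name": "multi-modal pre-training", "modalities": ["text", "image", "video"], "context": 8_000},
    {"name": "long-sequence training",   "modalities": ["text", "image", "video"], "context": 64_000},
    {"name": "fine-tuning",              "modalities": ["instruction data"],       "context": 64_000},
]
for s in stages:
    print(f"{s['name']}: modalities={s['modalities']}, context window={s['context']:,} tokens")
```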
On multi-modal, language, and coding benchmarks, Aria outperforms models such as Pixtral-12B and Llama-3.2-11B. Because fewer parameters are activated per token, its inference costs are also lower.

Moreover, Aria performs well when handling videos with captions or multi-page documents. Its ability to understand long videos and documents surpasses proprietary models like GPT-4o mini and Gemini 1.5 Flash.

To facilitate adoption, Rhymes AI has released Aria's source code on GitHub under the Apache 2.0 license, allowing both academic and commercial use. The company also provides a training framework that enables fine-tuning of Aria on a single GPU with various data sources and formats. Notably, Rhymes AI has partnered with AMD to optimize model performance, demonstrating a search application named BeaGo that runs on AMD hardware and offers comprehensive text and image AI search results.

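Since the weights are open, a quick way to try Aria is through Hugging Face Transformers. The sketch below assumes the checkpoint is published under the rhymes-ai/Aria identifier and ships custom modeling code (hence trust_remote_code=True); consult the GitHub README for the authoritative instructions.

```python
import torch
from transformers import AutoModelForCausalLM, AutoProcessor

model_id = "rhymes-ai/Aria"  # assumed Hugging Face checkpoint name
processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,  # the full 24.9B-parameter MoE is large; bf16 halves memory
    device_map="auto",           # spread layers across available GPUs
    trust_remote_code=True,
)
```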
Key Points:

🌟 Aria is billed as the world's first open-source multi-modal Mixture of Experts (MoE) AI model.

💡 Aria excels in handling various inputs such as text, images, and videos, outperforming many peer models.

🤝 Rhymes AI collaborates with AMD to optimize model performance and launches the feature-rich BeaGo search application.