Welcome to the 【AI Daily】 column! This is your daily guide to the world of artificial intelligence. Every day we bring you the hottest topics in AI, with a focus on developers, to help you track technology trends and innovative AI product applications.

Discover more fresh AI products: https://top.aibase.com/

1. Kimi's Multimodal Image Understanding Model API Released

On January 15, 2025, Moonshot AI (Beijing Dark Side of the Moon Technology Co., Ltd.), the company behind Kimi, officially launched its new multimodal image understanding model, moonshot-v1-vision-preview. The model extends the multimodal capabilities of the existing moonshot-v1 series, aiming to help Kimi better understand the world. The Vision model offers strong image recognition: it can identify complex details and distinguish similar objects, and it excels particularly at OCR and image understanding, exceeding the accuracy of traditional OCR software.

【AiBase Summary:】

🖼️ The Vision model has powerful image recognition capabilities, accurately distinguishing complex details and similar objects.

📄 It excels at OCR and image understanding, recognizing even messy handwritten text more reliably than ordinary OCR software.

💬 The model supports multi-turn dialogue and tool invocation features, offering flexible usage, but does not support online search.
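Moonshot's API is advertised as OpenAI-compatible, so a request to the new Vision model should resemble an OpenAI-style chat completion with an image attached. The sketch below only builds the request payload (the data-URL image format and message shape are assumptions based on that compatibility; check the official docs before relying on them):

```python
import base64

def build_vision_request(image_bytes: bytes, question: str,
                         model: str = "moonshot-v1-vision-preview") -> dict:
    """Build an OpenAI-style chat payload with one image and one question.

    The base64 data-URL image format mirrors the OpenAI vision API,
    which Moonshot's API is described as compatible with.
    """
    image_url = "data:image/png;base64," + base64.b64encode(image_bytes).decode()
    return {
        "model": model,
        "messages": [
            {
                "role": "user",
                "content": [
                    {"type": "image_url", "image_url": {"url": image_url}},
                    {"type": "text", "text": question},
                ],
            }
        ],
    }

# The payload would then be sent with any OpenAI-compatible client, e.g.:
#   client = OpenAI(api_key=..., base_url="https://api.moonshot.cn/v1")
#   client.chat.completions.create(**payload)
payload = build_vision_request(b"\x89PNG...", "What text appears in this image?")
print(payload["model"])  # moonshot-v1-vision-preview
```

Multi-turn dialogue would work the same way: append the model's reply and the next user message to `messages` and resend.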

2. MiniMax Releases New Open Source MiniMax-01 Series Model

On January 15, 2025, MiniMax launched its new open-source model series MiniMax-01, which includes the base language model MiniMax-Text-01 and the visual multimodal model MiniMax-VL-01. The series achieves efficient long-text processing through an innovative linear attention mechanism and a very large parameter count, matching the performance of top international models.

【AiBase Summary:】

🧠 The MiniMax-01 series models adopt an innovative linear attention mechanism, breaking the limitations of traditional architectures and supporting context processing of up to 4 million tokens.

💡 This series has matched the performance of GPT-4o and Claude-3.5-Sonnet across multiple tasks, especially excelling in long-text tasks.

💰 MiniMax offers text and multimodal understanding API services at the industry's lowest prices, with standard pricing at 1 yuan per million input tokens and 8 yuan per million output tokens.
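At the quoted rates (1 yuan per million input tokens, 8 yuan per million output tokens), estimating a call's cost is simple arithmetic; a minimal sketch:

```python
# Quoted MiniMax API rates from the announcement, in yuan per million tokens.
INPUT_RATE_YUAN_PER_M = 1.0
OUTPUT_RATE_YUAN_PER_M = 8.0

def estimate_cost_yuan(input_tokens: int, output_tokens: int) -> float:
    """Estimated cost in yuan for one API call at the quoted rates."""
    return (input_tokens * INPUT_RATE_YUAN_PER_M
            + output_tokens * OUTPUT_RATE_YUAN_PER_M) / 1_000_000

# Example: a long-context call with 1M input tokens and 10k output tokens.
print(round(estimate_cost_yuan(1_000_000, 10_000), 2))  # 1.08
```

Even a call that fills most of the 4-million-token context window stays in the single-digit-yuan range on the input side, which is the point of the pricing claim.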

Details link: https://github.com/MiniMax-AI

3. Zhou Hongyi Stars in AI Short Drama, Featuring AI Effects and Hardware

Zhou Hongyi, founder of 360 Group, announced his participation in the filming of China's first AI short drama, which will begin production in Xi'an and is scheduled to launch during the Spring Festival. The drama, themed around time travel, is expected to run 60 episodes and aims to convey positive energy while avoiding clichéd plots. Zhou hopes the drama will showcase AI technology and promote its integration into everyday life, while also advancing the development of 360's Nano AI Search product.

【AiBase Summary:】

🌟 The short drama will start production in Xi'an and is planned to launch during the Spring Festival, with a time travel theme and an expected 60 episodes.

🤖 Special effects will be generated with Nano AI Search, reducing filming costs and enhancing visual quality.

📚 The aim is to popularize AI knowledge and help everyone master AI technology, bridging the digital divide.

4. ByteDance Launches Valley 2, a Multimodal Large Model for E-commerce Scenarios

ByteDance has launched Valley 2, a multimodal large language model designed specifically for e-commerce scenarios, aimed at improving performance across a range of domains and expanding application boundaries. The model combines an advanced visual encoder with innovative processing modules and performs strongly on multiple benchmarks, marking a significant advance in multimodal language models.

【AiBase Summary:】

🌟 Valley 2 is designed for e-commerce scenarios, utilizing Qwen 2.5 as the backbone, combined with the SigLIP-384 visual encoder to enhance multimodal processing capabilities.

📊 The training process includes text-visual alignment and chain reasoning post-training, ensuring high efficiency in solving complex problems.

🏆 Valley 2 has excelled in multiple public benchmark tests, especially surpassing similarly sized models in e-commerce applications.

Details link: https://www.modelscope.cn/models/bytedance-research/Valley-Eagle-7B

5. ChatGPT Gets an Agent-Style "Tasks" Feature for Handling Reminders and To-Dos

OpenAI has launched a new ChatGPT feature called "Tasks" that lets users schedule future actions and reminders, making ChatGPT work more like a traditional digital assistant. The feature is rolling out to Plus, Team, and Pro subscribers: users simply describe a task and a time, and ChatGPT handles the request. For now it is limited to paid users.

【AiBase Summary:】

✅ The new "Tasks" feature allows users to schedule future actions and reminders, enhancing ChatGPT's practicality.

🔔 Users can inform ChatGPT of their desired tasks and times through simple input, easily managing daily affairs.

💼 The feature is currently limited to paid users; whether it will reach free users is unclear, and it is expected to remain a premium feature.

6. Kokoro-TTS: A Small Text-to-Speech Model That Topped the TTS Charts

Kokoro is a newly released speech synthesis model with just 82 million parameters that has quickly made its mark in the TTS field. After launching on Hugging Face, it reached the top of the TTS leaderboard despite being trained on less than 100 hours of audio, showcasing exceptional cost-effectiveness. Although it cannot yet clone voices, the compliance and efficiency of its training process lay the groundwork for future development.

【AiBase Summary:】

🌟 Kokoro-82M is a newly released speech synthesis model with 82 million parameters, supporting multiple voice packs.

🎤 The model excels in the TTS field, having topped the leaderboard despite being trained on less than 100 hours of audio data.

📊 The training of the Kokoro model used data under an open license, ensuring compliance, but there are still some functional limitations.

Details link: https://huggingface.co/hexgrad/Kokoro-82M

7. Topview AI Launches the World's First Digital Human "Product Avatar" Supporting Product Generation

Topview AI's "Product Avatar" digital human solution brings revolutionary changes to the e-commerce industry. Merchants only need to upload product images, and the AI can generate a digital human holding the product and provide a voice-over, greatly saving filming time and costs. This product also supports multiple languages and customization, marking a new phase of AI-driven e-commerce marketing.

【AiBase Summary:】

🤖 AI digital humans can be generated quickly, eliminating the need for real models, saving time and costs.

🌍 Supports over 1,000 digital human models and 28 languages, meeting global market demands.

🎥 Flexible and efficient product display mode, allowing merchants to change products at any time, enhancing promotional efficiency.

Details link: https://www.topview.ai/ai-product-avatar

8. Nvidia Invests $4 Million in MetAI to Transform CAD Files into 3D Worlds in Minutes

Nvidia recently made a $4 million seed round investment in the startup MetAI, aimed at advancing AI digital twin technology. MetAI focuses on rapidly converting CAD files into functional 3D environments using AI and 3D technology, significantly shortening the time needed to create digital twins. The company plans to move its headquarters to the United States in 2025 and expand its R&D team to meet the growing market demand.

【AiBase Summary:】

🌟 Nvidia invests $4 million in startup MetAI to promote AI digital twin technology development.

🤖 MetAI utilizes AI and 3D technology to quickly convert CAD files into functional 3D environments, shortening digital twin creation time.

🚀 MetAI plans to relocate its headquarters to the U.S. in 2025 and expand its R&D team to meet the increasing market demand.

9. iFLYTEK Spark 4.0 Turbo Upgrades Seven Core Capabilities: Math and Coding Abilities Surpass GPT-4o

The comprehensive upgrade of iFLYTEK Spark 4.0 Turbo marks another breakthrough for iFLYTEK in artificial intelligence. The upgrade delivers notable improvements across seven core capabilities, including text generation and language understanding, and surpasses GPT-4o in math and coding, showing particular strength on complex mathematical problems.

【AiBase Summary:】

🔢 Math capabilities have significantly improved, surpassing GPT-4o, capable of handling complex mathematical problems.

💻 The newly launched Spark deep reasoning model X1 has 175 billion parameters, suitable for deep data analysis.

📈 iFLYTEK has invested a total of 12.5 billion yuan in R&D since 2020, supporting the continuous development of AI technology.

10. Gemini AI Achieves New Breakthrough in Visual Processing: Real-time Video and Static Image Synchronized Analysis

Google's Gemini AI has recently achieved a significant breakthrough in visual processing, capable of simultaneously processing real-time video and static images. This technology was demonstrated through the experimental application AnyChat, marking advancements in multi-stream processing with artificial intelligence. Developers can leverage Gemini's architecture to create custom platforms applicable in education, art, and more, showcasing broad application potential.

【AiBase Summary:】

🌟 Gemini AI achieves synchronized processing of real-time video and static images, breaking previous limitations.

🎨 The AnyChat platform showcases the broad application potential of AI in education, art, and other fields.

🚀 Developers can easily utilize Gemini's technology to build their own visual AI applications.

Details link: https://huggingface.co/spaces/akhaliq/anychat

11. iFLYTEK Launches Spark Simultaneous Interpretation Voice Model: Achieves Human Expert Interpreter Level

iFLYTEK today launched the Spark simultaneous interpretation voice model, the first large model in China with end-to-end simultaneous interpretation capabilities. The technology significantly improves translation fluency and accuracy, especially in international communication scenarios. The model supports real-time translation across multiple languages, with response times under 5 seconds, reaching the level of expert human interpreters and pointing toward more convenient and efficient international communication.

【AiBase Summary:】

🚀 The Spark simultaneous interpretation voice model is the first large model in China with end-to-end simultaneous interpretation capabilities, significantly improving translation quality.

🌍 The model achieves almost no delay in English to Chinese translations, suitable for international exhibitions and tourism scenarios.

⚡ Supports streaming translation and adaptive speech rate adjustment, greatly enhancing the naturalness and fluency of translations, surpassing international counterparts.

12. OpenBMB Releases Multimodal Model MiniCPM-o 2.6: Visual and Voice Processing on Mobile Devices

OpenBMB has launched MiniCPM-o 2.6, an 8-billion-parameter multimodal model aimed at the twin challenges of high computational resource demands and edge-device compatibility. The model performs well across vision, speech, and language tasks and runs efficiently on smartphones and tablets. Thanks to its modular design, MiniCPM-o 2.6 integrates several powerful components and supports real-time processing and multilingual use.

【AiBase Summary:】

🌟 MiniCPM-o 2.6 is a multimodal model with 8 billion parameters, capable of running efficiently on edge devices and supporting visual, voice, and language processing.

🚀 This model performs excellently in the OpenCompass benchmark tests, with visual task scores exceeding those of GPT-4V, and possesses multilingual processing capabilities.

🛠️ MiniCPM-o 2.6 offers real-time processing, voice cloning, and emotion control, making it suitable for innovative applications across industries such as education and healthcare.

Details link: https://huggingface.co/openbmb/MiniCPM-o-2_6