Doubao Large Model Release: 8 Key Moments of 2024 - From AI Rising Star to Complete Breakthrough

AIbase基地

Published in AI News · 8 min read · Dec 30, 2024

Today, the Doubao Large Model team officially released the 8 key moments of the Doubao Large Model! Since its debut on May 15, 2024, the Doubao Large Model has made significant strides in just 230 days. From its initial attempts at language learning to exploring the world and creating fantastical dreams for creators, each step of this journey has been filled with challenges and achievements.

1. Breakthroughs in Speech Recognition and Emotional Expression

In July, the Doubao Large Model achieved a major breakthrough in the field of speech recognition: it can understand mixed conversations in over 20 dialects and possesses the ability to think while listening. Moreover, it learned to express emotions during conversations, seamlessly interjecting and even preserving nuances like slurring and accents that are characteristic of human speech. The core technologies behind this are the Doubao Speech Recognition Model Seed-ASR and the Speech Generation Model Seed-TTS, which integrate a wider range of data and reasoning chains, giving it strong generalization capabilities.

2. The Birth of the AI Band

In September, the Doubao Large Model creatively realized the concept of an "AI Band." From songwriting to performance generation and vocal singing, the Doubao Large Model mastered over 10 music creation skills, bringing unexpected inspiration to music creation. The underlying technology is the Seed-Music framework, which combines the strengths of language models and diffusion models to achieve a universal framework for music generation with high editing controllability.

3. Precise Video Generation and Camera Control

In the same month, the Doubao Large Model further broke creative boundaries by generating high-definition videos with multiple subjects while precisely controlling camera angles based on complex prompts. With the help of the PixelDance and Seaweed video generation models, the Doubao Large Model can produce high-quality video and audio synchronously, providing creators with a more realistic and dreamlike visual experience.

4. Upgraded Image Editing and Creation Capabilities

In November, the Doubao Large Model acquired the abilities of "one-sentence image editing" and "one-click poster generation." Users can achieve precise image editing and text generation with simple text commands. Through the continually iterated text-to-image model SeedEdit, Doubao can accurately present complex scenes, providing natural language-driven image editing.

5. Leap in Programming Capabilities

By December, the programming capabilities of the Doubao Large Model saw significant enhancement, transforming it into an AI programmer and data analyst. With Doubao MarsCode, users can easily write code, process data, and perform visual analysis. Doubao's code model, Doubao-coder, deeply supports 16 programming languages and meets the full-stack programming needs for front-end and back-end development as well as machine learning.

6. Extreme Text Understanding and Processing Capabilities

The Doubao Large Model also pushed past the limits of the context window, expanding its capacity to 3 million words. This lets it process larger volumes of text with a delay of only 15 seconds per million tokens. Using algorithms such as STRING, the Doubao Large Model can rapidly absorb vast amounts of external knowledge and understand it more accurately.

7. Breakthroughs in Visual Perception and Deep Thinking

In mid-December, the Doubao Large Model gained visual perception abilities and could integrate multiple senses for deep thinking. It not only understands images accurately but can also perform complex calculations, such as working through a calculus problem captured in an image, showcasing its outstanding cross-modal learning and reasoning capabilities.

8. Comprehensive Upgrade of the General Model Doubao-pro

In mid-December, the Doubao general model Doubao-pro underwent a comprehensive upgrade, aligning its capabilities with GPT-4 and learning to "reflect" during responses. This upgrade improved Doubao-pro's understanding accuracy and generation quality, making it an efficient "hexagonal warrior" with balanced performance across various abilities, setting a new benchmark in the AI field.

This year, the Doubao Large Model team has made significant progress in AI fundamental research. The team published 57 papers and presented at top conferences such as ICLR, CVPR, and NeurIPS. Additionally, the Doubao Large Model team has established joint laboratories with several top universities to promote the development of AI technology.

The Doubao Large Model has not only achieved breakthroughs in technology but has also been widely applied across various industries. Through the Volcano Engine, the Doubao Large Model serves over 30 industries, with a daily token usage exceeding 4 trillion, a 33-fold increase since its release in May.
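As a quick sanity check on these figures, the short sketch below works backward from the quoted numbers to the daily token volume they imply for May. It assumes that a "33-fold increase" means the current volume is 33 times the May volume. The variable names are illustrative, and since the article says usage "exceeds" 4 trillion, the result is only a rough lower bound.

```python
# Figures quoted in the article.
daily_tokens_now = 4e12  # "exceeding 4 trillion" tokens per day
growth_factor = 33       # "a 33-fold increase" since the May release

# Assumption: "33-fold increase" means current volume = 33 x May volume.
implied_may_daily_tokens = daily_tokens_now / growth_factor

print(f"Implied May daily usage: about {implied_may_daily_tokens / 1e9:.0f} billion tokens")
```

Under that reading, the May baseline works out to roughly 120 billion tokens per day.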

Official announcement: https://mp.weixin.qq.com/s/KVfu86njzyK2iK4j6VJONw

ByteDance's Automatic Speech Recognition Model Seed-ASR: Understands Various Accents and Dialects!

The Seed-ASR engine launched by ByteDance achieves high-precision recognition of Mandarin, 13 Chinese dialects, and 7 foreign languages through massive training data, making cross-language communication considerably more convenient. Its key advantage is strong contextual awareness: by incorporating historical information, it accurately recognizes proper nouns, place names, and keywords, which improves recognition accuracy, particularly in specific scenarios. Whether in daily conversations, complex meetings, or interactions among multiple people in noisy environments, Seed-ASR can transcribe accurately. It can also recognize various professional terms.

Israeli Company Launches Open Source Speech Recognition Model Whisper Medusa with 50% Speed Increase

Israeli AI company aiOla has released an open source speech recognition model named Whisper Medusa, which is based on an improved architecture design that incorporates multi-head attention mechanisms, allowing it to process speech 50% faster than OpenAI's Whisper model. Whisper Medusa makes parallel predictions of ten tokens instead of the traditional one at a time, significantly enhancing speech recognition speed while maintaining performance. Its innovative training method employs weak supervision, freezing the backbone system and utilizing...