The American startup Useful Sensors has released an open-source speech recognition model called Moonshine. Moonshine is designed to process audio more efficiently, using fewer computational resources and reaching processing speeds up to five times faster than OpenAI's Whisper. The model is built for real-time applications on resource-constrained hardware and features a flexible architecture: unlike Whisper, which pads every input into fixed 30-second segments, Moonshine processes variable-length audio directly, so its compute cost scales with the actual length of the clip.
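To make that efficiency argument concrete, here is a minimal Python sketch, illustrative only, that is neither Moonshine's nor Whisper's actual code; the frame rate and clip lengths are assumptions chosen to show how compute grows when short audio is padded to a fixed window versus processed as-is.

```python
import math

# Illustrative assumptions (not taken from either model's implementation).
FRAME_RATE_HZ = 50        # assumed encoder frames per second of audio
FIXED_WINDOW_S = 30.0     # fixed window length used by a Whisper-style pipeline

def frames_fixed_window(clip_seconds: float) -> int:
    """Frames processed when every clip is zero-padded to full 30 s windows."""
    windows = max(1, math.ceil(clip_seconds / FIXED_WINDOW_S))
    return int(windows * FIXED_WINDOW_S * FRAME_RATE_HZ)

def frames_variable_length(clip_seconds: float) -> int:
    """Frames processed when compute scales with the actual clip length."""
    return math.ceil(clip_seconds * FRAME_RATE_HZ)

for clip in (2.0, 6.0, 30.0):
    fixed = frames_fixed_window(clip)
    variable = frames_variable_length(clip)
    print(f"{clip:>5.1f} s clip: {fixed} frames padded vs {variable} frames "
          f"unpadded ({fixed / variable:.1f}x more work)")
```

Under these toy numbers, a 6-second clip costs five times as many encoder frames in the padded pipeline, while a full 30-second clip costs the same either way, which is why the speedup is largest on short, real-time audio.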
Moore Threads has announced the open-source release of its audio understanding model MooER, the first large open-source speech model built on China's domestically produced full-featured GPUs. MooER supports Chinese and English speech recognition and translation, using a three-part model structure that delivers robust multilingual processing. The inference code and a model trained on 5,000 hours of data have been open-sourced, with plans to also release the training code and an enhanced version trained on 80,000 hours of data. In comparative tests, MooER-5K performed strongly, achieving a Chinese CER of 4.21% and an English WER of 17.98%.
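For reference, CER and WER are edit-distance metrics. The minimal Python sketch below is not MooER's evaluation code and the example sentences are invented; it only shows how character error rate and word error rate are typically computed.

```python
def edit_distance(ref, hyp):
    """Standard Levenshtein distance (substitutions, insertions, deletions)."""
    dp = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        prev, dp[0] = dp[0], i
        for j, h in enumerate(hyp, start=1):
            cur = min(dp[j] + 1,        # deletion (reference token missing)
                      dp[j - 1] + 1,    # insertion (extra hypothesis token)
                      prev + (r != h))  # substitution, or match if equal
            prev, dp[j] = dp[j], cur
    return dp[-1]

def wer(ref: str, hyp: str) -> float:
    """Word error rate: edit distance over words, normalized by reference length."""
    ref_words, hyp_words = ref.split(), hyp.split()
    return edit_distance(ref_words, hyp_words) / len(ref_words)

def cer(ref: str, hyp: str) -> float:
    """Character error rate: edit distance over characters, ignoring spaces."""
    ref_chars = list(ref.replace(" ", ""))
    hyp_chars = list(hyp.replace(" ", ""))
    return edit_distance(ref_chars, hyp_chars) / len(ref_chars)

print(f"WER: {wer('the cat sat on the mat', 'the cat sat on a mat'):.2%}")
print(f"CER: {cer('今天天气很好', '今天天气真好'):.2%}")
```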
At the Volcano Engine AI Innovation Roadshow in Shanghai on August 21, 2024, Volcano Engine showcased a comprehensive upgrade of the Doubao Large Model family. The Doubao Text-to-Image Model gained improved text-image matching for long texts; the Doubao Speech Recognition Model reduced error rates by up to 40% across multiple public test sets; and the Doubao Speech Synthesis Model was upgraded with streaming synthesis for real-time responses and accurate punctuation handling. Volcano Engine also released a real-time interactive solution for conversational AI that integrates the Doubao Large Model with real-time audio and video technology to provide end-to-end capabilities.
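As a rough illustration of what such an end-to-end conversational loop involves, the Python sketch below is not Volcano Engine's API: every function is an invented stub with canned return values, used only to show the speech-recognition-to-model-to-streaming-synthesis flow.

```python
from typing import Dict, Iterator, List

def speech_to_text(audio_frames: List[bytes]) -> str:
    """Stub for the recognition stage; a real system decodes audio incrementally."""
    return "what will the weather be like tomorrow"

def chat_model(user_text: str, history: List[Dict[str, str]]) -> str:
    """Stub for the large model stage; a real system calls the model with history."""
    return f"You asked: '{user_text}'. Tomorrow looks sunny."

def text_to_speech(reply_text: str) -> Iterator[bytes]:
    """Stub for streaming synthesis; yields audio chunks as they become ready."""
    for word in reply_text.split():
        yield word.encode("utf-8")  # stand-in for synthesized audio chunks

def conversation_turn(audio_frames: List[bytes],
                      history: List[Dict[str, str]]) -> Iterator[bytes]:
    """One turn: recognize the user's speech, query the model, stream audio back."""
    user_text = speech_to_text(audio_frames)
    history.append({"role": "user", "content": user_text})
    reply = chat_model(user_text, history)
    history.append({"role": "assistant", "content": reply})
    yield from text_to_speech(reply)

history: List[Dict[str, str]] = []
for chunk in conversation_turn([b"\x00" * 320], history):
    pass  # in a real client these chunks would be played back as they arrive
```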
Seed-ASR, the speech recognition engine launched by ByteDance, achieves high-precision recognition of Mandarin, 13 Chinese dialects, and 7 foreign languages thanks to massive training data, making cross-language communication significantly more convenient. Its key advantage is strong contextual awareness: by incorporating historical dialogue information, it accurately recognizes proper nouns, place names, keywords, and professional terminology, which is especially valuable in domain-specific scenarios. Whether in everyday conversation, complex meetings, or multi-speaker interactions in noisy environments, Seed-ASR delivers accurate transcription.
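Contextual biasing of this kind can be illustrated with a generic technique that is not Seed-ASR's published method: rescoring an n-best list so that hypotheses containing keywords from the dialogue history win out. In the Python sketch below, the keyword list, hypotheses, and scores are all invented for illustration.

```python
def rescore_with_context(nbest, context_keywords, bonus=2.0):
    """Add a fixed bonus to a hypothesis score for each context keyword it contains."""
    rescored = []
    for text, score in nbest:
        hits = sum(1 for kw in context_keywords if kw.lower() in text.lower())
        rescored.append((text, score + bonus * hits))
    return max(rescored, key=lambda pair: pair[1])

# Keywords drawn from earlier turns of the conversation (hypothetical).
context = ["Lin Wei", "Xihu District"]

# Hypothetical n-best list with log-probability-style scores.
nbest = [
    ("please send the report to lynn way in west lake district", -18.4),
    ("please send the report to Lin Wei in Xihu District", -19.1),
]

best_text, best_score = rescore_with_context(nbest, context)
print(best_text)  # the context keywords pull the proper-noun spelling to the top
```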