Nexa AI recently unveiled OmniAudio-2.6B, an audio language model designed for efficient deployment on edge devices. Unlike traditional architectures that keep automatic speech recognition (ASR) and the language model separate, OmniAudio-2.6B integrates Gemma-2-2b, Whisper Turbo, and a custom projector into a unified framework. This design eliminates the inefficiencies and delays of chaining separate components together, making it especially suitable for resource-constrained computing.
The American startup Useful Sensors has launched an open-source speech recognition model called Moonshine. Moonshine is designed to process audio more efficiently, using fewer computational resources and reportedly running up to five times faster than OpenAI's Whisper. The model is built specifically for real-time applications on resource-constrained hardware and features a flexible architecture. Unlike Whisper, which processes audio in fixed 30-second segments, Moonshine is designed to handle audio segments of varying length.
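To see why fixed 30-second segments can be costly for short utterances, here is a minimal sketch comparing how many audio samples a model would have to process under each scheme. It assumes 16 kHz input (the rate Whisper uses) and assumes the alternative approach processes only the actual audio length; the function names are hypothetical and are not part of either model's API.

```python
import math

SAMPLE_RATE = 16_000      # samples per second (Whisper's input rate)
WINDOW_SECONDS = 30       # Whisper's fixed segment length

def fixed_window_samples(duration_s: float) -> int:
    """Samples processed when audio is padded up to whole 30-second windows."""
    windows = max(1, math.ceil(duration_s / WINDOW_SECONDS))
    return windows * WINDOW_SECONDS * SAMPLE_RATE

def variable_length_samples(duration_s: float) -> int:
    """Samples processed when only the actual audio is fed to the model
    (an assumption about the variable-length approach, not measured behavior)."""
    return math.ceil(duration_s * SAMPLE_RATE)

def padding_overhead(duration_s: float) -> float:
    """Ratio of padded input size to actual input size."""
    return fixed_window_samples(duration_s) / variable_length_samples(duration_s)

if __name__ == "__main__":
    for seconds in (2, 5, 10, 30):
        print(f"{seconds:>2} s clip: {padding_overhead(seconds):.1f}x input size with fixed windows")
```

This counts input samples only; real speedups depend on the model architecture and hardware, so the ratio is an illustration of the padding cost rather than a benchmark.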
Moore Threads has announced the open-source release of its audio understanding model, MooER, making it the first large-scale open-source speech model trained on domestically produced full-feature GPUs. MooER supports Chinese and English speech recognition as well as translation, using a three-part model structure that demonstrates robust multilingual processing capabilities. The inference code and a model trained on 5,000 hours of data have been released as open source, with plans to also release the training code and an enhanced version trained on 80,000 hours of data. In comparative testing, MooER-5K performed strongly, achieving a Chinese character error rate (CER) of 4.21% and an English word error rate (WER) of 17.98%.
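For readers unfamiliar with these metrics, CER and WER are conventionally computed as the edit distance (substitutions, deletions, and insertions) between the reference transcript and the model output, divided by the reference length in characters or words. The following is a minimal sketch of that standard definition; it is not MooER's evaluation code, and the function names and example sentences are illustrative.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (lists of tokens)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        curr = [i]
        for j, h in enumerate(hyp, 1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]

def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edits divided by reference word count."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)

def cer(reference: str, hypothesis: str) -> float:
    """Character error rate: character-level edits divided by reference length.
    Spaces are ignored here, a common but not universal convention."""
    ref_chars = list(reference.replace(" ", ""))
    if not ref_chars:
        raise ValueError("reference must contain at least one character")
    return edit_distance(ref_chars, list(hypothesis.replace(" ", ""))) / len(ref_chars)

if __name__ == "__main__":
    print(f"WER: {wer('the cat sat on the mat', 'the cat sit on mat'):.2%}")
    print(f"CER: {cer('今天天气很好', '今天天汽很好'):.2%}")
```

Published benchmarks usually apply text normalization (casing, punctuation, number formatting) before scoring, so figures from this sketch are not directly comparable with reported results.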
At the Volcano Engine AI Innovation Roadshow in Shanghai on August 21, 2024, Volcano Engine showcased a comprehensive upgrade of the Doubao Large Model. The Doubao Text-to-Image Model now matches images to long text prompts more accurately. The Doubao Speech Recognition Model reduced error rates by up to 40% across multiple public test sets. The Doubao Speech Synthesis Model gained improved streaming synthesis for real-time responses and more accurate handling of punctuation. Volcano Engine also released a real-time interactive solution for conversational AI that integrates the Doubao Large Model with real-time audio and video technology to provide end-to-end capabilities.