Israeli AI startup aiOla has announced the launch of Whisper-Medusa, a new open-source speech recognition model.

The model is no small feat: it is 50% faster than OpenAI's renowned Whisper. Built on the foundation of Whisper, it incorporates a novel "multi-head attention" architecture that lets it predict far more tokens at a time than OpenAI's model. Moreover, the code and weights have been released on Hugging Face under the MIT license, which permits both research and commercial use.

According to Gill Hetz, aiOla's Vice President of Research, open-source initiatives encourage community innovation and collaboration, leading to faster and more refined results. This work paves the way for complex AI systems, enabling them to understand and respond to user queries almost in real-time.

In an era where foundation models can generate a variety of content, advanced speech recognition remains crucial. Whisper is a case in point: it handles complex speech in various languages and accents, is downloaded more than 5 million times per month, supports numerous applications, and has become the gold standard in speech recognition.

So, what makes aiOla's Whisper-Medusa special?

The company modified Whisper's architecture by adding a multi-head attention mechanism that can predict 10 tokens at a time, boosting speed by 50% without compromising accuracy. Because Whisper-Medusa's backbone is Whisper itself, the speed-up does not come at the expense of performance. The model was trained with a weakly supervised machine learning method, and future versions are expected to bring even greater advancements.
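
To make the idea of predicting several tokens per step concrete, here is a minimal toy sketch in Python. It is not aiOla's implementation: the names `greedy_decode`, `multi_head_decode`, `base_next`, `perfect_heads`, and `noisy_heads` are hypothetical, the "model" is just a lookup into a fixed token list, and the acceptance rule (keep a guess only while it matches what the base model would have produced) is an assumed simplification. It only shows why fewer sequential decoder passes are needed when extra heads guess ahead.

```python
from typing import Callable, List, Tuple

Token = int
NextFn = Callable[[List[Token]], Token]         # base model: prefix -> next token
DraftFn = Callable[[List[Token]], List[Token]]  # extra heads: prefix -> guessed tokens


def greedy_decode(next_token: NextFn, length: int) -> Tuple[List[Token], int]:
    """Baseline: one decoder pass per generated token."""
    out: List[Token] = []
    passes = 0
    while len(out) < length:
        out.append(next_token(out))
        passes += 1
    return out, passes


def multi_head_decode(next_token: NextFn, draft: DraftFn, length: int) -> Tuple[List[Token], int]:
    """Toy multi-token decoding: each pass yields one base token plus accepted guesses."""
    out: List[Token] = []
    passes = 0
    while len(out) < length:
        passes += 1
        # The base head's token is always kept.
        out.append(next_token(out))
        # Keep the extra heads' guesses only while they agree with the base model.
        # (Calling next_token here stands in for a parallel verification step.)
        for guess in draft(out):
            if len(out) >= length or guess != next_token(out):
                break
            out.append(guess)
    return out, passes


# Hypothetical stand-in for a transcript: 40 token ids.
TARGET: List[Token] = [(7 * i + 3) % 11 for i in range(40)]


def base_next(prefix: List[Token]) -> Token:
    return TARGET[len(prefix)]


def perfect_heads(prefix: List[Token]) -> List[Token]:
    # Nine extra heads that always guess correctly (10 tokens per pass in total).
    return TARGET[len(prefix):len(prefix) + 9]


def noisy_heads(prefix: List[Token]) -> List[Token]:
    # Same as above, but the third guess is always wrong.
    guesses = TARGET[len(prefix):len(prefix) + 9]
    if len(guesses) > 2:
        guesses[2] = -1
    return guesses


if __name__ == "__main__":
    for name, heads in [("perfect", perfect_heads), ("noisy", noisy_heads)]:
        tokens, passes = multi_head_decode(base_next, heads, len(TARGET))
        print(name, "heads:", passes, "passes, output matches baseline:", tokens == TARGET)
```

In this toy, the output is identical to one-token-at-a-time decoding no matter how good the guesses are; only the number of passes changes. In a real system the gain depends on how often guesses are accepted and on the extra cost of each pass.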

To train Whisper-Medusa, aiOla employed a weakly supervised machine learning method: it froze the main components of Whisper and used audio transcriptions generated by the model itself as labels to train the additional token prediction modules.
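
The labeling step described above can be illustrated with a small Python sketch. This is an assumption about how such targets could be laid out, not aiOla's published training code: `build_head_targets` and the `IGNORE` padding value are hypothetical, and the rule that extra head k at position t is trained to predict token t + 1 + k (the base model already covers token t + 1) is an assumed convention. The transcript passed in stands for a token sequence produced by the frozen base model.

```python
from typing import List

IGNORE = -100  # assumed marker for positions that have no target token


def build_head_targets(transcript: List[int], num_heads: int) -> List[List[int]]:
    """Build one target row per extra head from a model-generated transcript.

    Row k-1 (for extra head k, with k starting at 1) holds, for each
    position t, the token at index t + 1 + k, or IGNORE past the end.
    """
    n = len(transcript)
    targets: List[List[int]] = []
    for k in range(1, num_heads + 1):
        shift = k + 1
        row = [transcript[t + shift] if t + shift < n else IGNORE for t in range(n)]
        targets.append(row)
    return targets


if __name__ == "__main__":
    # Hypothetical token ids standing in for a transcript from the frozen model.
    pseudo_label = [5, 6, 7, 8]
    for k, row in enumerate(build_head_targets(pseudo_label, num_heads=2), start=1):
        print("extra head", k, "targets:", row)
```

Because the labels come from the model's own transcriptions, no human-annotated text is needed for this step, and because the base model is frozen, only the extra prediction modules would be updated during training.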

When asked whether any companies had early access to Whisper-Medusa, Hetz said the team had tested it on real-world enterprise data use cases to ensure it performs accurately in practical scenarios, which could make voice applications more responsive in the future. Ultimately, he believes that faster recognition and transcription will shorten the turnaround time for voice applications and pave the way for real-time responses.

Key Points:

💥50% Faster: aiOla's Whisper-Medusa significantly enhances speech recognition speed over OpenAI's Whisper.

🎯No Loss in Accuracy: The speed increase maintains the same accuracy as the original model.

📈Broad Application Potential: It is expected to accelerate responses in voice applications, improve efficiency, and reduce costs.