Nexa AI Launches OmniAudio-2.6B: A Fast Audio Language Model for Edge Deployment

AIbase基地

Published in AI News · 4 min read · Dec 16, 2024

Nexa AI has launched OmniAudio-2.6B, an audio language model designed for efficient deployment on edge devices. Unlike traditional pipelines that chain a separate automatic speech recognition (ASR) model to a language model, OmniAudio-2.6B integrates Gemma-2-2b, Whisper Turbo, and a custom projector into a single framework. This design eliminates the inefficiency and latency introduced by linking separate components, making the model particularly suitable for devices with limited computational resources.
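
To make the role of the projector concrete, here is a minimal sketch of how audio features might be mapped into a language model's embedding space and joined with text embeddings into one input sequence. It is not Nexa AI's implementation: the function names, the toy dimensions, the random weights, and the choice to place audio before text are all assumptions made for illustration.

```python
import random

# Toy sizes chosen for readability; the real models use far larger vectors.
AUDIO_DIM = 4  # hypothetical width of one audio-encoder feature frame
TEXT_DIM = 3   # hypothetical width of one language-model embedding


def make_projector(in_dim, out_dim, seed=0):
    """Build a random (out_dim x in_dim) weight matrix standing in for a trained projector."""
    rng = random.Random(seed)
    return [[rng.uniform(-1.0, 1.0) for _ in range(in_dim)] for _ in range(out_dim)]


def project(frame, weights):
    """Linearly map one audio frame into the language model's embedding width."""
    return [sum(w * x for w, x in zip(row, frame)) for row in weights]


def build_lm_input(audio_frames, text_embeddings, weights):
    """Project every audio frame, then place the result ahead of the text embeddings.

    The language model then sees one sequence, with no intermediate transcript.
    """
    projected = [project(frame, weights) for frame in audio_frames]
    return projected + text_embeddings


weights = make_projector(AUDIO_DIM, TEXT_DIM)
audio_frames = [[0.1 * i] * AUDIO_DIM for i in range(5)]  # 5 fake audio frames
text_embeddings = [[0.0] * TEXT_DIM for _ in range(2)]    # 2 fake prompt tokens

sequence = build_lm_input(audio_frames, text_embeddings, weights)
print(len(sequence), "vectors of width", len(sequence[0]))
```

The point of the sketch is only that audio and text end up in the same vector space, so one model can process both in a single pass instead of waiting for a separate ASR stage to finish.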

Main Highlights:

Processing Speed: On a 2024 Mac Mini M4 Pro, using the Nexa SDK, OmniAudio-2.6B reaches 35.23 tokens per second in the FP16 GGUF format and 66 tokens per second in the Q4_K_M GGUF format. By comparison, Qwen2-Audio-7B reaches only 6.38 tokens per second on similar hardware, giving OmniAudio-2.6B a significant speed advantage.

Resource Efficiency: The model's compact design reduces dependence on cloud resources, making it well suited to power- and bandwidth-constrained wearables, automotive systems, and IoT devices, where it can run efficiently on limited hardware.

High Accuracy and Flexibility: Although OmniAudio-2.6B prioritizes speed and efficiency, it also performs well on accuracy, making it suitable for tasks such as transcription, translation, and summarization. It is intended to deliver accurate results for both real-time speech processing and more complex language tasks.
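
The throughput figures quoted under Processing Speed can be turned into relative speedups with a few lines of arithmetic. The sketch below uses only the numbers reported in this article; the variable and function names are illustrative, and because the comparison was made on "similar" rather than identical hardware, the ratios should be read as approximate.

```python
# Tokens per second as reported in the article.
BASELINE_TPS = 6.38  # Qwen2-Audio-7B
OMNIAUDIO_TPS = {
    "FP16 GGUF": 35.23,
    "Q4_K_M GGUF": 66.0,
}


def speedup(tokens_per_second, baseline=BASELINE_TPS):
    """Return how many times faster a model is than the baseline."""
    return tokens_per_second / baseline


for fmt, tps in OMNIAUDIO_TPS.items():
    print(f"{fmt}: {speedup(tps):.2f}x the baseline throughput")

# Prints:
# FP16 GGUF: 5.52x the baseline throughput
# Q4_K_M GGUF: 10.34x the baseline throughput
```

In other words, the reported numbers put the FP16 build at roughly 5.5 times and the quantized Q4_K_M build at roughly 10 times the throughput of Qwen2-Audio-7B.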

The launch of OmniAudio-2.6B marks another significant advancement for Nexa AI in the field of audio language models. Its optimized architecture not only enhances processing speed and efficiency but also opens up more possibilities for edge computing devices. With the continuous proliferation of IoT and wearable devices, OmniAudio-2.6B is expected to play a vital role in various application scenarios.

Model page: https://huggingface.co/NexaAIDev/OmniAudio-2.6B

Product page: https://nexa.ai/blogs/omniaudio-2.6b

Chinese Visual and Speech Open Source Model VITA-1.5 Released with GPT-4o Level Advanced Speech and Visual Capabilities

Recently, significant progress has been made in multimodal large language models (MLLMs), particularly in the integration of visual and text modalities. However, with the increasing prevalence of human-computer interaction, the importance of the speech modality has become more prominent, especially in multimodal dialogue systems. Speech is not only a key medium for information transmission but also significantly enhances the naturalness and convenience of interactions. Nevertheless, due to the inherent differences between visual and speech data, integrating them into MLLMs is not an easy task. For example, visual data conveys spatial information, while speech data conveys information in a temporal sequence.

ByteDance's Automatic Speech Recognition Model Seed-ASR: Understands Various Accents and Dialects!

ByteDance's Seed-ASR engine, trained on massive amounts of data, achieves high-precision recognition of Mandarin, 13 Chinese dialects, and 7 foreign languages, making cross-language communication more convenient. Its key advantage is contextual awareness: by incorporating historical information, it recognizes proper nouns, place names, and keywords more accurately, which improves recognition accuracy in specific scenarios. Whether in daily conversations, complex meetings, or multi-speaker interactions in noisy environments, Seed-ASR can transcribe accurately, and it also recognizes a wide range of professional terms.

NVIDIA Launches New AI Speech Recognition Model Parakeet, Claimed to Outperform Whisper

NVIDIA NeMo has introduced the Parakeet family of ASR models, which achieve high speech recognition accuracy. The models use RNN Transducer and Connectionist Temporal Classification decoders and have 0.6 to 1.1 billion parameters. Parakeet has performed strongly across a range of benchmark datasets, making it suitable for speech transcription in varied acoustic environments.

Amazon Launches New ASR System Supporting Over 100 Languages

Amazon has released a next-generation ASR system that covers over 100 languages. Its speech foundation model improves accuracy by 20% to 50%, with gains of 30% to 70% in challenging areas such as telephone speech. The system supports features including automatic punctuation, custom vocabulary, automatic language identification, and speaker diarization. Thousands of businesses use Amazon Transcribe to extract insights from audio content and to improve its accessibility and discoverability.

AI Chips for Edge Devices: The New Battleground for Global Chip Manufacturers

As on-device AI becomes an emerging trend, global chip manufacturers are competing to produce chips that support embedded AI. Samsung, Qualcomm, Intel, and AMD are increasing their investments to meet growing demand from consumer electronics manufacturers for integrated AI solutions, with smartphones and laptops as the main battleground. Compared to cloud-based generative AI, on-device AI offers higher security, lower costs, and more personalized features with less power consumption. Industry analysts anticipate significant growth in the market for AI-capable personal computers.

TinyLlama: A 550MB AI Model Trained on 3 Trillion Tokens in Just 90 Days

TinyLlama, developed by a research team at the Singapore University of Technology and Design, is a compact yet powerful AI model that occupies only 550MB of memory, which makes it suitable for memory-constrained edge devices. The team plans to train the model on a dataset of 3 trillion tokens within 90 days, using 16 A100-40G GPUs. If successful, TinyLlama will provide high-performance AI solutions for applications such as real-time machine translation.
