Fish Audio recently released its latest voice processing model, Fish Agent v0.1 3B. The model generates and processes speech quickly and accurately, and it is particularly good at imitating and cloning different voices. It brings us a step closer to a natural, responsive AI voice assistant.
Fish Agent v0.1 3B is built on Qwen-2.5-3B-Instruct and trained on a large dataset of 200 billion voice and text tokens. Unlike traditional models that rely on a separate semantic encoding step, it uses a "semantic-free token" architecture that processes and generates speech directly at the acoustic level. This direct approach simplifies the model structure and improves its responsiveness and efficiency.
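To make "processing speech directly as tokens" concrete, here is a generic sketch of one common way a decoder-only language model can treat discrete audio codec indices as ordinary tokens: the audio codes are mapped into a reserved ID range placed after the text vocabulary, so text and audio share a single token stream with no intermediate semantic representation. This illustrates the general idea only and is not Fish Agent's implementation; the vocabulary size, codebook size, and function names are hypothetical.

```python
# Hypothetical sizes for illustration only; NOT Fish Agent's real values.
TEXT_VOCAB_SIZE = 1000
AUDIO_CODEBOOK_SIZE = 256
AUDIO_OFFSET = TEXT_VOCAB_SIZE  # audio codes occupy IDs [1000, 1256)


def audio_to_token_ids(codes):
    """Map discrete audio codec indices into the shared token ID space."""
    for c in codes:
        if not 0 <= c < AUDIO_CODEBOOK_SIZE:
            raise ValueError(f"audio code out of range: {c}")
    return [AUDIO_OFFSET + c for c in codes]


def split_stream(token_ids):
    """Separate a mixed token stream back into text IDs and audio codes."""
    text, audio = [], []
    for t in token_ids:
        if 0 <= t < TEXT_VOCAB_SIZE:
            text.append(t)
        elif TEXT_VOCAB_SIZE <= t < AUDIO_OFFSET + AUDIO_CODEBOOK_SIZE:
            audio.append(t - AUDIO_OFFSET)
        else:
            raise ValueError(f"token id out of range: {t}")
    return text, audio


# A prompt mixing hypothetical text token IDs with three audio codes.
prompt = [17, 42, 3] + audio_to_token_ids([0, 5, 255])
print(prompt)  # [17, 42, 3, 1000, 1005, 1255]
text_ids, audio_codes = split_stream(prompt)
print(text_ids, audio_codes)  # [17, 42, 3] [0, 5, 255]
```

Because both modalities live in one ID space, the same next-token prediction loop can emit text or audio without a separate semantic model in between.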
Thanks to this architecture, Fish Agent v0.1 3B generates high-quality, natural speech quickly, enabling near-instant voice cloning and text-to-speech with a time to first audio (TTFA) of just 200 milliseconds. This makes it well suited to applications that need real-time speech generation, such as voice assistants, automated customer service, and other scenarios that require fast spoken feedback.
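TTFA measures how long a listener waits before the first chunk of audio arrives from a streaming system, not how long the whole utterance takes to generate. The sketch below measures both for a simulated stream; `fake_tts_stream` is a hypothetical stand-in with made-up delays, not a Fish Audio API, and a real client would iterate over whatever chunk stream its engine exposes.

```python
import time
from typing import Iterable, Iterator, Tuple


def fake_tts_stream(n_chunks: int = 5, first_delay: float = 0.02,
                    chunk_delay: float = 0.01) -> Iterator[bytes]:
    """Hypothetical streaming TTS engine: yields placeholder audio chunks with delays."""
    time.sleep(first_delay)  # simulated latency before the first chunk
    for i in range(n_chunks):
        if i > 0:
            time.sleep(chunk_delay)  # simulated time to produce each later chunk
        yield b"\x00" * 320  # placeholder PCM bytes


def measure_ttfa(stream: Iterable[bytes]) -> Tuple[float, float, int]:
    """Return (time to first audio, total time, chunk count); times in seconds."""
    start = time.perf_counter()
    ttfa = None
    count = 0
    for _chunk in stream:
        if ttfa is None:
            ttfa = time.perf_counter() - start  # first chunk arrived
        count += 1
    total = time.perf_counter() - start
    if ttfa is None:
        raise ValueError("stream produced no audio")
    return ttfa, total, count


ttfa, total, n = measure_ttfa(fake_tts_stream())
print(f"TTFA {ttfa * 1000:.0f} ms, total {total * 1000:.0f} ms over {n} chunks")
```

Since generator bodies run lazily, the simulated startup delay is counted inside the timed loop, which mirrors how a user experiences latency: playback can begin after the first chunk while later chunks are still being generated.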
Fish Agent v0.1 3B supports English, Chinese, German, Japanese, French, Spanish, Korean, and Arabic, and was trained on roughly 700,000 hours of multilingual audio. As a result, it can handle a range of languages and contexts and produce speech that sounds natural and close to human pronunciation.
Beyond speech generation and text-to-speech conversion, Fish Agent v0.1 3B offers the following key features:
Zero-shot voice cloning: clones a voice from a reference sample without any additional training.
Compact 3B parameters: at 3 billion parameters, the model is easier to develop with and build on.
Text and audio input: accepts either text or audio as input, allowing flexible interaction.
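The two input-related features above can be pictured as a request shape: each turn carries either text or audio, and zero-shot cloning means the target voice is supplied as a reference clip at inference time rather than through a training run. The sketch below is purely illustrative; `AudioClip`, `VoiceRequest`, and their fields are hypothetical names and do not reflect the actual Fish Speech or Fish Agent interfaces.

```python
from dataclasses import dataclass
from typing import Optional, Tuple, Union


@dataclass(frozen=True)
class AudioClip:
    """Hypothetical container for mono audio samples (not a Fish Speech type)."""
    samples: Tuple[float, ...]
    sample_rate: int

    @property
    def duration_s(self) -> float:
        return len(self.samples) / self.sample_rate


@dataclass(frozen=True)
class VoiceRequest:
    """Hypothetical single turn: text OR audio input, plus an optional voice reference."""
    user_input: Union[str, AudioClip]
    # Zero-shot cloning: the voice is given as a clip here; no model weights change.
    voice_reference: Optional[AudioClip] = None

    def describe(self) -> str:
        kind = "text" if isinstance(self.user_input, str) else "audio"
        if self.voice_reference is None:
            voice = "default voice"
        else:
            voice = f"cloned from {self.voice_reference.duration_s:.1f} s reference"
        return f"{kind} input, {voice}"


ref = AudioClip(samples=(0.0,) * (16000 * 3), sample_rate=16000)  # 3 s of silence
print(VoiceRequest("Hello there", voice_reference=ref).describe())
# text input, cloned from 3.0 s reference
print(VoiceRequest(AudioClip((0.0,) * 8000, 16000)).describe())
# audio input, default voice
```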
Fish Audio has open-sourced Fish Agent v0.1 3B and released a preliminary demo for users to try. The release should further advance AI voice technology and open up more possibilities for applications such as voice assistants and virtual humans.
GitHub: https://github.com/fishaudio/fish-speech
Fish Agent Demo: https://huggingface.co/spaces/fishaudio/fish-agent
Model Download: https://huggingface.co/fishaudio/fish-agent-v0.1-3b
Technical Report: https://arxiv.org/abs/2411.01156