The Future Is Here! Alibaba's New Voice Technology CosyVoice Makes AI Speak More Naturally

AIbase基地

Published in AI News · 5 min read · Aug 2, 2024

Alibaba's latest speech synthesis model, CosyVoice, offers a glimpse of future human-machine interaction with its realism and flexibility.

The model can generate voices that match a specific gender, age, and personality, and can simulate natural speech features such as laughter, coughing, and breathing. It can also infuse the generated voice with emotion and style, making AI speech more vivid and varied.

CosyVoice, however, is only one part of Alibaba's voice technology. Together with another model, SenseVoice, it forms a framework called FunAudioLLM, which aims to improve voice interaction between humans and large language models (LLMs). SenseVoice handles high-precision multilingual speech recognition, emotion recognition, and audio event detection, supporting over 50 languages with fast response times.

FunAudioLLM has several anticipated applications. One is real-time voice translation for communicating with people who speak other languages. Another is emotional voice chat, in which the AI responds according to your emotional state. For literature enthusiasts, the technology can also produce expressive audiobooks that make listening more immersive.

The speech-to-speech translation function of FunAudioLLM works in three steps. When you speak a sentence, SenseVoice recognizes your speech, a large language model processes the recognized text, and CosyVoice then speaks the result in another language. The process is described as fast and accurate, making cross-language communication smoother.
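The three-step flow described above can be sketched as a simple chain of stages. This is only an illustration under stated assumptions: the class `SpeechTranslator`, the stage signatures, and the stand-in functions are all hypothetical and are not the FunAudioLLM, SenseVoice, or CosyVoice APIs. In a real system each stage would wrap an actual model.

```python
from dataclasses import dataclass
from typing import Callable

# Hypothetical stage signatures (assumptions, not real FunAudioLLM APIs).
Recognizer = Callable[[bytes], str]        # audio -> source-language text
Translator = Callable[[str, str], str]     # (text, target_lang) -> translated text
Synthesizer = Callable[[str, str], bytes]  # (text, target_lang) -> audio


@dataclass
class SpeechTranslator:
    """Chains recognition, translation, and synthesis in order."""
    recognize: Recognizer
    translate: Translator
    synthesize: Synthesizer

    def run(self, audio: bytes, target_lang: str) -> bytes:
        text = self.recognize(audio)                     # role played by SenseVoice
        translated = self.translate(text, target_lang)   # role played by the LLM
        return self.synthesize(translated, target_lang)  # role played by CosyVoice


# Stand-in stages so the sketch runs without any model.
def fake_recognize(audio: bytes) -> str:
    # Pretend the audio bytes are already UTF-8 text.
    return audio.decode("utf-8")


def fake_translate(text: str, target_lang: str) -> str:
    # Tag the text with the target language instead of translating it.
    return f"[{target_lang}] {text}"


def fake_synthesize(text: str, target_lang: str) -> bytes:
    # Return the text as bytes in place of a waveform.
    return text.encode("utf-8")


pipeline = SpeechTranslator(fake_recognize, fake_translate, fake_synthesize)
result = pipeline.run(b"hello", "zh")
```

Keeping each stage behind a plain callable means any one model could be swapped out without touching the other two.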

FunAudioLLM also handles emotional interaction. It not only recognizes the user's emotional state but also generates a voice response with a corresponding emotion. This could play a significant role in scenarios that require emotional interaction, such as psychological counseling and online education, giving users a warmer, more human experience.
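One way to picture this is a small lookup that turns a recognized emotion into a speaking style for the reply. The sketch below is an assumption made for illustration: the emotion labels, the style names, and the `Utterance` and `choose_style` helpers are hypothetical, and the article does not say how FunAudioLLM actually maps emotions to responses.

```python
from dataclasses import dataclass


@dataclass
class Utterance:
    text: str
    emotion: str  # hypothetical label from a recognizer, e.g. "sad" or "happy"


# Hypothetical mapping from the user's emotion to a reply speaking style.
STYLE_FOR_EMOTION = {
    "sad": "gentle",
    "angry": "calm",
    "happy": "cheerful",
}


def choose_style(utterance: Utterance) -> str:
    """Pick a speaking style for the reply; fall back to a neutral voice."""
    return STYLE_FOR_EMOTION.get(utterance.emotion, "neutral")


style = choose_style(Utterance("I failed my exam today.", "sad"))
```

The chosen style would then be passed to the speech synthesizer along with the reply text.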

For literature lovers, the audiobook production enabled by FunAudioLLM is a welcome addition. By analyzing the emotions in a book, CosyVoice can deliver more vivid and emotional readings, letting listeners feel as if they are inside the story and experience the emotions the author wants to convey.

Alibaba's breakthrough not only showcases China's innovative capabilities in AI but also points to a new era of human-machine interaction. In the near future, our conversations with AI may become so natural that it is difficult to tell whether we are talking to a real human. This development is expected to bring major changes to fields such as education, entertainment, and customer service, making our lives more convenient and vibrant.

With continuous technological advancements, we have reason to believe that future AI will not only understand our words but also truly comprehend our emotions, becoming an indispensable intelligent companion in our lives. Alibaba's CosyVoice and FunAudioLLM framework undoubtedly pave the way for this promising future. Let us look forward to the not-too-distant future when interacting with AI will be as natural and enjoyable as chatting with an old friend.

Project link: https://top.aibase.com/tool/cosyvoice

This article is from AIbase Daily
