Spark-TTS, a recently released text-to-speech system, has drawn considerable attention in the AI community. Recent posts on X and the accompanying research highlight two standout capabilities, zero-shot voice cloning and fine-grained voice control, positioning it as a notable advance in speech synthesis.


Leveraging the power of large language models (LLMs), the system aims for highly accurate and natural speech synthesis suitable for both research and commercial applications. Spark-TTS is designed for simplicity and efficiency: built entirely on Qwen2.5, it eliminates the need for additional generative models in the pipeline. Instead of predicting intermediate acoustic features, Spark-TTS reconstructs audio directly from the tokens predicted by the LLM, which streamlines generation, improves efficiency, and reduces technical complexity.
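To make that single-stage design concrete, here is a minimal mock of the flow in Python. The function names (`predict_speech_tokens`, `bicodec_decode`) are illustrative stand-ins rather than the project's actual API; the point is the shape of the pipeline: the LLM emits discrete speech tokens, and the codec decoder turns them straight into audio.

```python
import numpy as np

def predict_speech_tokens(text: str) -> list[int]:
    """Stand-in for the Qwen2.5 LM: autoregressively predicts
    low-bitrate speech tokens for the input text. (Mock output.)"""
    rng = np.random.default_rng(abs(hash(text)) % (2**32))
    return rng.integers(0, 8192, size=len(text) * 4).tolist()

def bicodec_decode(tokens: list[int], sample_rate: int = 16000) -> np.ndarray:
    """Stand-in for the BiCodec decoder: turns predicted tokens
    directly into a waveform -- no intermediate acoustic model
    or separate vocoder stage. (Mock output.)"""
    duration = len(tokens) / 50.0               # assume ~50 tokens per second
    t = np.linspace(0.0, duration, int(duration * sample_rate))
    return 0.1 * np.sin(2 * np.pi * 220.0 * t)  # placeholder sine "audio"

# Single-stage flow: text -> LLM tokens -> waveform.
tokens = predict_speech_tokens("Hello from Spark-TTS.")
waveform = bicodec_decode(tokens)
print(f"{len(tokens)} tokens -> {waveform.shape[0]} audio samples")
```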

Beyond efficient audio generation, Spark-TTS offers strong voice cloning. It supports zero-shot cloning: given only a short reference clip, it can replicate a speaker's voice without any training data specific to that speaker.
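A minimal sketch of what zero-shot usage could look like; `clone_voice` and its arguments are hypothetical names for illustration, not the repository's real interface. The key property is that the reference clip is only encoded at inference time, never trained on:

```python
import numpy as np

def clone_voice(text: str, reference_wav: np.ndarray,
                sample_rate: int = 16000) -> np.ndarray:
    """Hypothetical zero-shot cloning call: the reference clip is
    encoded into fixed-length speaker tokens at inference time;
    no fine-tuning or gradient updates take place. (Mock output.)"""
    speaker_trait = float(np.abs(reference_wav).mean())  # mock speaker code
    num_samples = len(text) * sample_rate // 12          # mock duration
    t = np.arange(num_samples) / sample_rate
    return 0.1 * np.sin(2 * np.pi * (180.0 + 80.0 * speaker_trait) * t)

reference = np.random.default_rng(0).uniform(-1, 1, 16000 * 3)  # ~3 s clip
audio = clone_voice("Any new sentence, spoken in the reference voice.", reference)
print(f"Synthesized {len(audio) / 16000:.1f} s of audio")
```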

Key Features of Spark-TTS:

Zero-shot Voice Cloning: Reproduces a speaker's voice style without speaker-specific training data, ideal for rapid personalization.

Fine-grained Voice Control: Speech rate and pitch can be adjusted precisely, from speeding up or slowing down delivery to changing intonation (a sketch follows this list).

Cross-lingual Generation: Supports multiple languages, including English and Chinese, expanding its global applicability.
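As a concrete picture of what fine-grained control can look like, the sketch below exposes the mentioned attributes as explicit parameters. The `VoiceControls` type and `synthesize` function are assumptions made for illustration; the project's real interface may differ.

```python
from dataclasses import dataclass

@dataclass
class VoiceControls:
    """Illustrative attribute knobs for controllable synthesis."""
    gender: str = "female"   # coarse speaker attribute
    pitch: str = "moderate"  # e.g. low / moderate / high
    speed: str = "moderate"  # speech rate on the same coarse scale

def synthesize(text: str, controls: VoiceControls) -> str:
    """Hypothetical controllable-TTS entry point: the control labels
    are folded into the LLM prompt, steering generation without any
    reference audio. (Mock: returns the conditioned prompt.)"""
    return (f"<gender:{controls.gender}><pitch:{controls.pitch}>"
            f"<speed:{controls.speed}>{text}")

# Speed up delivery and raise the pitch for a brighter read.
print(synthesize("Read this quickly and brightly.",
                 VoiceControls(pitch="high", speed="high")))
```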

Early user feedback describes the speech quality as highly natural, making the system particularly well suited to audiobook production.

Technical Architecture

Spark-TTS is built on BiCodec, a single-stream speech codec that decomposes speech into two types of tokens:

Low-bitrate semantic tokens, responsible for linguistic content.

Fixed-length global tokens, responsible for speaker attributes.

This separation allows speech characteristics to be adjusted flexibly and independently. Combined with Chain-of-Thought (CoT) reasoning from Qwen2.5, the LLM that supplies the system's semantic understanding, it further improves the quality and controllability of speech generation.
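A mock of the decomposition makes the benefit visible (`bicodec_encode` is an illustrative name and the token values are fabricated): because the two streams are independent, the global tokens from one recording can be paired with semantic tokens for entirely new text, which is exactly what enables zero-shot cloning.

```python
import numpy as np

def bicodec_encode(wav: np.ndarray) -> tuple[list[int], list[int]]:
    """Mock BiCodec encoder: splits speech into variable-length,
    low-bitrate semantic tokens (linguistic content) and a
    fixed-length set of global tokens (speaker attributes)."""
    rng = np.random.default_rng(int(np.abs(wav).sum() * 1000) % (2**32))
    semantic_tokens = rng.integers(0, 8192, size=len(wav) // 320).tolist()
    global_tokens = rng.integers(0, 4096, size=32).tolist()  # fixed length
    return semantic_tokens, global_tokens

# Decoupling in action: keep WHO is speaking, change WHAT is said.
speaker_clip = np.random.default_rng(1).uniform(-1, 1, 16000 * 3)
_, global_tokens = bicodec_encode(speaker_clip)   # speaker identity
new_semantic = [101, 57, 930, 4]                  # new content (from the LLM)
print(f"{len(global_tokens)} global + {len(new_semantic)} semantic tokens -> decoder")
```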

Spark-TTS also excels in language support: it handles Chinese and English within a single model while maintaining high naturalness and accuracy in cross-lingual synthesis. Users can further customize the virtual speaker by adjusting parameters such as gender, tone, and speech rate.
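Since both languages live in one model, mixed Chinese-English input needs no language flag or per-language model switch. A tiny hypothetical helper illustrates the usage pattern (`synthesize_bilingual` is not a real API):

```python
def synthesize_bilingual(text: str) -> str:
    """Hypothetical single-call synthesis for mixed-language text:
    one model handles both Chinese and English, so the caller never
    selects a language explicitly. (Mock output.)"""
    return f"[audio for: {text}]"

print(synthesize_bilingual("Spark-TTS 支持中英混合: it can switch languages mid-sentence."))
```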

Project: https://github.com/SparkAudio/Spark-TTS