Spark-TTS, a recently released text-to-speech system, has drawn considerable attention in the AI community. Recent posts on X and the accompanying research highlight two standout capabilities, zero-shot voice cloning and fine-grained voice control, positioning it as a notable advance in speech synthesis.


Leveraging the power of large language models (LLMs), the system aims for highly accurate and natural speech synthesis suitable for both research and commercial applications. Spark-TTS is designed for simplicity and efficiency: built entirely on Qwen2.5, it eliminates the need for additional generative models in the pipeline. Instead of predicting intermediate acoustic features, Spark-TTS reconstructs audio directly from the tokens predicted by the LLM, which streamlines generation, improves efficiency, and reduces technical complexity.
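To make that single-stage design concrete, here is a minimal mock of the flow in Python. The function names (`predict_speech_tokens`, `bicodec_decode`) are illustrative stand-ins rather than the project's actual API; the point is the shape of the pipeline: the LLM emits discrete speech tokens, and the codec decoder turns them straight into audio.

```python
import numpy as np

def predict_speech_tokens(text: str) -> list[int]:
    """Stand-in for the Qwen2.5 LM: autoregressively predicts
    low-bitrate speech tokens for the input text. (Mock output.)"""
    rng = np.random.default_rng(abs(hash(text)) % (2**32))
    return rng.integers(0, 8192, size=len(text) * 4).tolist()

def bicodec_decode(tokens: list[int], sample_rate: int = 16000) -> np.ndarray:
    """Stand-in for the BiCodec decoder: turns predicted tokens
    directly into a waveform -- no intermediate acoustic model
    or separate vocoder stage. (Mock output.)"""
    duration = len(tokens) / 50.0               # assume ~50 tokens per second
    t = np.linspace(0.0, duration, int(duration * sample_rate))
    return 0.1 * np.sin(2 * np.pi * 220.0 * t)  # placeholder sine "audio"

# Single-stage flow: text -> LLM tokens -> waveform.
tokens = predict_speech_tokens("Hello from Spark-TTS.")
waveform = bicodec_decode(tokens)
print(f"{len(tokens)} tokens -> {waveform.shape[0]} audio samples")
```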

Beyond efficient audio generation, Spark-TTS offers strong voice cloning. It supports zero-shot cloning: given only a short reference clip, it can replicate a speaker's voice without any training data specific to that speaker.
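A minimal sketch of what zero-shot usage could look like; `clone_voice` and its arguments are hypothetical names for illustration, not the repository's real interface. The key property is that the reference clip is only encoded at inference time, never trained on:

```python
import numpy as np

def clone_voice(text: str, reference_wav: np.ndarray,
                sample_rate: int = 16000) -> np.ndarray:
    """Hypothetical zero-shot cloning call: the reference clip is
    encoded into fixed-length speaker tokens at inference time;
    no fine-tuning or gradient updates take place. (Mock output.)"""
    speaker_trait = float(np.abs(reference_wav).mean())  # mock speaker code
    num_samples = len(text) * sample_rate // 12          # mock duration
    t = np.arange(num_samples) / sample_rate
    return 0.1 * np.sin(2 * np.pi * (180.0 + 80.0 * speaker_trait) * t)

reference = np.random.default_rng(0).uniform(-1, 1, 16000 * 3)  # ~3 s clip
audio = clone_voice("Any new sentence, spoken in the reference voice.", reference)
print(f"Synthesized {len(audio) / 16000:.1f} s of audio")
```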

Key Features of Spark-TTS:

Zero-shot Voice Cloning: Reproduces a speaker's voice style without speaker-specific training data, ideal for rapid personalization.

Fine-grained Voice Control: Speech rate and pitch can be adjusted precisely, from speeding up or slowing down delivery to changing intonation (a sketch follows this list).

Cross-lingual Generation: Supports multiple languages, including English and Chinese, expanding its global applicability.
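As a concrete picture of what fine-grained control can look like, the sketch below exposes the mentioned attributes as explicit parameters. The `VoiceControls` type and `synthesize` function are assumptions made for illustration; the project's real interface may differ.

```python
from dataclasses import dataclass

@dataclass
class VoiceControls:
    """Illustrative attribute knobs for controllable synthesis."""
    gender: str = "female"   # coarse speaker attribute
    pitch: str = "moderate"  # e.g. low / moderate / high
    speed: str = "moderate"  # speech rate on the same coarse scale

def synthesize(text: str, controls: VoiceControls) -> str:
    """Hypothetical controllable-TTS entry point: the control labels
    are folded into the LLM prompt, steering generation without any
    reference audio. (Mock: returns the conditioned prompt.)"""
    return (f"<gender:{controls.gender}><pitch:{controls.pitch}>"
            f"<speed:{controls.speed}>{text}")

# Speed up delivery and raise the pitch for a brighter read.
print(synthesize("Read this quickly and brightly.",
                 VoiceControls(pitch="high", speed="high")))
```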

Early user feedback describes the speech quality as highly natural, making the system particularly well suited to audiobook production.

Technical Architecture

Spark-TTS is built on BiCodec, a single-stream speech codec that decomposes speech into two types of tokens:

Low-bitrate semantic tokens, responsible for linguistic content.

Fixed-length global tokens, responsible for speaker attributes.

This separation allows speech characteristics to be adjusted flexibly and independently. Combined with Chain-of-Thought (CoT) reasoning from Qwen2.5, the LLM that supplies the system's semantic understanding, it further improves the quality and controllability of speech generation.
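A mock of the decomposition makes the benefit visible (`bicodec_encode` is an illustrative name and the token values are fabricated): because the two streams are independent, the global tokens from one recording can be paired with semantic tokens for entirely new text, which is exactly what enables zero-shot cloning.

```python
import numpy as np

def bicodec_encode(wav: np.ndarray) -> tuple[list[int], list[int]]:
    """Mock BiCodec encoder: splits speech into variable-length,
    low-bitrate semantic tokens (linguistic content) and a
    fixed-length set of global tokens (speaker attributes)."""
    rng = np.random.default_rng(int(np.abs(wav).sum() * 1000) % (2**32))
    semantic_tokens = rng.integers(0, 8192, size=len(wav) // 320).tolist()
    global_tokens = rng.integers(0, 4096, size=32).tolist()  # fixed length
    return semantic_tokens, global_tokens

# Decoupling in action: keep WHO is speaking, change WHAT is said.
speaker_clip = np.random.default_rng(1).uniform(-1, 1, 16000 * 3)
_, global_tokens = bicodec_encode(speaker_clip)   # speaker identity
new_semantic = [101, 57, 930, 4]                  # new content (from the LLM)
print(f"{len(global_tokens)} global + {len(new_semantic)} semantic tokens -> decoder")
```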

Spark-TTS also excels in language support: it handles Chinese and English within a single model while maintaining high naturalness and accuracy in cross-lingual synthesis. Users can further customize the virtual speaker by adjusting parameters such as gender, tone, and speech rate.
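Since both languages live in one model, mixed Chinese-English input needs no language flag or per-language model switch. A tiny hypothetical helper illustrates the usage pattern (`synthesize_bilingual` is not a real API):

```python
def synthesize_bilingual(text: str) -> str:
    """Hypothetical single-call synthesis for mixed-language text:
    one model handles both Chinese and English, so the caller never
    selects a language explicitly. (Mock output.)"""
    return f"[audio for: {text}]"

print(synthesize_bilingual("Spark-TTS 支持中英混合: it can switch languages mid-sentence."))
```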

Project: https://github.com/SparkAudio/Spark-TTS