Spark-TTS

Spark-TTS is a highly efficient single-stream decoupled speech synthesis model based on large language models.

Common Product, Productivity, Speech Synthesis, Large Language Model

Spark-TTS is an efficient text-to-speech model built on a large language model and a single-stream representation of decoupled speech tokens. It reconstructs audio directly from the codes the language model predicts, with no separate acoustic feature generation model, which improves efficiency and reduces system complexity. The model supports zero-shot text-to-speech synthesis, including cross-lingual and code-switching scenarios, which suits applications that need natural, accurate speech. It also supports virtual voice creation: users can generate different voices by adjusting parameters such as gender, pitch, and speaking rate. The model is designed to address the inefficiency and complexity of traditional speech synthesis pipelines for both research and production use. It is currently intended primarily for academic research and legitimate applications such as personalized speech synthesis, assistive technologies, and language research.

Spark-TTS Alternatives

VideoPoet — A large language model for video generation

EaseVoice Trainer — A simple and easy-to-use speech cloning and speech model training tool.

WeClone — Fine-tune a large language model using WeChat chat logs to achieve high-quality voice cloning.

Dream 7B — Dream 7B is a state-of-the-art open diffusion large language model.

MegaTTS 3 — A highly efficient speech synthesis model that supports Chinese, English, and speech cloning.

OpenAI.fm — Developers can interactively experience the new voice models gpt-4o-transcribe, gpt-4o-mini-transcribe, and gpt-4o-mini-tts in the OpenAI API.

Orpheus TTS — An open-source text-to-speech system dedicated to achieving natural human speech.

CSM 1B — CSM 1B is a text-to-speech generation model developed by Sesame, capable of generating high-quality audio.

Sesame CSM — A model for generating conversational speech, supporting high-quality speech generation from text and audio input.

Sesame AI — Sesame AI is an advanced text-to-speech platform that generates natural conversational speech with emotional intelligence.

Argo — Easily build your own large language model. Exclusive intelligence, all locally.

NotaGen — NotaGen is a model for symbolic music generation, employing a large language model training paradigm and focusing on generating high-quality classical music scores.

AoT — Atom of Thoughts (AoT) is a framework for improving the reasoning performance of large language models.

Llasa — A TTS base model based on the Llama framework, compatible with 160,000 hours of tokenized speech data.

Level-Navi Agent-Search — Level-Navi Agent is a ready-to-use framework that utilizes large language models for in-depth query understanding and precise search.

Octave TTS — Octave TTS is the first speech synthesis model capable of understanding the meaning of text, generating speech that is rich in emotion and style.

IndexTTS — An industrial-grade, controllable, and efficient zero-shot text-to-speech system

M2RAG — A benchmark codebase for retrieval-augmented generation in multimodal contexts.

SWE-RL — Enhancing the reasoning capabilities of large language models in open-source software evolution through reinforcement learning.

TableGPT2-7B — TableGPT2-7B is a large language model specializing in tabular data processing, suitable for data analysis and business intelligence tasks.

Coding-Tutor — Explores the potential of large language models as programming tutoring tools and proposes the Trace-and-Verify workflow.

Tbox - AI Powered Intelligent Agent Builder — Leveraging Alipay's lifestyle scenarios and leading large language model technology, Tbox enables businesses to quickly build professional-grade intelligent agents.

MoBA — MoBA (Mixture of Block Attention) is an attention mechanism for long-context inputs, designed to improve the efficiency of large language models.

Goedel-Prover — Goedel-Prover is an open-source automated theorem proving model focused on the formal verification of mathematical problems.

OmniParser-v2.0 — OmniParser is a versatile screen parsing tool that converts UI screenshots into a structured format, improving the performance of LLM-based UI agents.

Xingsheng AI — Xingsheng AI is an AI podcast generator that can create AI podcasts from any content.

LLaSA_training — LLaSA: scaling train-time and inference-time compute for LLaMA-based speech synthesis.

Mistral-Small-24B-Instruct-2501 — Mistral Small 24B is a multilingual, high-performance instruction-tuned large language model suitable for various application scenarios.

MNN Large Model Android App — A fully functional Android app supporting multimodal capabilities with a large language model.