MaskGCT

Zero-shot text-to-speech conversion model that does not require alignment information.

MaskGCT is a zero-shot text-to-speech (TTS) model. It addresses limitations of both autoregressive and non-autoregressive systems by removing the need for explicit text-speech alignment information and phone-level duration prediction. MaskGCT is a two-stage model: in the first stage, it uses text to predict semantic tokens extracted from a speech self-supervised learning (SSL) model; in the second stage, it predicts acoustic tokens conditioned on those semantic tokens. Both stages follow a mask-and-predict learning paradigm: during training, the model learns to predict masked semantic or acoustic tokens given the conditions and prompts. During inference, the model generates a token sequence of a specified length in parallel. In the reported experiments, MaskGCT outperforms the current state-of-the-art zero-shot TTS systems in quality, similarity, and intelligibility.
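The mask-and-predict inference described above can be illustrated with a toy loop. This is a minimal sketch and not MaskGCT's actual implementation: the `MASK` sentinel, the `toy_predict` stub, the cosine unmasking schedule, and all parameter names are assumptions made for illustration. A real model would condition its predictions on the text and a speech prompt, whereas the stub only returns random guesses so that the loop can run.

```python
import math
import random

MASK = -1  # hypothetical sentinel marking a masked position


def toy_predict(tokens, vocab_size, rng):
    """Stand-in for a trained model: one (token, confidence) pair per position.

    Assumption: a real model would condition on text and a prompt; this stub
    ignores its input and draws random guesses.
    """
    return [(rng.randrange(vocab_size), rng.random()) for _ in tokens]


def masked_parallel_decode(length, vocab_size=16, steps=4, seed=0):
    """Fill a fixed-length token sequence over a few parallel refinement steps."""
    rng = random.Random(seed)
    tokens = [MASK] * length  # start fully masked at the requested length
    for step in range(1, steps + 1):
        masked = [i for i, t in enumerate(tokens) if t == MASK]
        if not masked:
            break
        # All positions are predicted at once (in parallel) on every step.
        preds = toy_predict(tokens, vocab_size, rng)
        # Assumed cosine schedule: how many positions stay masked after this step.
        keep_masked = int(length * math.cos(math.pi / 2 * step / steps))
        n_commit = max(1, len(masked) - keep_masked)
        # Commit the most confident predictions among the masked positions.
        masked.sort(key=lambda i: preds[i][1], reverse=True)
        for i in masked[:n_commit]:
            tokens[i] = preds[i][0]
    return tokens


if __name__ == "__main__":
    print(masked_parallel_decode(length=10))
```

The output length is fixed up front, which mirrors the "specified length" in the description: the loop never appends tokens, it only replaces masked positions with committed predictions until none remain.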

MaskGCT Alternatives

MegaTTS 3 — A highly efficient speech synthesis model that supports Chinese, English, and speech cloning.

OpenAI.fm — Developers can interactively experience the new voice models gpt-4o-transcribe, gpt-4o-mini-transcribe, and gpt-4o-mini-tts in the OpenAI API.

Orpheus TTS — An open-source text-to-speech system dedicated to achieving natural human speech.

CSM 1B — CSM 1B is a text-to-speech generation model developed by Sesame, capable of generating high-quality audio.

Llasa-1B — Llasa-1B is a text-to-speech (TTS) model based on the LLaMA architecture, supporting both Chinese and English speech synthesis.

Llasa-3B — Llasa-3B is a text-to-speech synthesis model based on LLaMA that supports speech generation in both Chinese and English.

Kokoro-82M — A cutting-edge text-to-speech (TTS) model with 82 million parameters.

OuteTTS-0.2-500M — High-performance text-to-speech synthesis model

OuteTTS — An experimental text-to-speech model.

MaskGCT TTS Demo — Text-to-speech demonstration based on the MaskGCT model.

F5-TTS — A high-quality text-to-speech synthesis model based on deep learning.

VALL-E 2 — A speech synthesis technology developed by Microsoft Research Asia

Bailing-TTS — A large-scale text-to-speech model for generating high-quality Chinese dialect voices.

ToucanTTS — Multilingual controllable text-to-speech synthesis toolkit

Aura TTS Demo by Deepgram — Deepgram's Aura TTS demo showcases advanced speech synthesis technology.

NaturalSpeech 3 — NaturalSpeech 3 is a zero-shot speech synthesis system that uses a factorized codec and diffusion models to generate natural-sounding speech.

Whisper Speech — Open-source text-to-speech system

StyleTTS 2 — Human-level text-to-speech synthesis model

EaseVoice Trainer — A simple and easy-to-use speech cloning and speech model training tool.

Podcastle AI Voices — Converts text into natural-sounding speech, boasting over 1000 realistic AI voices.

Sesame CSM — A model for generating conversational speech, supporting high-quality speech generation from text and audio input.

Zonos TTS — Zonos TTS is a high-quality AI text-to-speech technology that supports multiple languages, emotion control, and zero-shot text-to-speech cloning.

Sesame AI — Sesame AI is an advanced text-to-speech platform that generates natural conversational speech with emotional intelligence.

KokoroTTS — Kokoro TTS is a high-performance text-to-speech tool that supports multiple languages and voice blending, free for commercial use.

Spark-TTS — Spark-TTS is a highly efficient single-stream decoupled speech synthesis model based on large language models.

Llasa — A TTS base model based on the Llama framework, compatible with 160,000 hours of tokenized speech data.

Level-Navi Agent-Search — Level-Navi Agent is a ready-to-use framework that utilizes large language models for in-depth query understanding and precise search.

Lemonfox.ai Text-to-Speech API — A low-cost, high-quality text-to-speech API supporting multiple languages and accents, easy to integrate.

Octave TTS — Octave TTS is the first speech synthesis model capable of understanding the meaning of text, generating speech that is rich in emotion and style.