MaskGCT

Zero-shot text-to-speech conversion model that does not require alignment information.

MaskGCT is a zero-shot text-to-speech (TTS) model that addresses limitations of existing autoregressive and non-autoregressive systems by removing the need for explicit text-speech alignment information and phone-level duration prediction.

MaskGCT is a two-stage model. In the first stage, it uses text to predict semantic tokens extracted from a speech self-supervised learning (SSL) model. In the second stage, it predicts acoustic tokens conditioned on those semantic tokens.

Both stages follow a mask-and-predict learning paradigm: during training, the model learns to predict masked semantic or acoustic tokens from the given conditions and prompts. During inference, it generates a token sequence of a specified length in parallel.

According to the reported experiments, MaskGCT surpasses current state-of-the-art zero-shot TTS systems in quality, similarity, and intelligibility.
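The mask-and-predict inference described above can be illustrated with a small sketch. It is not MaskGCT's implementation: the predictor is a random stand-in that ignores text and prompt conditioning, and the cosine unmasking schedule, the names `MASK`, `toy_predictor`, and `iterative_parallel_decode`, and all parameter values are assumptions chosen for illustration. It only shows the general idea of starting from a fully masked sequence of a chosen length and filling positions in parallel over a few steps.

```python
import math
import random

MASK = -1  # hypothetical id marking a still-masked position


def toy_predictor(tokens, vocab_size, rng):
    """Stand-in for a trained model (assumption): returns a
    (token_id, confidence) pair for every position, ignoring context."""
    return [(rng.randrange(vocab_size), rng.random()) for _ in tokens]


def iterative_parallel_decode(length, steps, vocab_size=16, seed=0):
    """Fill a fully masked sequence of a fixed length in a few parallel steps."""
    rng = random.Random(seed)
    tokens = [MASK] * length  # the target length is specified up front
    for step in range(1, steps + 1):
        masked = [i for i, t in enumerate(tokens) if t == MASK]
        if not masked:
            break
        # Predict all positions at once (in parallel, conceptually).
        preds = toy_predictor(tokens, vocab_size, rng)
        # Assumed cosine schedule: how many positions stay masked after this step.
        keep_masked = int(length * math.cos(math.pi / 2 * step / steps))
        n_fill = max(1, len(masked) - keep_masked)
        # Commit the most confident predictions; the rest stay masked.
        by_confidence = sorted(masked, key=lambda i: preds[i][1], reverse=True)
        for i in by_confidence[:n_fill]:
            tokens[i] = preds[i][0]
    return tokens


if __name__ == "__main__":
    print(iterative_parallel_decode(length=20, steps=5))
```

In a two-stage setup like the one described, a loop of this shape would presumably run once to produce semantic tokens from text and again to produce acoustic tokens from the semantic tokens, with a real model in place of the toy predictor.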

MaskGCT Visit Over Time

Monthly Visits: 404
Bounce Rate: 38.07%
Pages per Visit: 1.9
Visit Duration: 00:01:58
