MaskGCT
A zero-shot text-to-speech (TTS) model that requires no explicit alignment information.
Tags: Text-to-speech · Zero-shot learning
MaskGCT is a zero-shot text-to-speech (TTS) model that addresses key limitations of prior autoregressive and non-autoregressive systems by eliminating the need for explicit text-speech alignment information and phone-level duration prediction. MaskGCT uses a two-stage framework: the first stage predicts semantic tokens, extracted from a speech self-supervised learning (SSL) model, from the input text; the second stage predicts acoustic tokens conditioned on those semantic tokens. Both stages follow a mask-and-predict learning paradigm: during training, the model learns to predict masked semantic or acoustic tokens given conditions and prompts; during inference, it generates tokens of a specified length in parallel. Experiments show that MaskGCT surpasses state-of-the-art zero-shot TTS systems in quality, similarity, and intelligibility.
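The sketch below illustrates the kind of mask-and-predict parallel decoding loop described above: start from a fully masked sequence of the target length, predict all masked positions at once, keep the most confident predictions, and re-mask the rest for the next step. All names here (`mask_and_predict`, `MASK_ID`, the `model(tokens, cond)` interface) are illustrative assumptions, not the actual MaskGCT implementation or API.

```python
# Minimal sketch of mask-and-predict (iterative parallel) decoding,
# assuming a hypothetical model(tokens, cond) -> (1, T, vocab) logits.
import torch

MASK_ID = 1024   # hypothetical id of the special [MASK] token
NUM_STEPS = 10   # number of parallel refinement steps

def mask_and_predict(model, cond, target_len, num_steps=NUM_STEPS):
    """Fill a fully masked sequence of a specified length in parallel steps."""
    tokens = torch.full((1, target_len), MASK_ID, dtype=torch.long)
    for step in range(num_steps):
        still_masked = tokens.eq(MASK_ID)
        num_masked = int(still_masked.sum())
        if num_masked == 0:
            break
        logits = model(tokens, cond)                 # predict all positions at once
        probs = logits.softmax(dim=-1)
        conf, pred = probs.max(dim=-1)               # per-position confidence / argmax
        conf = conf.masked_fill(~still_masked, float("-inf"))  # only fill masked slots
        # On the last step fill everything; otherwise keep the most confident half.
        num_fill = num_masked if step == num_steps - 1 else max(1, num_masked // 2)
        top = conf.topk(num_fill, dim=-1).indices
        tokens.scatter_(1, top, pred.gather(1, top))
    return tokens

# Hypothetical two-stage usage: both stages could reuse the same decoding loop.
# semantic  = mask_and_predict(text_to_semantic,  cond=text_and_prompt, target_len=200)
# acoustic  = mask_and_predict(semantic_to_acoustic, cond=semantic,     target_len=200)
```

In this reading, the parallel refinement loop is what removes the need for phone-level durations: the target token length is chosen up front, and the model fills it in over a fixed number of steps rather than emitting tokens one at a time.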