StyleTTS 2

Human-level text-to-speech synthesis model

Tags: Common Product, Music, Text-to-speech, Speech synthesis
StyleTTS 2 is a text-to-speech (TTS) model that combines style diffusion with adversarial training against large speech language models (SLMs) to achieve human-level TTS synthesis. It treats speaking style as a latent random variable and samples it with a diffusion model, generating the style best suited to the input text without requiring a reference recording. It also uses large pre-trained SLMs such as WavLM as discriminators, together with a novel differentiable duration model, enabling end-to-end training that improves the naturalness of the synthesized speech. StyleTTS 2 surpasses human recordings on the single-speaker LJSpeech dataset and matches them on the multi-speaker VCTK dataset, as judged by native English-speaking evaluators. When trained on the LibriTTS dataset, it also outperforms prior publicly available models in zero-shot speaker adaptation. By demonstrating the potential of style diffusion and adversarial training with large SLMs, this work achieves human-level TTS synthesis on both single- and multi-speaker datasets.
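To make the "style as a latent stochastic variable" idea concrete, the sketch below shows a generic reverse-diffusion sampling loop that starts from Gaussian noise and iteratively denoises it into a style vector. This is an illustrative toy, not StyleTTS 2's actual implementation: `toy_denoiser`, the vector dimension, and the noise schedule are all hypothetical stand-ins (a real model would be a trained network conditioned on the input text).

```python
import numpy as np

rng = np.random.default_rng(0)

def toy_denoiser(x, t):
    # Hypothetical stand-in for the learned noise-prediction network.
    # In StyleTTS 2 this role is played by a trained model conditioned
    # on the text; here we just return a small deterministic estimate.
    return 0.1 * x

def sample_style(dim=128, steps=50):
    """DDPM-style reverse loop: refine Gaussian noise into a style latent."""
    betas = np.linspace(1e-4, 0.02, steps)      # toy noise schedule
    alphas = 1.0 - betas
    alpha_bars = np.cumprod(alphas)

    x = rng.standard_normal(dim)                # start from pure noise
    for t in reversed(range(steps)):
        eps_hat = toy_denoiser(x, t)            # predicted noise at step t
        coef = betas[t] / np.sqrt(1.0 - alpha_bars[t])
        x = (x - coef * eps_hat) / np.sqrt(alphas[t])
        if t > 0:                               # add noise except at the last step
            x = x + np.sqrt(betas[t]) * rng.standard_normal(dim)
    return x

style = sample_style()
print(style.shape)  # (128,)
```

Because the style is sampled rather than copied from a reference utterance, each run can yield a different but text-appropriate style vector, which is what lets the model synthesize speech without a voice reference.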

StyleTTS 2 Visit Over Time

Monthly Visits: 494,758,773

Bounce Rate: 37.69%

Pages per Visit: 5.7

Visit Duration: 00:06:29
