With the continuous advancement of technology, artificial intelligence is no longer just a cold machine; it is becoming increasingly human-like. Imagine your intelligent assistant not only speaking fluent Mandarin but also communicating with you in your familiar hometown dialect, creating a truly intimate experience. The emergence of Bailing-TTS technology is turning this imagination into reality.


In the world of artificial intelligence, Text-to-Speech (TTS) is a significant field: it aims to enable machines to convert text into speech that sounds as if it came from a real person. With the rapid development of neural networks and deep learning, we can now build voice libraries and train TTS models that approach human-level quality. However, most existing systems can only generate speech in standard Mandarin rather than in dialects, and there is still room for improvement in speech quality.


The advent of Bailing-TTS marks a breakthrough in the field of dialect speech synthesis. The system is built on a multi-layer autoregressive transformer model trained on a large dataset rich in dialect data. By combining a continuous semi-supervised learning strategy, a dialect-specific mixture-of-experts network architecture, and a multi-stage training strategy, it can effectively generate Chinese dialect speech from text.
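To make the "multi-layer autoregressive transformer" idea concrete, here is a minimal sketch of autoregressive speech-token generation: speech is produced one discrete token at a time, conditioned on the input text tokens and on all previously generated speech tokens. The codebook size, the end-of-speech token, and the toy "transformer" (which just returns random logits) are illustrative assumptions, not details of Bailing-TTS itself.

```python
import numpy as np

rng = np.random.default_rng(0)
VOCAB = 64   # assumed size of the discrete speech-token codebook
EOS = 0      # assumed end-of-speech token

def toy_transformer(text_tokens, speech_tokens):
    """Stand-in for the real transformer: returns logits over the codebook."""
    return rng.standard_normal(VOCAB)

def generate(text_tokens, max_len=32):
    """Autoregressive decoding loop: each step conditions on all prior tokens."""
    speech = []
    for _ in range(max_len):
        logits = toy_transformer(text_tokens, speech)
        token = int(np.argmax(logits))  # greedy decoding for simplicity
        if token == EOS:                # stop when the model emits EOS
            break
        speech.append(token)
    return speech

speech_tokens = generate(text_tokens=[5, 17, 9])
print(len(speech_tokens))
```

A real system would sample from the logits rather than take the argmax, and would pass the generated tokens to a decoder that reconstructs the waveform.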

The architecture of Bailing-TTS includes several key components:

Continuous semi-supervised learning: Promotes a weak alignment between the two modalities, text tokens and speech tokens, enabling the model to learn from spontaneous, expressive speech.

Dialect-specific mixture-of-experts network architecture: A mixture-of-experts design that learns a unified representation shared across Chinese dialects alongside specific representations for each individual dialect.

Reinforcement learning-based hierarchical post-training extension technique: A four-stage training pipeline, spanning pre-training, fine-tuning, and reinforcement learning-based strategies, that enables the model to generate high-quality speech across multiple Chinese dialects.
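The mixture-of-experts component above can be sketched as a layer that combines a shared expert (the unified cross-dialect representation) with a per-dialect expert (the dialect-specific representation). The class names, the additive combination, and the routing by dialect label are assumptions for illustration, not Bailing-TTS internals.

```python
import numpy as np

rng = np.random.default_rng(0)

class Expert:
    """A tiny feed-forward expert: one linear layer plus ReLU."""
    def __init__(self, d_model):
        self.w = rng.standard_normal((d_model, d_model)) * 0.02
    def __call__(self, x):
        return np.maximum(x @ self.w, 0.0)

class DialectMoE:
    """Shared expert for the unified representation, one expert per dialect."""
    def __init__(self, d_model, dialects):
        self.shared = Expert(d_model)
        self.experts = {d: Expert(d_model) for d in dialects}
    def __call__(self, x, dialect):
        # Combine the unified representation with the dialect-specific one.
        return self.shared(x) + self.experts[dialect](x)

moe = DialectMoE(d_model=16, dialects=["mandarin", "cantonese", "sichuanese"])
h = rng.standard_normal((4, 16))      # 4 frames of hidden states
out = moe(h, dialect="cantonese")
print(out.shape)
```

In a production model each expert would be a full transformer feed-forward block and the routing would typically be learned; the fixed routing-by-label here is only meant to show how dialect-specific and shared capacity can coexist in one layer.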

Researchers have conducted extensive experimental evaluations of Bailing-TTS, reporting training details, evaluation datasets, and evaluation metrics. The results show that the dialect speech generated by Bailing-TTS approaches human speech in both naturalness and quality.
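Naturalness in TTS evaluations is commonly reported as a Mean Opinion Score (MOS): listeners rate samples on a 1 to 5 scale and the scores are averaged, often with a 95% confidence interval. The ratings below are made-up placeholders purely to illustrate the computation; they are not numbers from the Bailing-TTS evaluation.

```python
import statistics

# Hypothetical listener ratings on a 1-5 naturalness scale.
ratings = [4.5, 4.0, 4.5, 5.0, 4.0, 4.5, 4.0, 4.5]

mos = statistics.mean(ratings)                              # mean opinion score
ci95 = 1.96 * statistics.stdev(ratings) / len(ratings) ** 0.5  # 95% CI half-width
print(f"MOS = {mos:.2f} +/- {ci95:.2f}")
```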

Bailing-TTS not only achieves technical breakthroughs but also has a wide range of practical applications. Whether it's providing a richer chat service experience or promoting the dissemination of dialect culture, Bailing-TTS shows great potential.

Although Bailing-TTS has achieved initial success, there is still room for exploration in emotional speech synthesis and multimodal support. Researchers plan to develop the next-generation Bailing-TTS model to generate high-quality audio (speech/music) from video and text inputs and explore the possibility of simultaneously generating high-quality audio and video.