Geely Auto has recently made significant strides in speech synthesis: its independently developed HAM-TTS large model has surpassed the industry benchmark VALL-E and drawn widespread attention. This AI large model, named "Xingrui," shows notable improvements in key metrics such as pronunciation accuracy, naturalness, and speaker similarity.
The HAM-TTS model applies hierarchical acoustic modeling to token-based zero-shot text-to-speech, an approach intended to improve voice interaction in smart cockpits. At the same scale of 400 million parameters, the character error rate of HAM-TTS is 1.5% lower than that of VALL-E; with the full 800 million parameters, the character error rate drops by an additional 2.3%. In style consistency, pitch consistency, and overall score, the HAM-TTS model achieves an improvement of 10%.
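For readers unfamiliar with the metric, character error rate is conventionally defined as the character-level edit distance between a reference text and a transcript, divided by the length of the reference. The sketch below is a minimal illustration of that general definition, not Geely's evaluation code; the function names edit_distance and character_error_rate are hypothetical, and it assumes the transcript comes from running a speech recognizer over the synthesized audio, which is a common way to score TTS output.

```python
def edit_distance(reference: str, hypothesis: str) -> int:
    """Levenshtein distance: minimum character insertions, deletions, and substitutions."""
    # prev[j] holds the distance between the reference prefix seen so far
    # and the first j characters of the hypothesis.
    prev = list(range(len(hypothesis) + 1))
    for i, ref_char in enumerate(reference, start=1):
        curr = [i]
        for j, hyp_char in enumerate(hypothesis, start=1):
            substitution_cost = 0 if ref_char == hyp_char else 1
            curr.append(min(
                prev[j] + 1,                       # delete a reference character
                curr[j - 1] + 1,                   # insert a hypothesis character
                prev[j - 1] + substitution_cost,   # match or substitute
            ))
        prev = curr
    return prev[-1]


def character_error_rate(reference: str, hypothesis: str) -> float:
    """Edit distance normalized by the reference length (assumed definition of CER)."""
    if not reference:
        raise ValueError("reference text must not be empty")
    return edit_distance(reference, hypothesis) / len(reference)
```

Under this definition, a six-character reference with one misrecognized character scores 1/6, or about 16.7%. Lower values indicate that the synthesized speech was pronounced clearly enough to be transcribed back to the intended text.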
The advantages of the Xingrui model show not only in its performance metrics but also in its practicality. It keeps the speaker's voice stable across scenarios such as virtual character linkage, voice navigation, and news broadcasting, and adjusts tone, pitch, pauses, and emotion according to the context. Notably, the model can switch seamlessly between languages, including dialects and foreign languages, and needs only a 3-second sample to replicate a voice, far less than the 10 seconds or more that the industry typically requires.
The Geely team improved the model's performance by introducing hierarchical acoustic modeling. To address inaccurate pronunciation, they added a latent space variable sequence predictor and a text aligner, which match text to voice more precisely and make the synthesized speech more natural and fluent.
This breakthrough showcases Geely's R&D capabilities in intelligent technology and reflects its ambition in the AI field. Geely's Xingrui AI large model system has expanded in multiple directions, including multimodal large models and large language models, laying the foundation for smart automotive technology. Meanwhile, Geely's total cloud computing power has grown from 8.1 quintillion operations per second last year to 10.2 quintillion operations per second, an increase of roughly 26% that indicates continued investment in technology.
Following its initial success in electrification, Geely's breakthrough in intelligent technology offers new ideas and possibilities for the future of the automotive industry. It challenges the usual perception of traditional automakers and suggests that intelligence will become a key area of competition in the industry.
Paper link: https://arxiv.org/pdf/2403.05989