In real-time communication, whether over phone calls or video conferences, voice is a key medium of self-expression. But what if we could change a speaker's timbre in real time without affecting the linguistic content or the prosody? StreamVC makes this possible.

StreamVC is a voice conversion solution that matches the timbre of a target voice while preserving the content and prosody of the source speech. Unlike traditional methods, StreamVC generates the output waveform with low latency relative to the input signal, even on mobile platforms, making it suitable for real-time communication scenarios such as phone calls and video conferences, as well as for voice anonymization in those settings.

Technical Highlights:

Real-Time Performance: StreamVC achieves an inference latency of 70.8 milliseconds on mobile devices.

High-Quality Speech Synthesis: Utilizing the architecture and training strategies of the SoundStream neural audio codec, it enables lightweight high-quality speech synthesis.

Pitch Stability: By introducing whitened fundamental frequency (f0) information, it enhances pitch consistency without revealing the source speaker's timbre information.
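To illustrate the pitch-stability idea, here is a minimal sketch of per-utterance f0 whitening. This is an assumption-laden toy (the function name, the zero-encoding of unvoiced frames, and the numpy implementation are all illustrative, not taken from the paper): normalizing the f0 track to zero mean and unit variance over voiced frames keeps the pitch contour while discarding the speaker's absolute pitch range, which would otherwise leak identity cues.

```python
import numpy as np

def whiten_f0(f0, voiced_threshold=0.0):
    """Whiten an utterance's f0 track: zero mean, unit variance over
    voiced frames. The contour (relative pitch movement) survives;
    the speaker's absolute pitch range does not."""
    f0 = np.asarray(f0, dtype=np.float32)
    voiced = f0 > voiced_threshold            # unvoiced frames encoded as 0
    mu = f0[voiced].mean()
    sigma = f0[voiced].std() + 1e-8           # guard against flat tracks
    out = np.zeros_like(f0)
    out[voiced] = (f0[voiced] - mu) / sigma   # unvoiced frames stay at 0
    return out

# A rising contour from a low-pitched speaker: after whitening, only the
# rise-and-fall shape remains, not the ~110-125 Hz absolute range.
track = np.array([0.0, 110.0, 115.0, 120.0, 0.0, 125.0])
print(whiten_f0(track))
```

The same contour spoken by a high-pitched speaker (e.g. 220-250 Hz) whitens to nearly identical values, which is exactly why the decoder can condition on it without learning the source speaker's timbre.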


StreamVC draws inspiration from Soft-VC and SoundStream. It uses discrete speech units extracted by the HuBERT model as the prediction targets for the content encoder network, while the content encoder and decoder follow the architecture and training strategy of the SoundStream neural audio codec to achieve high-quality causal audio synthesis.
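The word "causal" above is what makes the synthesis streamable: each output sample may depend only on past and current inputs, never on future ones. As a minimal sketch of that principle (a single-channel numpy toy, not the multi-channel learned layers the actual model uses), a causal 1-D convolution can be implemented by left-padding the input so no look-ahead is needed:

```python
import numpy as np

def causal_conv1d(x, kernel):
    """1-D causal convolution: output y[t] depends only on x[t-k+1..t].
    Left-padding with zeros means the layer never looks ahead, so it can
    run on a live audio stream with no added algorithmic latency."""
    k = len(kernel)
    padded = np.concatenate([np.zeros(k - 1, dtype=x.dtype), x])
    # Flip the kernel so it is applied convolution-style over the window.
    return np.array([padded[t:t + k] @ kernel[::-1] for t in range(len(x))])

x = np.array([1.0, 2.0, 3.0, 4.0])
kernel = np.array([0.5, 0.5])        # 2-tap moving average
print(causal_conv1d(x, kernel))      # y[0] uses only x[0] (and zero padding)
```

A non-causal ("same"-padded) convolution would center the kernel on each sample and thus need future input, forcing the system to buffer ahead; stacking causal layers instead keeps the whole encoder/decoder pipeline streamable.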

StreamVC has been compared with existing approaches on several benchmarks covering naturalness, intelligibility, speaker similarity, and pitch consistency. Experimental results show that StreamVC preserves the pitch consistency of the source speech well and can match the speaker similarity of fine-tuned models.

StreamVC demonstrates that efficient, low-latency voice conversion on mobile devices is entirely feasible. HuBERT-derived soft speech units can be learned by a streamable causal convolutional network, and injecting whitened f0 information into the decoder proves crucial for high-quality output.

Paper Address: https://arxiv.org/pdf/2401.03078