Voice synthesis and voice conversion technologies are advancing quickly and now produce highly realistic, natural-sounding audio. These advances also carry security risks: voice cloning in particular can be exploited by malicious actors to threaten personal privacy and social stability.
To address this challenge, Zhejiang University's Intelligent System Security Lab and Tsinghua University have jointly developed SafeEar, a voice forgery detection framework. SafeEar detects forged audio while also protecting the privacy of the speaker's content during detection, so it provides security and privacy at the same time.
At the core of SafeEar is a decoupling model built on a neural audio codec. The model separates the acoustic features of a voice from its semantic information, and only the acoustic features are used for forgery detection. Because the detector never sees the semantic information, the spoken content is not leaked during detection, while the acoustic cues needed to identify forgeries are retained.
The framework consists of a front-end decoupling model, a bottleneck layer, an obfuscation layer, a forgery detector, and a real-environment enhancement module. Working together, these modules detect a range of forgery techniques with an equal error rate (EER) as low as 2.02%, close to that of state-of-the-art detectors. Experiments also show that attackers cannot recover the original speech content from the acoustic information, which supports SafeEar's privacy protection claims.
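Error rates for a detector like this are read off at a decision threshold. The following is a minimal sketch, not SafeEar's evaluation code, of how a false alarm rate, a miss rate, and the equal error rate (the operating point where the two are equal) can be computed from detector scores. The score convention (higher means "more likely forged"), the function names, and the toy scores are all assumptions made for illustration.

```python
import numpy as np

def error_rates(bona_fide_scores, forged_scores, threshold):
    """False alarm rate and miss rate at one threshold.

    Assumed convention: a score >= threshold is flagged as forged.
    """
    bona_fide = np.asarray(bona_fide_scores, dtype=float)
    forged = np.asarray(forged_scores, dtype=float)
    false_alarm = float(np.mean(bona_fide >= threshold))  # genuine flagged as forged
    miss = float(np.mean(forged < threshold))             # forged accepted as genuine
    return false_alarm, miss

def equal_error_rate(bona_fide_scores, forged_scores):
    """Approximate the EER by trying every observed score as a threshold."""
    thresholds = np.unique(np.concatenate([bona_fide_scores, forged_scores]))
    best = None
    for t in thresholds:
        fa, miss = error_rates(bona_fide_scores, forged_scores, t)
        gap = abs(fa - miss)
        if best is None or gap < best[0]:
            # Report the midpoint of the two rates at the closest crossing.
            best = (gap, (fa + miss) / 2.0, float(t))
    return best[1], best[2]

# Toy scores, invented for illustration only.
genuine = np.array([0.05, 0.10, 0.20, 0.30, 0.60])
forged = np.array([0.40, 0.70, 0.80, 0.90, 0.95])
eer, thr = equal_error_rate(genuine, forged)
print(f"EER = {eer:.2f} at threshold {thr:.2f}")
```

With only a handful of scores the scan is coarse; on a real evaluation set the same idea is usually applied to an interpolated error curve.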
SafeEar's front-end module is the decoupling model, which separates acoustic from semantic information as it decomposes and reconstructs voice features. The bottleneck and obfuscation layers then add further protection through dimensionality reduction and random obfuscation, so that even advanced speech recognition models cannot extract the original content.
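To make the two protective steps concrete, here is a rough sketch under stated assumptions, not the authors' implementation. Dimensionality reduction is modeled as a linear projection, and random obfuscation is modeled as shuffling frames inside short windows, which is only one possible reading of the description above. The function names, the window size, and the random projection matrix are hypothetical.

```python
import numpy as np

def bottleneck(features, projection):
    """Reduce the feature dimension with a linear projection (frames x dim)."""
    return features @ projection

def shuffle_frames(features, window, rng):
    """Randomly permute frames inside each consecutive block of `window` frames."""
    out = features.copy()
    for start in range(0, len(features), window):
        stop = min(start + window, len(features))
        perm = rng.permutation(stop - start) + start
        out[start:stop] = features[perm]
    return out

rng = np.random.default_rng(0)
acoustic = rng.normal(size=(12, 8))    # toy data: 12 frames of 8-dim acoustic features
projection = rng.normal(size=(8, 4))   # hypothetical; a real system would learn this
reduced = bottleneck(acoustic, projection)
obfuscated = shuffle_frames(reduced, window=4, rng=rng)
```

Shuffling within a window keeps the per-frame acoustic values intact but disturbs their temporal order, which is the property a content-recovery attacker would rely on.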
For forgery detection, SafeEar uses a Transformer classifier that takes only the acoustic features as input. In addition, it passes training audio through multiple audio codecs to simulate different real-world conditions, which improves the model's robustness across environments.
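Codec-based augmentation of this kind can be illustrated with a small sketch. The article does not say which codecs are used, so the example below substitutes a simple 8-bit mu-law round trip (the companding scheme behind telephone-style audio) as a stand-in for a real codec. The function names and the probability parameter `p` are assumptions.

```python
import numpy as np

def mu_law_roundtrip(wave, mu=255):
    """Compress to 8-bit mu-law and expand back, adding codec-like distortion."""
    wave = np.clip(wave, -1.0, 1.0)
    compressed = np.sign(wave) * np.log1p(mu * np.abs(wave)) / np.log1p(mu)
    # Quantize the compressed signal to mu + 1 levels, then map back to [-1, 1].
    quantized = np.round((compressed + 1.0) / 2.0 * mu) / mu * 2.0 - 1.0
    return np.sign(quantized) * np.expm1(np.abs(quantized) * np.log1p(mu)) / mu

def augment(wave, rng, p=0.5):
    """Apply the simulated codec with probability p, else return the input."""
    return mu_law_roundtrip(wave) if rng.random() < p else wave

rng = np.random.default_rng(0)
wave = 0.8 * np.sin(np.linspace(0.0, 2.0 * np.pi * 5.0, 1600))  # toy waveform
degraded = augment(wave, rng, p=1.0)
```

Training on a mix of clean and degraded copies is one common way to expose a detector to channel effects it would meet in deployment.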
In the reported experiments, SafeEar outperforms many traditional detection methods. It can also protect users' voice privacy in real time in practical applications, which supports the secure development of intelligent voice services.
As part of this work, the Zhejiang University and Tsinghua University team also built an audio dataset covering multiple languages and vocoders. The dataset provides a foundation for future research and applications, so that users can enjoy convenient voice services with better privacy protection.
SafeEar offers a practical tool for addressing privacy challenges in the AI era, letting users benefit from voice technology while keeping their privacy protected.
Paper Link: https://safeearweb.github.io/Project/files/SafeEar_CCS2024.pdf