As voice synthesis technology advances rapidly, voice forgery is becoming an increasingly serious problem that threatens user privacy and public security. Zhejiang University's Laboratory of Intelligent System Security and Tsinghua University recently released a joint voice forgery detection framework named "SafeEar."

The framework aims to detect forged audio efficiently while keeping the spoken content private.

SafeEar builds a decoupling model on top of a neural audio codec to separate the acoustic information in speech from the semantic information. Forgery detection then relies solely on the acoustic information, so the detector never accesses the complete content of the audio, which prevents privacy leakage.

The entire framework is divided into four main parts.

First, the front-end decoupling model extracts the target acoustic features from the input speech. Second, the bottleneck layer and confusion layer strengthen resistance to content theft by reducing the dimensionality of the acoustic features and scrambling them. Third, the forgery detector uses a Transformer classifier to decide whether the audio has been forged. Finally, the real-environment enhancement module further improves detection performance by simulating different audio environments.
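As a rough illustration of the scrambling step, the sketch below permutes acoustic feature frames within fixed-size windows along the time axis. The article does not say how SafeEar scrambles its features, so the function name `shuffle_frames`, the `window` parameter, and the windowed-permutation strategy are all assumptions made for illustration. The dimension-reducing bottleneck is omitted.

```python
import random
from typing import List, Optional, Sequence

Frame = List[float]  # one acoustic feature vector per time step


def shuffle_frames(
    frames: Sequence[Frame], window: int, seed: Optional[int] = None
) -> List[Frame]:
    """Randomly permute frames inside consecutive windows of `window` frames.

    Hypothetical helper: every frame is kept unchanged, only the temporal
    order inside each window is scrambled.
    """
    if window < 1:
        raise ValueError("window must be at least 1")
    rng = random.Random(seed)
    shuffled: List[Frame] = []
    for start in range(0, len(frames), window):
        chunk = list(frames[start:start + window])
        rng.shuffle(chunk)  # in-place permutation of this window only
        shuffled.extend(chunk)
    return shuffled


# Toy input: 8 frames with 2 feature dimensions each.
frames = [[float(t), float(t) * 0.5] for t in range(8)]
scrambled = shuffle_frames(frames, window=4, seed=0)
```

The intuition behind this kind of scrambling is that recovering words depends heavily on the order of frames, whereas each individual frame is left untouched for a downstream classifier. Whether SafeEar works exactly this way is not stated in the article.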


Project entry: https://github.com/LetterLiGo/SafeEar?tab=readme-ov-file

Experiments on multiple benchmark datasets show that SafeEar achieves an equal error rate (EER) as low as 2.02%, indicating that it is highly effective at identifying deepfake audio. SafeEar also protects speech content in five languages from being deciphered by machines or human listeners: word error rates on the protected audio are above 93.93%, meaning that attempts to transcribe it get almost every word wrong. Tests further show that attackers cannot recover the protected speech content, which demonstrates the technology's advantage in privacy protection.
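To make the word error rate figure concrete, here is a minimal sketch of the standard definition: the word-level edit distance (substitutions, deletions, and insertions) between a reference transcript and a recognized transcript, divided by the number of reference words. The function name `word_error_rate` and the example sentences are illustrative assumptions, not part of SafeEar's evaluation code.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words processed
    # so far (excluding the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # deleting all i reference words matches an empty hypothesis
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1      # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


# A perfect transcript scores 0.0; a transcript with no correct words
# scores 1.0 or higher.
perfect = word_error_rate("send the report today", "send the report today")
garbled = word_error_rate("send the report today", "blue lamp over")
```

Under this definition, a word error rate near 100% means the recognizer's output shares almost nothing with what was actually said, which is the sense in which the protected audio resists parsing.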

The SafeEar team has also built a dataset of 1.5 million multilingual audio samples covering English, Chinese, German, French, and Italian, providing a rich resource for future research on voice forgery detection.

The introduction of SafeEar not only brings new solutions to the field of voice forgery detection but also paves the way for protecting users' voice privacy.

Key points:

  • 🎤 **Innovative SafeEar Framework**: Detects deepfake audio without leaking voice content, protecting user privacy.
  • 🔍 **Multi-head Self-attention Mechanism**: Enhances the ability to identify deepfake audio without semantic cues, with an equal error rate (EER) as low as 2.02%.
  • 🔒 **Audio Content Protection**: Effectively safeguards audio in multiple languages from being parsed, with word error rates above 93.93%.