Jointly Developed by Zhejiang University and Tsinghua University! Voice Forgery Detection Framework SafeEar Achieves an Error Rate as Low as 2.02%

AIbase基地

Published in AI News · 4 min read · Sep 26, 2024

As voice synthesis technology advances rapidly, voice forgery is becoming an increasingly serious problem that threatens user privacy and public security. Recently, Zhejiang University's Laboratory of Intelligent System Security and Tsinghua University jointly released a new voice forgery detection framework named "SafeEar."

The framework aims to detect forged speech efficiently while keeping the content of that speech private.

SafeEar's core idea is a decoupling model built on a neural audio codec that separates the acoustic information of speech from its semantic information. The detector relies solely on the acoustic information and never accesses the complete content of the audio, which prevents privacy leakage.

The framework consists of four main parts.

First, the front-end decoupling model extracts the target acoustic features from the input speech; second, the bottleneck layer and confusion layer strengthen resistance to content theft by reducing the dimensionality of the acoustic features and scrambling them; third, the forgery detector uses a Transformer classifier to decide whether the audio has been forged; finally, the real-environment enhancement module further improves detection by simulating different audio environments.
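The scrambling step can be pictured with a small sketch. The code below is only an illustration under stated assumptions, not SafeEar's implementation: the function name `shuffle_frames`, the window size, and the choice to shuffle frames inside fixed windows are all hypothetical. It shows how reordering feature frames disturbs the temporal order that a listener or a speech recognizer would need to recover words, while every acoustic frame is still available to a detector.

```python
import random


def shuffle_frames(frames, window=4, seed=0):
    """Shuffle feature frames inside consecutive fixed-size windows.

    `frames` is a list of per-frame feature vectors. The window size and the
    windowed strategy are illustrative assumptions, not SafeEar's settings.
    """
    rng = random.Random(seed)
    shuffled = []
    for start in range(0, len(frames), window):
        block = frames[start:start + window]  # slicing copies, input is untouched
        rng.shuffle(block)                    # reorder frames within this window only
        shuffled.extend(block)
    return shuffled


# Toy input: 8 frames, each a 2-dimensional "acoustic" feature vector.
frames = [[float(i), float(i) * 0.5] for i in range(8)]
scrambled = shuffle_frames(frames, window=4, seed=0)
```

Shuffling within windows keeps every frame close to its original position, so local acoustic cues remain nearby while the exact frame order is lost.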

Project entry: https://github.com/LetterLiGo/SafeEar?tab=readme-ov-file

Experiments on multiple benchmark datasets show that SafeEar's equal error rate (EER) is as low as 2.02%, which indicates that it is highly effective at identifying deepfake audio. Moreover, SafeEar can protect audio content in five languages, making it hard for either machines or human listeners to decipher, with a word error rate as high as 93.93%. Tests also show that attackers cannot recover the protected voice content, which demonstrates the technology's advantage in privacy protection.
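For readers unfamiliar with the two metrics, here is a minimal sketch of how they are commonly computed. It is not the SafeEar evaluation code; the function names and the toy scores are hypothetical. Detection error for systems like this is usually reported as the equal error rate, the operating point where the false acceptance rate and the false rejection rate coincide. The word error rate is the word-level edit distance between a reference transcript and a recognized one, divided by the reference length, so a high value means the content could not be recovered.

```python
def equal_error_rate(genuine_scores, spoof_scores):
    """Approximate the EER by sweeping thresholds over all observed scores.

    Assumes higher scores mean "more likely genuine".
    """
    thresholds = sorted(set(genuine_scores) | set(spoof_scores))
    best_gap, eer = float("inf"), None
    for t in thresholds:
        # False rejection: genuine audio scored below the threshold.
        frr = sum(s < t for s in genuine_scores) / len(genuine_scores)
        # False acceptance: spoofed audio scored at or above the threshold.
        far = sum(s >= t for s in spoof_scores) / len(spoof_scores)
        gap = abs(far - frr)
        if gap < best_gap:
            best_gap, eer = gap, (far + frr) / 2
    return eer


def word_error_rate(reference, hypothesis):
    """Word-level Levenshtein distance divided by the reference length.

    Assumes a non-empty reference transcript.
    """
    ref, hyp = reference.split(), hypothesis.split()
    prev = list(range(len(hyp) + 1))  # distances from an empty reference prefix
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1] / len(ref)


# Toy detector scores, invented for illustration only.
eer = equal_error_rate([0.9, 0.8, 0.7, 0.4], [0.1, 0.2, 0.3, 0.6])
wer = word_error_rate("the cat sat on mat", "the cat sat")
```

A low EER means the detector rarely confuses genuine and forged audio, while a WER near 100% on the protected features means a transcriber gets almost every word wrong.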

Furthermore, the SafeEar team has built a dataset of 1.5 million multilingual audio samples, covering English, Chinese, German, French, and Italian, among others, which provides a rich resource for future research on voice forgery detection.

The introduction of SafeEar not only brings new solutions to the field of voice forgery detection but also paves the way for protecting users' voice privacy.

Key points:
🎤 **Innovative SafeEar Framework**: Detects deepfake audio without leaking voice content, protecting user privacy.
🔍 **Multi-head Self-attention Mechanism**: Enhances the ability to identify deepfake audio without semantic cues, with an error rate as low as 2.02%.
🔒 **Audio Content Protection**: Effectively safeguards audio in multiple languages from being parsed, with a word error rate as high as 93.93%.

This article is from AIbase Daily
