Recently, ByteDance developed an AI model named PersonaTalk that can dub videos with accurate lip synchronization while closely preserving the speaker's personal speaking style.
PersonaTalk is an attention-based two-stage framework consisting of geometry construction and face rendering. In the first stage, it extracts the speaker's facial geometry coefficients from the reference video with a hybrid geometry estimation method. It then encodes features from the target audio and learns a personalized speaking style from the statistics of the reference geometry, injecting that style into the audio features. Finally, based on the reference geometry coefficients and the style-aware audio features, it generates target geometry that is lip-synced to the target audio while retaining the personalized speaking style.
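To make the first stage more concrete, here is a minimal PyTorch sketch of the idea described above: audio features attend to statistics of the reference geometry (the "speaking style") via cross-attention, and the resulting style-aware audio then queries the reference geometry to predict per-frame target geometry. All module names, dimensions, and the exact fusion scheme are illustrative assumptions, not PersonaTalk's actual implementation.

```python
# Sketch of stage 1: style injection into audio features via cross-attention.
# Names, dimensions, and fusion details are assumptions for illustration only.
import torch
import torch.nn as nn


class StyleAwareAudioEncoder(nn.Module):
    def __init__(self, dim: int = 256, heads: int = 4):
        super().__init__()
        # Cross-attention: audio queries, style tokens (geometry statistics) as keys/values.
        self.style_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        # Cross-attention: style-aware audio queries the reference geometry tokens.
        self.geom_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.to_coeff = nn.Linear(dim, 64)  # 64 = assumed geometry-coefficient size

    def forward(self, audio_feat, style_tokens, ref_geom_tokens):
        # audio_feat:      (B, T, dim)  encoded target-audio features
        # style_tokens:    (B, S, dim)  statistics of the reference geometry (speaking style)
        # ref_geom_tokens: (B, R, dim)  encoded reference geometry coefficients
        styled_audio, _ = self.style_attn(audio_feat, style_tokens, style_tokens)
        fused, _ = self.geom_attn(styled_audio, ref_geom_tokens, ref_geom_tokens)
        return self.to_coeff(fused)  # (B, T, 64) lip-synced target geometry coefficients


# Toy usage with random tensors, just to show the shapes flowing through.
enc = StyleAwareAudioEncoder()
out = enc(torch.randn(2, 100, 256), torch.randn(2, 8, 256), torch.randn(2, 50, 256))
print(out.shape)  # torch.Size([2, 100, 64])
```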
In the second stage, a dual-attention face renderer synthesizes the target speaker's face under the guidance of the target geometry, using a carefully designed reference selection strategy to produce a lip-synced result.
The model achieves highly personalized dubbing by learning the speaker's speaking style from the reference video and applying it when dubbing the target audio. In addition, the dual-attention face renderer samples textures for the lips and the rest of the face separately, which better preserves facial details and eliminates common artifacts such as flickering or stuck teeth.
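The dual-attention rendering idea can be sketched as two parallel attention branches, one sampling lip texture from mouth-region references and one sampling texture for the rest of the face, merged with a mouth mask. The following PyTorch snippet is a simplified illustration under those assumptions; the module names, feature sizes, and merging details are hypothetical.

```python
# Sketch of the dual-attention face renderer: separate texture sampling for the
# lip region and the rest of the face. All details are illustrative assumptions.
import torch
import torch.nn as nn


class DualAttentionRenderer(nn.Module):
    def __init__(self, dim: int = 256, heads: int = 4):
        super().__init__()
        self.lip_attn = nn.MultiheadAttention(dim, heads, batch_first=True)   # lip texture branch
        self.face_attn = nn.MultiheadAttention(dim, heads, batch_first=True)  # face texture branch
        self.to_rgb = nn.Linear(dim, 3)  # simplified "decoder" mapping tokens to RGB

    def forward(self, target_geom, lip_refs, face_refs, mouth_mask):
        # target_geom: (B, N, dim) tokens derived from the target geometry
        # lip_refs:    (B, L, dim) reference tokens selected for the mouth region
        # face_refs:   (B, F, dim) reference tokens selected for the rest of the face
        # mouth_mask:  (B, N, 1)   1 inside the mouth region, 0 elsewhere
        lip_tex, _ = self.lip_attn(target_geom, lip_refs, lip_refs)
        face_tex, _ = self.face_attn(target_geom, face_refs, face_refs)
        # Merge: the mouth region takes the lip branch, everything else the face branch.
        merged = mouth_mask * lip_tex + (1.0 - mouth_mask) * face_tex
        return self.to_rgb(merged)  # (B, N, 3)


renderer = DualAttentionRenderer()
img_tokens = renderer(
    torch.randn(1, 1024, 256),        # target geometry tokens
    torch.randn(1, 512, 256),         # lip-region reference tokens
    torch.randn(1, 512, 256),         # face-region reference tokens
    torch.rand(1, 1024, 1).round(),   # binary mouth mask
)
print(img_tokens.shape)  # torch.Size([1, 1024, 3])
```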
Experimental results show that, compared with other state-of-the-art models, PersonaTalk has clear advantages in visual quality, lip-sync accuracy, and persona preservation. Moreover, as a general framework, PersonaTalk achieves performance comparable to speaker-specific models without any fine-tuning.
Although PersonaTalk performs well on dubbing human face videos, limitations of its training data mean its performance may degrade when driving non-human avatars (such as cartoon characters), and it may produce artifacts under large head poses.
To prevent the misuse of this technology, ByteDance plans to restrict access to the core model to research institutions.
Project link: https://grisoon.github.io/PersonaTalk/