ByteDance recently open-sourced an innovative technology called LatentSync, which is an end-to-end lip-sync framework based on an audio-conditioned latent diffusion model. This technology achieves precise synchronization between lip movements of characters in videos and audio without any intermediate motion representations. Unlike previous lip-sync methods that rely on pixel-space diffusion or two-stage generation, LatentSync directly leverages the powerful capabilities of Stable Diffusion, enabling more effective modeling of complex audiovisual associations.
Research has found that diffusion-based lip-sync methods suffer from poor temporal consistency because the diffusion process is not consistent across frames. To address this, LatentSync introduces a Temporal Representation Alignment (TREPA) technique. TREPA uses temporal representations extracted from large self-supervised video models to align generated frames with ground-truth frames, improving temporal consistency while maintaining lip-sync accuracy.
Additionally, the research team delved into the convergence issues of SyncNet and, through extensive empirical studies, identified key factors affecting SyncNet's convergence, including model architecture, training hyperparameters, and data preprocessing methods. By optimizing these factors, SyncNet's accuracy on the HDTF test set improved significantly from 91% to 94%. Since the overall training framework of SyncNet was not altered, these findings can also be applied to other lip-sync and audio-driven portrait animation methods that utilize SyncNet.
Advantages of LatentSync
End-to-end framework: Generates synchronized lip movements directly from audio without the need for intermediate motion representations.
High-quality generation: Utilizes the powerful capabilities of Stable Diffusion to generate dynamic and realistic speaking videos.
Temporal consistency: Enhances temporal consistency between video frames through the TREPA technique.
SyncNet optimization: Addresses the convergence issues of SyncNet, significantly improving lip-sync accuracy.
How It Works
At its core, LatentSync is based on image-to-image inpainting: it takes masked frames as input so the model knows which region to regenerate. To preserve the visual identity of the face in the original video, the model also takes reference frames as input. These inputs are concatenated along the channel dimension and then processed by a U-Net network.
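The input assembly described above can be sketched as a channel-wise concatenation of the latent being denoised, the masked frame's latent, and the reference frame's latent. The shapes below (4-channel VAE latents at 32×32) are illustrative assumptions, not the exact LatentSync configuration:

```python
import numpy as np

def assemble_unet_input(noisy_latent, masked_latent, reference_latent):
    """Concatenate conditioning latents along the channel axis."""
    assert noisy_latent.shape == masked_latent.shape == reference_latent.shape
    return np.concatenate([noisy_latent, masked_latent, reference_latent], axis=0)

rng = np.random.default_rng(0)
noisy = rng.normal(size=(4, 32, 32))    # latent being denoised
masked = rng.normal(size=(4, 32, 32))   # masked input frame (tells the model what to fill in)
ref = rng.normal(size=(4, 32, 32))      # reference frame (carries facial identity)
x = assemble_unet_input(noisy, masked, ref)
print(x.shape)  # (12, 32, 32)
```

The U-Net's first convolution then simply accepts the widened channel count, which is how Stable-Diffusion-style models are commonly extended with extra image conditioning.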
The model uses the pre-trained Whisper model to extract audio embeddings. Because lip movements can be influenced by the audio of surrounding frames, the model bundles audio from several neighboring frames as input to provide more temporal context. The audio embeddings are injected into the U-Net through cross-attention layers.
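The cross-attention conditioning can be sketched as follows: visual latent tokens form the queries, and the window of audio embeddings supplies keys and values. This is a minimal single-head version with illustrative dimensions, not the model's actual attention implementation:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def audio_cross_attention(visual_tokens, audio_embeds, Wq, Wk, Wv):
    q = visual_tokens @ Wq                            # queries from the visual stream
    k = audio_embeds @ Wk                             # keys from audio
    v = audio_embeds @ Wv                             # values from audio
    attn = softmax(q @ k.T / np.sqrt(q.shape[-1]))    # scaled dot-product attention
    return attn @ v                                   # audio-informed visual features

d = 64
rng = np.random.default_rng(0)
visual = rng.normal(size=(16, d))    # 16 visual latent tokens
audio = rng.normal(size=(10, d))     # embeddings bundled from surrounding frames
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))
out = audio_cross_attention(visual, audio, Wq, Wk, Wv)
print(out.shape)  # (16, 64)
```

Because every visual token can attend to the whole audio window, the lip region can draw on sound slightly before and after the current frame.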
Because SyncNet requires image-space input, the model first predicts in noise space and then obtains an estimated clean latent in a single denoising step. The research also showed that training SyncNet in pixel space outperforms training it in latent space, likely because information about the lip region is lost during VAE encoding.
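The single-step estimate is the standard DDPM identity: given the noisy latent x_t and a predicted noise ε̂, the clean latent is recovered as x̂₀ = (x_t − √(1−ᾱ_t)·ε̂) / √ᾱ_t. This is a generic sketch of that formula; the schedule value used here is an arbitrary assumption:

```python
import numpy as np

def estimate_clean_latent(x_t, eps_pred, alpha_bar_t):
    """One-step clean-latent estimate from a predicted noise (DDPM x0 formula)."""
    return (x_t - np.sqrt(1.0 - alpha_bar_t) * eps_pred) / np.sqrt(alpha_bar_t)

rng = np.random.default_rng(0)
x0 = rng.normal(size=(4, 32, 32))     # true clean latent
eps = rng.normal(size=(4, 32, 32))    # true noise
alpha_bar = 0.5                       # illustrative cumulative schedule value
x_t = np.sqrt(alpha_bar) * x0 + np.sqrt(1 - alpha_bar) * eps  # forward process

# With the true noise as the "prediction", the estimate recovers x0 exactly:
x0_hat = estimate_clean_latent(x_t, eps, alpha_bar)
print(np.allclose(x0_hat, x0))  # True
```

The estimated clean latent can then be decoded by the VAE into pixel space, where the SyncNet loss is computed.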
Training proceeds in two stages. In the first stage, the U-Net learns visual features without decoding to pixel space or applying the SyncNet loss. In the second stage, the latents are decoded to pixel space so the SyncNet loss can be applied there, and an LPIPS loss is added to improve visual quality. To ensure the model learns temporal information correctly, the input noise itself must also be temporally consistent, so a mixed noise scheme is used. In addition, affine transformations are applied during data preprocessing to frontalize faces.
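One common way to realize such a mixed noise scheme, sketched here under assumptions (the blend weight and the scaling are mine, not the paper's exact recipe), is to blend one noise tensor shared across the whole clip with an independent per-frame tensor:

```python
import numpy as np

def mixed_noise(num_frames, shape, alpha=0.5, rng=None):
    """Per-frame noise correlated across the clip via a shared component."""
    rng = rng or np.random.default_rng()
    shared = rng.normal(size=shape)                    # identical for every frame
    per_frame = rng.normal(size=(num_frames, *shape))  # independent per frame
    # sqrt weights keep each frame's noise at (approximately) unit variance
    return np.sqrt(alpha) * shared + np.sqrt(1 - alpha) * per_frame

noise = mixed_noise(8, (4, 32, 32), alpha=0.5, rng=np.random.default_rng(0))
print(noise.shape)  # (8, 4, 32, 32)

# Frames share a component, so their noise is positively correlated:
corr = np.corrcoef(noise[0].ravel(), noise[1].ravel())[0, 1]
print(corr > 0.1)  # True
```

The shared component is what makes the denoising trajectories of neighboring frames start from correlated points, which helps the generated video stay temporally coherent.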
TREPA Technique
TREPA enhances temporal consistency by aligning the temporal representations of generated image sequences with those of real image sequences. This method utilizes a large self-supervised video model, VideoMAE-v2, to extract temporal representations. Unlike methods that only use distance loss between images, temporal representations capture the temporal correlations in image sequences, thereby improving overall temporal consistency. Research has found that TREPA not only does not harm lip-sync accuracy but can actually enhance it.
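The TREPA objective can be sketched as a distance between clip-level representations of the generated and real sequences. In the paper the representations come from a frozen VideoMAE-v2 model; the `extract_repr` below is a hypothetical stand-in (mean-pooling per frame) used only so the sketch is self-contained:

```python
import numpy as np

def extract_repr(frames):
    """Placeholder for a frozen self-supervised video model (VideoMAE-v2 in the paper)."""
    return frames.reshape(frames.shape[0], -1).mean(axis=-1)  # one summary value per frame

def trepa_loss(generated, real):
    """Mean squared distance between temporal representations of the two clips."""
    g, r = extract_repr(generated), extract_repr(real)
    return float(np.mean((g - r) ** 2))

rng = np.random.default_rng(0)
real = rng.normal(size=(8, 3, 64, 64))       # 8-frame reference clip
generated = rng.normal(size=(8, 3, 64, 64))  # 8-frame generated clip

print(trepa_loss(real, real))       # 0.0 for identical clips
print(trepa_loss(generated, real) >= 0.0)  # True
```

The key design choice is that the loss operates on whole-sequence representations rather than frame-by-frame pixel distances, so it penalizes temporal artifacts that per-image losses cannot see.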
SyncNet Convergence Issues
Research has found that SyncNet's training loss tends to hover around 0.69, failing to decrease further. Through extensive experimental analysis, the research team discovered that batch size, input frame count, and data preprocessing methods significantly impact SyncNet's convergence. Model architecture also affects convergence, but to a lesser extent.
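The 0.69 figure is not arbitrary: for a binary cross-entropy objective (in-sync vs. out-of-sync pairs), ln 2 ≈ 0.693 is exactly the loss of a classifier that always outputs probability 0.5, i.e. chance level. A quick check:

```python
import math

def bce(p, y):
    """Binary cross-entropy for predicted probability p and label y in {0, 1}."""
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))

# A classifier stuck at p = 0.5, averaged over balanced labels:
chance = 0.5 * (bce(0.5, 1) + bce(0.5, 0))
print(round(chance, 3))  # 0.693
```

A SyncNet loss hovering at this value therefore means the model has learned nothing about audio-visual synchronization, which is why identifying the factors that break this plateau matters.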
Experimental results show that LatentSync outperforms other state-of-the-art lip-sync methods across multiple metrics. Particularly in terms of lip-sync accuracy, this can be attributed to its optimized SyncNet and audio cross-attention layers, which better capture the relationship between audio and lip movements. Furthermore, with the adoption of the TREPA technique, LatentSync's temporal consistency has also seen significant improvement.
Project address: https://github.com/bytedance/LatentSync