Researchers have recently introduced a new technique called JoyVASA, aimed at improving audio-driven image animation. With the continued development of deep learning and diffusion models, audio-driven portrait animation has made significant progress in video quality and lip-sync accuracy. However, the growing complexity of existing models raises training and inference costs, while also limiting video duration and inter-frame continuity.

JoyVASA employs a two-stage design. The first stage introduces a decoupled facial representation framework that separates dynamic facial expressions from static three-dimensional facial representations.
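
To make the decoupling concrete, here is a minimal Python sketch of the idea. All class and field names are hypothetical illustrations of the concept, not JoyVASA's actual interfaces:

```python
# A minimal sketch of a decoupled facial representation, assuming a
# keypoint-based formulation. All names and shapes here are hypothetical.
from dataclasses import dataclass
import numpy as np

@dataclass
class StaticFace3D:
    """Identity-specific, static 3D representation: *who* the face is."""
    appearance_features: np.ndarray  # e.g. per-vertex appearance, shape (V, C)
    canonical_keypoints: np.ndarray  # neutral-pose 3D keypoints, shape (K, 3)

@dataclass
class MotionFrame:
    """Identity-agnostic dynamics for one frame: *how* the face moves."""
    expression_offsets: np.ndarray   # keypoint deformations, shape (K, 3)
    head_pose: np.ndarray            # rotation + translation, shape (6,)

def drive(face: StaticFace3D, motion: list[MotionFrame]) -> list[np.ndarray]:
    """Apply a motion sequence to any static face. Because the motion
    carries no identity information, the same sequence can animate
    arbitrary portraits. (Head-pose application omitted for brevity.)"""
    frames = []
    for m in motion:
        driven_keypoints = face.canonical_keypoints + m.expression_offsets
        frames.append(driven_keypoints)  # a renderer turns these into pixels
    return frames
```

The key design point is that `MotionFrame` holds no identity information, so a motion sequence of any length can be paired with any `StaticFace3D`, which is what enables longer videos and cross-identity animation.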

This separation allows the system to combine any static three-dimensional facial model with dynamic motion sequences, enabling the generation of longer animated videos. In the second stage, the research team trains a diffusion transformer that generates motion sequences directly from audio cues, independent of character identity. Finally, using the generator trained in the first stage, the system renders high-quality animations, taking the three-dimensional facial representation and the generated motion sequences as inputs.
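
The sketch below illustrates the second-stage idea under stated assumptions: a toy MLP stands in for the diffusion transformer, and a simplified reverse-diffusion loop stands in for the real sampler. The dimensions, model, and update rule are all hypothetical, chosen only to keep the example self-contained and runnable:

```python
# A hedged sketch of audio-conditioned motion generation via denoising.
# Everything here (the tiny model, the feature sizes, the update rule) is
# a hypothetical stand-in, not JoyVASA's actual architecture or scheduler.
import torch
import torch.nn as nn

T, AUDIO_DIM, MOTION_DIM, STEPS = 100, 32, 16, 50  # assumed toy dimensions

class MotionDenoiser(nn.Module):
    """Stand-in for the diffusion transformer: predicts the noise on the
    motion sequence given the timestep and per-frame audio conditioning."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(MOTION_DIM + AUDIO_DIM + 1, 128),
            nn.ReLU(),
            nn.Linear(128, MOTION_DIM),
        )

    def forward(self, motion, timestep, audio):
        t = torch.full((motion.shape[0], 1), float(timestep) / STEPS)
        return self.net(torch.cat([motion, audio, t], dim=-1))

model = MotionDenoiser()
audio_feats = torch.randn(T, AUDIO_DIM)  # per-frame audio features
motion = torch.randn(T, MOTION_DIM)      # start from pure noise

# Simplified reverse-diffusion loop: each step removes a fraction of the
# predicted noise. Note that audio, not any face image, drives the motion,
# which is what makes the generated sequence identity-independent.
with torch.no_grad():
    for step in reversed(range(STEPS)):
        pred_noise = model(motion, step, audio_feats)
        motion = motion - pred_noise / STEPS

# `motion` now holds T frames of identity-agnostic facial dynamics; the
# first-stage renderer would combine them with a static 3D face for video.
```

Because the denoiser conditions only on audio and the diffusion timestep, the same trained generator can drive any face produced by the first stage.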


It is worth noting that JoyVASA is not limited to human portraits; it can also seamlessly animate animal faces. The model is trained on a mixed dataset combining private Chinese data with public English data, giving it solid multilingual support. Experimental results validate the effectiveness of the method, and future research will focus on improving real-time performance and refining expression control, further broadening the framework's applications in image animation.

The emergence of JoyVASA marks a significant breakthrough in audio-driven animation technology, opening up new possibilities in the animation field.

Key Points:

🎨 JoyVASA generates longer animated videos by decoupling dynamic facial expressions from static three-dimensional facial representations.  

🔊 The technology generates motion sequences directly from audio cues and can animate both human and animal faces.  

🌐 JoyVASA is trained on Chinese and English datasets, providing multilingual support to serve global users.