In this era where the digital wave sweeps across the globe, virtual avatars have quietly become an indispensable part of our daily lives.

However, users who frequently generate lip-synced videos from a single image often run into an awkward problem: no matter how realistic the generated character looks, she gives herself away the moment she opens her mouth.

[Image: AI-generated ID-photo-style portrait]

Image Source Note: This image was generated by AI, provided by the image licensing service Midjourney

Simply put, the voice and the visuals are completely disconnected. Anyone can tell that the voice does not belong to her, or at least that it is not the voice you would expect to hear in that context.

Now, this embarrassing problem has finally been solved!

Recently, an innovative technology called LOOPY has emerged, breaking through the limitations of traditional virtual avatar animation and injecting unprecedented vitality into the digital world.

LOOPY is an audio-driven video diffusion model jointly developed by a research team from ByteDance and Zhejiang University. Unlike earlier techniques that rely on auxiliary spatial signals, LOOPY needs only a single reference image and an audio track to bring a virtual avatar to life with strikingly natural motion.
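To make that input/output contract concrete, here is a minimal Python sketch of what an inference call could look like. All names in it (AvatarRequest, generate_avatar_video, model.sample) are hypothetical placeholders for illustration, not the released LOOPY API.

```python
# Hypothetical sketch of LOOPY's input/output contract: a single reference
# image plus an audio clip drive the generated talking-head video.
# Class, function, and method names are illustrative, not the real API.
from dataclasses import dataclass

import numpy as np


@dataclass
class AvatarRequest:
    reference_image: np.ndarray   # H x W x 3, one still frame of the character
    audio_waveform: np.ndarray    # 1-D PCM samples of the driving audio
    sample_rate: int = 16_000     # audio sampling rate in Hz
    fps: int = 25                 # frame rate of the generated video


def generate_avatar_video(model, request: AvatarRequest) -> np.ndarray:
    """Run an audio-driven portrait diffusion model end to end.

    Returns an array of shape (T, H, W, 3), where T follows from the audio
    duration and the requested frame rate. `model` stands in for whatever
    checkpoint or runtime the authors eventually release.
    """
    duration_s = len(request.audio_waveform) / request.sample_rate
    num_frames = int(duration_s * request.fps)
    # Every frame is conditioned on the audio and the single reference image;
    # no skeletons, masks, or other spatial control signals are needed.
    return model.sample(
        image=request.reference_image,
        audio=request.audio_waveform,
        num_frames=num_frames,
    )
```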


The core of this technology lies in its long-term motion information capture module. LOOPY supports a wide range of visual and audio styles and acts like an experienced choreographer, accurately "directing" every subtle movement of the virtual avatar according to the rhythm and emotion of the audio, including non-speech actions such as sighs, emotion-driven eyebrow and eye movements, and natural head motion. A conceptual sketch of the long-term motion idea follows below.
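As a rough illustration only, the sketch below keeps a rolling buffer of previously generated frames that a generator could attend over when producing the next clip. The MotionContext helper and its window size are assumptions made for this example; they do not reproduce the authors' actual module.

```python
# Conceptual toy illustrating long-term motion conditioning: keep a long
# window of past frames so the next clip can follow the established rhythm,
# head sway, and emotional build-up instead of only the most recent frame.
# This is an illustrative sketch, not the authors' architecture.
from collections import deque

import numpy as np


class MotionContext:
    """Rolling buffer of past motion frames used as conditioning."""

    def __init__(self, max_frames: int = 100):
        # At 25 fps, 100 frames covers several seconds of motion history.
        self.buffer = deque(maxlen=max_frames)

    def append_clip(self, frames: np.ndarray) -> None:
        # frames: (T, H, W, 3) clip that was just generated
        for frame in frames:
            self.buffer.append(frame)

    def as_conditioning(self) -> np.ndarray:
        # Stack the history into one tensor a generator could attend over.
        return np.stack(self.buffer) if self.buffer else np.empty((0,))
```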

For example, in one demo video, Taylor's eye and neck movements while speaking align perfectly with expectations. Watching her talk feels natural, as though this is really how she would speak, and the ambient, contextual sound makes the whole scene feel "right."

LOOPY also performs remarkably well with non-realistic characters. Whether it is a singer's subtle expressions, eyebrow and eye movements that shift in sync with emotion, or even a gentle sigh, LOOPY renders them convincingly.

Even more exciting, it can generate diverse motion for the same reference image depending on the audio input, ranging from impassioned to gentle and restrained. This flexibility gives creators virtually limitless room for imagination.

LOOPY has also performed strongly in practice. In tests on multiple real-world datasets, it not only surpasses existing audio-driven portrait diffusion models in naturalness but also produces high-quality, highly realistic results across a variety of complex scenarios.

It is particularly noteworthy that LOOPY excels in handling profile portraits, which will undoubtedly push the expressive power of virtual avatars to new heights.

The emergence of LOOPY opens a new door for the virtual world. It can enhance the user experience in areas such as gaming, film production, and virtual reality, and it gives creators a broader platform for their work. As the technology continues to advance, LOOPY's potential is being explored further, and it may well become a new benchmark for the future development of virtual avatar technology.

Project Address: https://loopyavatar.github.io/