AI company Rhymes AI recently open-sourced Allegro, its advanced text-to-video model. Allegro turns simple text descriptions into high-quality short video clips, opening up new possibilities for creators, developers, and researchers working on AI-generated video.
Given a text prompt, Allegro generates a 6-second video at 720p resolution and 15 frames per second. It covers a wide range of cinematic subjects, from close-ups of people and animals to action scenes in varied settings, and can render almost any scene that can be described in text.
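To put those figures in perspective, the short sketch below works out the nominal frame count and the raw data volume of one clip. It assumes that "720p" means 1280x720 RGB frames; the function names are illustrative, and the frame count the released model actually produces may differ slightly from this nominal value.

```python
# Nominal clip dimensions taken from the figures quoted in the article.
WIDTH, HEIGHT = 1280, 720  # assumption: "720p" means 1280x720
FPS = 15
DURATION_S = 6


def nominal_frames(fps: int, seconds: int) -> int:
    """Number of frames in a clip of the given length and frame rate."""
    return fps * seconds


def raw_values(frames: int, height: int, width: int, channels: int = 3) -> int:
    """Count of raw colour values in an uncompressed clip (assumes RGB)."""
    return frames * height * width * channels


frames = nominal_frames(FPS, DURATION_S)
print(f"{frames} frames, {raw_values(frames, HEIGHT, WIDTH):,} raw values")
```

Even a 6-second clip contains hundreds of millions of raw values, which is why the model works on compressed visual tokens rather than on pixels.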
The core technologies of Allegro include large-scale video data processing, compressing raw videos into visual tokens, and an extended video diffusion Transformer.
For large-scale video data processing, Rhymes AI designed a systematic processing and filtering pipeline that extracts training videos from raw data. It also built a structured data system that classifies and clusters the data along multiple dimensions, which makes model training and fine-tuning easier.
To compress videos into visual tokens, Allegro uses a Video Variational Autoencoder (VideoVAE) that encodes raw video into a much smaller set of visual tokens while retaining the necessary detail, which makes video generation smoother and more efficient. VideoVAE is built on a pre-trained image VAE and extended with spatiotemporal modeling layers, so it can reuse the image VAE's spatial compression capability.
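As a rough illustration of what this compression means for tensor sizes, the sketch below computes the latent grid a video VAE would produce. The compression factors (4x in time, 8x in each spatial dimension) are illustrative assumptions and are not stated in the article; the function names are hypothetical as well.

```python
import math


def latent_grid(frames, height, width, t_factor=4, s_factor=8):
    """Latent (time, height, width) grid after compression.

    t_factor and s_factor are assumed values chosen for illustration.
    Time is rounded up so a partial group of frames still gets a slot.
    """
    return (
        math.ceil(frames / t_factor),
        height // s_factor,
        width // s_factor,
    )


def compression_ratio(frames, height, width, grid):
    """Pixel positions in the input divided by positions in the latent grid."""
    t, h, w = grid
    return (frames * height * width) / (t * h * w)


grid = latent_grid(90, 720, 1280)
print("latent grid:", grid)
print("positions per clip:", grid[0] * grid[1] * grid[2])
print("compression ratio: %.1f" % compression_ratio(90, 720, 1280, grid))
```

Under these assumed factors, the diffusion model sees a grid a couple of hundred times smaller than the pixel grid, which is what makes attention over a whole clip tractable.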
The core of Allegro is its extended diffusion Transformer, which applies diffusion modeling to generate high-resolution video frames while preserving the quality and fluidity of motion. The backbone is built on the DiT (Diffusion Transformer) architecture and uses 3D RoPE position embeddings together with a 3D full attention mechanism. Compared with traditional diffusion models based on a UNet architecture, a Transformer is easier to scale. With 3D attention, the DiT jointly processes the spatial dimensions of video frames and their evolution over time, which gives it a better grasp of motion and context.
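The idea behind 3D rotary position embeddings can be sketched in a few lines of plain Python. This is a minimal illustration, not Allegro's implementation: it assumes the feature vector is split into three equal chunks, one each for the time, height, and width coordinates, and that each chunk gets standard 1D RoPE with base 10000. All function names here are hypothetical.

```python
import math


def rope_1d(vec, pos, base=10000.0):
    """Rotate consecutive pairs of `vec` by angles proportional to `pos`."""
    half = len(vec) // 2
    out = []
    for i in range(half):
        theta = pos * base ** (-i / half)
        x, y = vec[2 * i], vec[2 * i + 1]
        out.append(x * math.cos(theta) - y * math.sin(theta))
        out.append(x * math.sin(theta) + y * math.cos(theta))
    return out


def rope_3d(vec, t, h, w):
    """Apply 1D RoPE per axis: first chunk by t, second by h, third by w."""
    n = len(vec) // 3
    if 3 * n != len(vec) or n % 2:
        raise ValueError("length must be divisible by 3 with even chunks")
    return (
        rope_1d(vec[:n], t)
        + rope_1d(vec[n:2 * n], h)
        + rope_1d(vec[2 * n:], w)
    )


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def token_positions(T, H, W):
    """Flatten a (T, H, W) latent grid into one sequence of 3D coordinates.

    In full 3D attention every token attends to all T * H * W tokens.
    """
    return [(t, h, w) for t in range(T) for h in range(H) for w in range(W)]


q = [0.1 * i for i in range(12)]
k = [0.05 * (i + 1) for i in range(12)]

# Two query/key pairs with the same relative offset (1, 2, 3).
score_a = dot(rope_3d(q, 1, 2, 3), rope_3d(k, 0, 0, 0))
score_b = dot(rope_3d(q, 4, 5, 6), rope_3d(k, 3, 3, 3))
print(abs(score_a - score_b) < 1e-9)
print(len(token_positions(2, 3, 4)), "tokens attend to each other")
```

The attention score between a query and a key depends only on their relative offset along each of the three axes, so the same motion pattern is scored the same way wherever it occurs in the clip.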
Rhymes AI states that Allegro is just the beginning, and the team is actively developing more advanced features, including image-to-video generation, motion control, and support for longer, narrative-based, storyboard-style video generation.
To make AI-driven video creation accessible to a wider audience, Rhymes AI has open-sourced Allegro's model weights and code. The company encourages the community to explore the model, experiment creatively, and build on it, with the aim of advancing AI-generated video technology collaboratively.
Project link: https://github.com/rhymes-ai/Allegro