Tongyi recently announced the open-sourcing of its latest large-scale video generation model, Wanxiang Wan2.1. Wan2.1 is an AI model specializing in high-quality video generation. Its strength in handling complex motion, accurately reflecting real-world physics, delivering cinematic visual quality, and following instructions closely makes it a leading choice for creators, developers, and businesses embracing the AI era.

In the authoritative VBench benchmark, Tongyi Wanxiang Wan2.1 achieved a top score of 86.22%, significantly outperforming other well-known video generation models both in China and abroad, including Sora, MiniMax, Luma, Gen-3, and Pika. This result is attributed to Wan2.1's use of the mainstream DiT architecture and the linear-noise-trajectory Flow Matching paradigm, together with a series of technical innovations that substantially improve its generation capability. In particular, a self-developed, highly efficient 3D causal VAE module achieves 256x lossless compression of the video latent space and, through a feature caching mechanism, supports efficient encoding and decoding of videos of arbitrary length while reducing inference memory usage by 29%. On a single A800 GPU, the model reconstructs video 2.5 times faster than existing state-of-the-art methods, a significant performance advantage.
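To make the feature-caching idea concrete, the sketch below shows one common way a causal video encoder can process an arbitrarily long clip in temporal chunks, carrying a small cache of trailing frames so chunk boundaries see the same causal context as a single full-length pass while peak memory stays bounded. All class and parameter names are illustrative assumptions, not Wan2.1's actual implementation, and the single causal convolution stands in for the full 3D VAE encoder (it omits the spatial/temporal downsampling that yields the 256x compression).

```python
import torch
import torch.nn as nn

class CausalEncoderWithCache(nn.Module):
    """Illustrative causal 3D encoder: a long video is processed in temporal
    chunks, and the trailing frames of each chunk are cached so the next
    chunk sees the same causal context as a full-length pass would."""

    def __init__(self, in_ch=3, latent_ch=16, kernel_t=3):
        super().__init__()
        self.kernel_t = kernel_t
        # One causal 3D conv stands in for the full encoder stack.
        self.conv = nn.Conv3d(in_ch, latent_ch, kernel_size=(kernel_t, 3, 3),
                              padding=(0, 1, 1))

    def encode_chunk(self, frames, cache):
        # Prepend cached frames (or zeros for the first chunk) so the temporal
        # convolution only ever looks at past frames: causal behaviour.
        if cache is None:
            cache = torch.zeros_like(frames[:, :, : self.kernel_t - 1])
        x = torch.cat([cache, frames], dim=2)
        latents = self.conv(x)
        # Keep only the last (kernel_t - 1) raw frames as the new cache.
        new_cache = frames[:, :, -(self.kernel_t - 1):]
        return latents, new_cache

    def forward(self, video, chunk_size=8):
        cache, outputs = None, []
        # Stream over time: peak memory depends on chunk_size, not clip length.
        for start in range(0, video.shape[2], chunk_size):
            chunk = video[:, :, start:start + chunk_size]
            latents, cache = self.encode_chunk(chunk, cache)
            outputs.append(latents)
        return torch.cat(outputs, dim=2)

# Usage: a 3-channel, 32-frame, 128x128 clip encoded chunk by chunk.
video = torch.randn(1, 3, 32, 128, 128)
latents = CausalEncoderWithCache()(video, chunk_size=8)
```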

Wan2.1's video Diffusion Transformer architecture models long-range spatiotemporal dependencies through a full attention mechanism, generating high-quality videos with strong temporal consistency. Its training strategy follows a six-stage progressive schedule, moving gradually from pre-training on low-resolution image data to training on high-resolution video data and finally fine-tuning on high-quality annotated data, which ensures strong performance across resolutions and complex scenarios. For data processing, Wan2.1 applies a four-step cleaning pipeline covering basic attributes, visual quality, and motion quality, filtering a noisy initial dataset down to high-quality, diverse training data.
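"Full attention" here means every spatiotemporal token attends to every other token in the clip, rather than factorizing attention into separate spatial and temporal passes. The minimal sketch below illustrates that idea with standard PyTorch primitives; the class name, dimensions, and shapes are assumptions for illustration, not Wan2.1's actual layer.

```python
import torch
import torch.nn.functional as F
from torch import nn

class FullSpatioTemporalAttention(nn.Module):
    """Illustrative DiT-style attention block: tokens from all frames are
    flattened into one sequence, so each token can attend across both space
    and time, capturing long-range spatiotemporal dependencies."""

    def __init__(self, dim=512, heads=8):
        super().__init__()
        self.heads = heads
        self.qkv = nn.Linear(dim, dim * 3)
        self.proj = nn.Linear(dim, dim)

    def forward(self, x):
        # x: (batch, frames, height, width, dim) video latent tokens.
        b, t, h, w, d = x.shape
        tokens = x.reshape(b, t * h * w, d)  # one long spatiotemporal sequence

        def split_heads(z):
            # (batch, seq, dim) -> (batch, heads, seq, head_dim)
            return z.view(b, -1, self.heads, d // self.heads).transpose(1, 2)

        q, k, v = map(split_heads, self.qkv(tokens).chunk(3, dim=-1))
        # Full (non-factorized) attention over every spatiotemporal token.
        out = F.scaled_dot_product_attention(q, k, v)
        out = out.transpose(1, 2).reshape(b, t * h * w, d)
        return self.proj(out).view(b, t, h, w, d)

# Usage: 4 latent frames of 16x16 tokens each -> a 1024-token sequence.
x = torch.randn(1, 4, 16, 16, 512)
y = FullSpatioTemporalAttention()(x)
```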

For training and inference efficiency, Wan2.1 employs several optimization strategies. During training, the text encoder, video encoder, and DiT modules each use a distributed strategy suited to their size, and efficient strategy switching avoids redundant computation. For memory, a layered optimization strategy works together with PyTorch's memory management mechanisms to mitigate fragmentation. During inference, FSDP is combined with 2D context parallelism (CP) for multi-GPU distributed acceleration, and quantization is applied to further improve performance.
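The sketch below shows only the FSDP piece of this setup, combined with one of PyTorch's allocator settings for reducing fragmentation; it does not cover 2D context parallelism or quantization. The stand-in transformer and launch details are assumptions for illustration, not Wan2.1's actual code.

```python
import os
import torch
import torch.distributed as dist
from torch import nn
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP

# Mitigate allocator fragmentation (one of PyTorch's memory-management knobs);
# must be set before the first CUDA allocation.
os.environ.setdefault("PYTORCH_CUDA_ALLOC_CONF", "expandable_segments:True")

def build_sharded_model():
    # Stand-in for a DiT-style transformer; the real model is far larger.
    model = nn.TransformerEncoder(
        nn.TransformerEncoderLayer(d_model=512, nhead=8, batch_first=True),
        num_layers=4,
    ).cuda()
    # FSDP shards parameters, gradients, and optimizer state across ranks,
    # so each GPU only holds a slice of the model at any time.
    return FSDP(model)

if __name__ == "__main__":
    # Launch with: torchrun --nproc_per_node=<num_gpus> this_script.py
    dist.init_process_group("nccl")
    torch.cuda.set_device(dist.get_rank() % torch.cuda.device_count())
    sharded = build_sharded_model()
    tokens = torch.randn(2, 128, 512, device="cuda")
    out = sharded(tokens)
    dist.destroy_process_group()
```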

Tongyi Wanxiang Wan2.1 is now open-sourced on GitHub, Hugging Face, and ModelScope, with support for a range of mainstream frameworks. Developers and researchers can try it quickly via Gradio, or use xDiT for parallel inference acceleration to improve efficiency. The model is also being rapidly integrated into Diffusers and ComfyUI to enable one-click inference and deployment, lowering the barrier to development and giving users flexible options for both rapid prototyping and efficient production deployment.
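As a rough sense of what a Gradio quick-start might look like, here is a minimal demo wrapping a hypothetical `generate_video` function; that function, its parameters, and the UI layout are placeholders you would wire to your own Wan2.1 inference backend, not an official interface.

```python
import gradio as gr

def generate_video(prompt: str, num_frames: int):
    """Hypothetical placeholder: call your Wan2.1 inference backend here
    (e.g. a locally loaded checkpoint or an xDiT-accelerated worker) and
    return the path to the rendered .mp4 file."""
    raise NotImplementedError("wire this to your Wan2.1 inference code")

demo = gr.Interface(
    fn=generate_video,
    inputs=[
        gr.Textbox(label="Prompt", value="a cat surfing a wave at sunset"),
        gr.Slider(16, 128, value=48, step=16, label="Frames"),
    ],
    outputs=gr.Video(label="Generated video"),
    title="Wan2.1 text-to-video demo",
)

if __name__ == "__main__":
    demo.launch()
```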