Alibaba's open-source models have long attracted significant attention. The Qwen series, released last June, was well received by the developer community, with its 72B and 110B models frequently topping Hugging Face's open-source model rankings. DeepSeek-V3, released in December of the same year, then overtook the Qwen series within a month of its debut.
According to the latest rankings from the open-source community Hugging Face, Alibaba's Wanxiang large model, open-sourced just a week ago, has topped both the trending-model and model-space charts, leaving DeepSeek-R1 behind. Wanxiang 2.1 (Wan2.1) has already accumulated more than one million downloads across Hugging Face and the ModelScope community. Alibaba has open-sourced two parameter sizes, 14B and 1.3B, both supporting text-to-video and image-to-video tasks.
Wan 2.1 Overview
Wan 2.1 is a comprehensive, open-source suite of video foundation models developed by Alibaba, aiming to push the boundaries of video generation technology. Built on the mainstream diffusion Transformer architecture, it combines a novel spatiotemporal variational autoencoder (VAE), a scalable pre-training strategy, large-scale data construction, and automated evaluation metrics to improve the model's generation quality, performance, and versatility.
The model includes multiple versions with different parameter sizes, such as T2V-1.3B and T2V-14B (text-to-video models), and I2V-14B-720P and I2V-14B-480P (image-to-video models), catering to diverse user needs and application scenarios.
Wan 2.1 Key Features
- Superior Performance That Outpaces Peers: In numerous benchmark tests, Wan 2.1 consistently surpasses existing open-source models and leading commercial solutions, achieving industry-leading video generation quality, detail, and realism. For example, it topped the VBench leaderboard with a total score of 86.22%, beating well-known models such as Sora and HunyuanVideo.
- Supports Consumer-Grade GPU Operation: The T2V-1.3B version is hardware-friendly, requiring only 8.19GB of VRAM, allowing it to run on consumer-grade GPUs like the RTX 4090. On an RTX 4090, it can generate a 5-second 480P video in approximately 4 minutes, rivaling some closed-source models in performance and lowering the barrier to entry for individual developers and researchers.
- Comprehensive Multi-Task Coverage: Possesses strong multi-tasking capabilities, covering text-to-video (T2V), image-to-video (I2V), video editing, text-to-image (T2I), and video-to-audio (V2A) functionalities. Users can generate videos from text descriptions, transform static images into dynamic videos, edit existing videos, generate images from text, and automatically match audio to videos.
- Unique Advantages in Visual Text Generation: It is the first video model able to generate Chinese and English text directly within videos, with rich text effects that adapt to the scene and to the surface carrying the text and move along with it. It accurately renders a wide range of text styles, from special-effect and poster fonts to text embedded in real-world scenes, enriching video creation.
- Accurate Reproduction of Complex Movement: Excels at generating realistic videos with complex motion, accurately depicting rotations, jumps, dance moves, fast-moving objects, and scene transitions, for example synchronized hip-hop choreography, a smooth basketball shot, or a dog running naturally through snow.
- High-Fidelity Physical Simulation: Accurately simulates real-world physics and interactions between objects, realistically depicting collisions, rebounds, cutting, liquid flow, and changes in light and shadow. For example, it can capture milk spilling from an overturned glass or the splash when a strawberry drops into water, making the generated videos more believable.
- Cinema-Quality Rendering: Generates videos with cinematic quality, featuring rich textures and diverse stylistic effects. By adjusting parameters and settings, different visual styles such as retro, sci-fi, and realism can be achieved, providing users with a high-quality visual experience. For example, a simulation of a drone flying through skyscrapers at night can realistically render complex lighting effects and architectural styles, creating a stunning visual atmosphere.
- Precise Adherence to Long Text Instructions: Strong comprehension of complex long text instructions, generating videos that precisely match the description, ensuring detail completeness. Whether it's multi-subject movement scenes or complex environment construction and atmosphere creation, Wan 2.1 accurately captures the requirements. For example, based on the long text "A lively party scene, a group of young people of diverse ethnicities are dancing enthusiastically in the spacious and bright living room...", it can generate a vivid video that matches the description, accurately depicting the characters, movements, and scene atmosphere.
Application Scenarios
- Advertisement Production: Advertising companies can use Wan 2.1 to quickly generate engaging advertising videos based on product characteristics and promotional needs. For example, when producing electronics advertisements, the model can generate promotional videos that highlight product advantages by describing the functions and features through text, combined with cool special effects and scenes.
- Short Video Creation: Individual creators can use Wan 2.1 to transform creative text or images into interesting videos when creating content on short video platforms. For example, to produce a food short video, inputting text like "the process of making a delicious cake" can generate the corresponding video, adding suitable music and text effects to enhance video quality and appeal.
- Filmmaking Assistance: Film production teams can use Wan 2.1 to quickly visualize scenes from screenplays during the early creative conception and concept validation stages. For example, directors can input screenplay segments to generate simple video samples to evaluate scene effects and adjust shooting plans, saving time and costs.
- Education and Teaching: Teachers can use Wan 2.1 to present abstract knowledge in a vivid video format when creating teaching videos. For example, in physics teaching, the model can simulate object movement and physical phenomena to help students better understand knowledge points; in language teaching, it can generate videos containing dialogue scenes to create a language learning environment.
- Game Development: Game developers can use Wan 2.1 to produce game promotional videos and cutscenes. By inputting descriptions of characters, scenes, and plots in the game, the model can generate exquisite videos for game promotion and enhanced player experience.
Wan 2.1 Tutorial
The steps below cover environment setup, model download, and each generation mode; annotated command sketches follow the list.
- Environment Setup: First, make sure your hardware meets the requirements; the T2V-1.3B model needs a consumer-grade GPU (such as an RTX 4090) with at least 8.19GB of VRAM. Clone the repository with `git clone https://github.com/Wan-Video/Wan2.1.git`, enter the project directory with `cd Wan2.1`, and install the dependencies with `pip install -r requirements.txt`, making sure `torch >= 2.4.0`.
- Model Download: Use `huggingface-cli` or `modelscope-cli` to download a model. With `huggingface-cli`, first run `pip install "huggingface_hub[cli]"`, then download the desired checkpoint; for the T2V-14B model, for example, run `huggingface-cli download Wan-AI/Wan2.1-T2V-14B --local-dir ./Wan2.1-T2V-14B`. Download links and supported resolutions for the other models, such as I2V-14B-720P, I2V-14B-480P, and T2V-1.3B, are listed in the official documentation.
- Text-to-Video Generation:
  - Single GPU Inference without Prompt Extension: Run `python generate.py --task t2v-14B --size 1280*720 --ckpt_dir ./Wan2.1-T2V-14B --prompt "Specific text description"` in the terminal, where `--prompt` contains the video description. If you use the T2V-1.3B model and run out of memory, add `--offload_model True --t5_cpu` and tune `--sample_shift` (8-12) and `--sample_guide_scale 6` according to performance.
  - Multi-GPU Inference without Prompt Extension (FSDP + xDiT USP): First install `xfuser` with `pip install "xfuser>=0.4.1"`, then launch multi-GPU inference with `torchrun`, for example `torchrun --nproc_per_node=8 generate.py --task t2v-14B --size 1280*720 --ckpt_dir ./Wan2.1-T2V-14B --dit_fsdp --t5_fsdp --ulysses_size 8 --prompt "Specific text description"`.
  - Using Prompt Extension: To use the Dashscope API for prompt extension, apply for a `dashscope.api_key` in advance and set the environment variable `DASH_API_KEY`; for example, run `DASH_API_KEY=your_key python generate.py --task t2v-14B --size 1280*720 --ckpt_dir ./Wan2.1-T2V-14B --prompt "Specific text description" --use_prompt_extend --prompt_extend_method 'dashscope' --prompt_extend_target_lang 'zh'`. For local-model extension, the default is a Qwen model from Hugging Face; choose a size that fits your GPU memory, such as `Qwen/Qwen2.5-14B-Instruct`, and pass it via `--prompt_extend_model`, for example `python generate.py --task t2v-14B --size 1280*720 --ckpt_dir ./Wan2.1-T2V-14B --prompt "Specific text description" --use_prompt_extend --prompt_extend_method 'local_qwen' --prompt_extend_target_lang 'zh'`.
  - Running the Local Gradio Demo: Navigate to the `gradio` directory. With Dashscope prompt extension, run `DASH_API_KEY=your_key python t2v_14B_singleGPU.py --prompt_extend_method 'dashscope' --ckpt_dir ./Wan2.1-T2V-14B`; with local-model extension, run `python t2v_14B_singleGPU.py --prompt_extend_method 'local_qwen' --ckpt_dir ./Wan2.1-T2V-14B`.
- Image-to-Video Generation: The workflow mirrors text-to-video, with and without prompt extension. Without prompt extension, single-GPU inference runs `python generate.py --task i2v-14B --size 1280*720 --ckpt_dir ./Wan2.1-I2V-14B-720P --image examples/i2v_input.JPG --prompt "Specific text description"`. Note that the `--size` parameter should be chosen according to the aspect ratio of the input image. For multi-GPU inference, first install `xfuser`, then run `torchrun --nproc_per_node=8 generate.py --task i2v-14B --size 1280*720 --ckpt_dir ./Wan2.1-I2V-14B-720P --image examples/i2v_input.JPG --dit_fsdp --t5_fsdp --ulysses_size 8 --prompt "Specific text description"`. For prompt extension, follow the text-to-video method above and choose either the Dashscope API or a local model. For the local Gradio demo, run the command matching your model version from the `gradio` directory; for example, with the 720P model and Dashscope prompt extension, run `DASH_API_KEY=your_key python i2v_14B_singleGPU.py --prompt_extend_method 'dashscope' --ckpt_dir_720p ./Wan2.1-I2V-14B-720P`.
- Text-to-Image Generation: Without prompt extension, single-GPU inference runs `python generate.py --task t2i-14B --size 1024*1024 --ckpt_dir ./Wan2.1-T2V-14B --prompt 'Specific text description'`; multi-GPU inference runs `torchrun --nproc_per_node=8 generate.py --dit_fsdp --t5_fsdp --ulysses_size 8 --base_seed 0 --frame_num 1 --task t2i-14B --size 1024*1024 --prompt 'Specific text description' --ckpt_dir ./Wan2.1-T2V-14B`. To enable prompt extension, add the `--use_prompt_extend` flag to either the single-GPU or the multi-GPU command.
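For convenience, here is the environment setup and model download collected into one runnable sketch. The 14B repo id is the one given above; the 1.3B repo id is an assumption based on the same naming pattern, so check the official model list if it does not resolve.

```bash
# Clone the Wan2.1 repository and install its dependencies (torch >= 2.4.0 required).
git clone https://github.com/Wan-Video/Wan2.1.git
cd Wan2.1
pip install -r requirements.txt

# Download checkpoints with the Hugging Face CLI.
pip install "huggingface_hub[cli]"
huggingface-cli download Wan-AI/Wan2.1-T2V-14B --local-dir ./Wan2.1-T2V-14B

# Assumed repo id for the lighter 1.3B text-to-video checkpoint, following the
# naming of the 14B model above; verify against the official documentation.
huggingface-cli download Wan-AI/Wan2.1-T2V-1.3B --local-dir ./Wan2.1-T2V-1.3B
```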
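A minimal single-GPU text-to-video sketch. The 14B command is the one shown in the tutorial; the 1.3B variant adds the memory-saving flags mentioned above, and its `--task` name and 480P `--size` value are assumptions about the repository's conventions rather than values stated here. Prompts are placeholders.

```bash
# 14B text-to-video on a single GPU, no prompt extension.
python generate.py --task t2v-14B --size 1280*720 \
  --ckpt_dir ./Wan2.1-T2V-14B \
  --prompt "A dog running naturally through fresh snow at sunset"

# 1.3B text-to-video on a consumer GPU (e.g. RTX 4090), offloading the model
# and keeping the T5 encoder on CPU to stay within roughly 8GB of VRAM.
# Assumed task name (t2v-1.3B) and 480P size (832*480); confirm in the README.
python generate.py --task t2v-1.3B --size 832*480 \
  --ckpt_dir ./Wan2.1-T2V-1.3B \
  --offload_model True --t5_cpu \
  --sample_shift 8 --sample_guide_scale 6 \
  --prompt "A dog running naturally through fresh snow at sunset"
```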
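A sketch of the multi-GPU text-to-video path. The commands follow the tutorial; the flag comments reflect my reading of the "FSDP + xDiT USP" setup named above and may simplify the details.

```bash
# xfuser provides the xDiT Ulysses sequence-parallel (USP) backend.
pip install "xfuser>=0.4.1"

# 14B text-to-video across 8 GPUs: --dit_fsdp / --t5_fsdp shard the diffusion
# Transformer and the T5 text encoder with FSDP, and --ulysses_size sets the
# sequence-parallel degree (kept equal to the GPU count, as in the tutorial).
torchrun --nproc_per_node=8 generate.py --task t2v-14B --size 1280*720 \
  --ckpt_dir ./Wan2.1-T2V-14B \
  --dit_fsdp --t5_fsdp --ulysses_size 8 \
  --prompt "A dog running naturally through fresh snow at sunset"
```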
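Prompt extension and the local Gradio demo, combined into one sketch. The commands come from the tutorial; the API key placeholder and the Qwen model choice are the examples given above, and the prompt is a placeholder.

```bash
# Prompt extension via the Dashscope API (requires a dashscope.api_key).
export DASH_API_KEY=your_key
python generate.py --task t2v-14B --size 1280*720 \
  --ckpt_dir ./Wan2.1-T2V-14B \
  --use_prompt_extend --prompt_extend_method 'dashscope' --prompt_extend_target_lang 'zh' \
  --prompt "A lively party scene in a spacious, bright living room"

# Prompt extension with a local Qwen model instead; pick a model size that
# fits your GPU memory.
python generate.py --task t2v-14B --size 1280*720 \
  --ckpt_dir ./Wan2.1-T2V-14B \
  --use_prompt_extend --prompt_extend_method 'local_qwen' \
  --prompt_extend_model Qwen/Qwen2.5-14B-Instruct --prompt_extend_target_lang 'zh' \
  --prompt "A lively party scene in a spacious, bright living room"

# Local Gradio demo for the 14B text-to-video model, run from the gradio
# directory (point --ckpt_dir at the downloaded checkpoint).
cd gradio
DASH_API_KEY=your_key python t2v_14B_singleGPU.py \
  --prompt_extend_method 'dashscope' --ckpt_dir ./Wan2.1-T2V-14B
```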
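An image-to-video sketch using the example image bundled with the repository; the prompt is a placeholder, and `--size` should match the aspect ratio of the input image, as noted above.

```bash
# Image-to-video with the 720P checkpoint on a single GPU.
python generate.py --task i2v-14B --size 1280*720 \
  --ckpt_dir ./Wan2.1-I2V-14B-720P \
  --image examples/i2v_input.JPG \
  --prompt "The subject of the photo slowly turns toward the camera and smiles"

# The same job across 8 GPUs with FSDP + xDiT USP (requires xfuser, installed above).
torchrun --nproc_per_node=8 generate.py --task i2v-14B --size 1280*720 \
  --ckpt_dir ./Wan2.1-I2V-14B-720P \
  --image examples/i2v_input.JPG \
  --dit_fsdp --t5_fsdp --ulysses_size 8 \
  --prompt "The subject of the photo slowly turns toward the camera and smiles"
```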
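Finally, a text-to-image sketch. Both commands reuse the T2V-14B checkpoint, as the tutorial does; add `--use_prompt_extend` to either one to enable prompt extension. The prompt is a placeholder.

```bash
# Text-to-image on a single GPU.
python generate.py --task t2i-14B --size 1024*1024 \
  --ckpt_dir ./Wan2.1-T2V-14B \
  --prompt 'A movie poster with bold Chinese and English title text'

# Multi-GPU text-to-image; --frame_num 1 limits generation to a single frame,
# as in the tutorial's command.
torchrun --nproc_per_node=8 generate.py --dit_fsdp --t5_fsdp --ulysses_size 8 \
  --base_seed 0 --frame_num 1 --task t2i-14B --size 1024*1024 \
  --ckpt_dir ./Wan2.1-T2V-14B \
  --prompt 'A movie poster with bold Chinese and English title text'
```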
Conclusion
In the booming development of AI technology, the video generation field continues to see innovative breakthroughs. Alibaba's Wan 2.1 open-source video generation model has attracted much attention. It not only surpasses many similar products in performance but also provides developers and creators with powerful and easy-to-use tools, sparking widespread industry interest.
If Wan 2.1 has caught your attention, try it out and see what its strengths add to your own creations. Like, comment, and share your results, and join us in watching AI video generation reach new heights. Keep an eye on Wan 2.1; its future iterations may bring even more surprises and reshape how we think about video creation.