Alibaba recently open-sourced its latest first-last frame video generation model, Wan2.1-FLF2V-14B, capable of generating 5-second, 720p high-definition videos. The model has drawn significant attention for its innovative first-last frame control technology, which opens new possibilities in AI video generation. According to AIbase, the model was launched on GitHub and Hugging Face in February 2025 and is freely available to developers, researchers, and commercial organizations worldwide, marking another milestone in Alibaba's open-source AI ecosystem.
Core Function: First-Last Frame Driven, Generating Smooth High-Definition Videos
Wan2.1-FLF2V-14B uses the first and last frames as control conditions: users provide just two images, and the model automatically generates a 5-second, 720p video between them. AIbase observed that the generated videos are smooth, with seamless transitions between the first and last frames; image details closely match the reference frames, and overall content consistency is markedly improved. Compared with traditional video generation models, this precise conditional control addresses the image jitter and content drift that commonly afflict long-sequence video generation, offering an efficient path to high-quality video creation.
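For readers who prefer a Python API over the CLI shown later in this article, here is a hedged sketch of what first-last frame generation might look like through Hugging Face diffusers. The checkpoint id Wan-AI/Wan2.1-FLF2V-14B-720P-diffusers, the WanImageToVideoPipeline class, and its last_image argument are assumptions about the diffusers integration at the time of writing; verify them against your installed version before relying on this.

```python
# Hedged sketch of first-last frame generation via Hugging Face diffusers.
# Assumptions to verify: the repo id below and the `last_image` argument,
# both of which may differ across diffusers versions.
import torch
from diffusers import WanImageToVideoPipeline
from diffusers.utils import export_to_video, load_image

model_id = "Wan-AI/Wan2.1-FLF2V-14B-720P-diffusers"  # assumed repo id
pipe = WanImageToVideoPipeline.from_pretrained(model_id, torch_dtype=torch.bfloat16)
pipe.to("cuda")

first_frame = load_image("examples/first.jpg")  # placeholder input paths
last_frame = load_image("examples/last.jpg")

video = pipe(
    image=first_frame,
    last_image=last_frame,  # assumed parameter enabling first-last frame control
    prompt="A smooth transition from a sunny beach to a starry night",
    height=720,
    width=1280,
    num_frames=81,          # roughly 5 seconds at 16 fps
    guidance_scale=5.5,
).frames[0]

export_to_video(video, "output.mp4", fps=16)
```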
Technical Highlights: CLIP and DiT Integration Enhance Generation Stability
According to AIbase analysis, Wan2.1-FLF2V-14B employs advanced first-last frame conditional control technology, primarily based on the following innovations:
CLIP Semantic Feature Extraction: The CLIP model extracts semantic information from the first and last frames to ensure that the generated video is highly consistent with the input images in terms of visual content.
Cross-Attention Mechanism: The first and last frame features are injected into the Diffusion Transformer (DiT) generation process via cross-attention, enhancing image stability and temporal coherence.
Data-Driven Training: The model is trained on a massive dataset of 150 million videos and 1 billion images, enabling it to generate dynamic content that conforms to real-world physical laws.
The combination of these technologies enables Wan2.1-FLF2V-14B to excel in generating complex motion scenes, making it particularly suitable for creative applications requiring high-fidelity transitions.
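To make this conditioning pattern concrete, the following is a minimal, self-contained PyTorch sketch, not Wan2.1's actual architecture: a CLIP vision encoder (from Hugging Face transformers) extracts tokens from the first and last frames, and a DiT-style block injects them into the generation stream via cross-attention. All module names, dimensions, and the file paths first.jpg/last.jpg are illustrative assumptions.

```python
# Minimal sketch of CLIP-conditioned cross-attention, in the spirit of the
# mechanism described above. Illustrative only; not Wan2.1's real architecture.
import torch
import torch.nn as nn
from PIL import Image
from transformers import CLIPImageProcessor, CLIPVisionModel

class FrameConditionedBlock(nn.Module):
    """One transformer block whose cross-attention reads first/last-frame tokens."""
    def __init__(self, dim: int = 1024, heads: int = 16, clip_dim: int = 1024):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.proj = nn.Linear(clip_dim, dim)  # map CLIP features into model width
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)
        self.norm3 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))

    def forward(self, x: torch.Tensor, frame_ctx: torch.Tensor) -> torch.Tensor:
        # x: (B, N, dim) video latent tokens; frame_ctx: (B, M, clip_dim) CLIP tokens
        h = self.norm1(x)
        x = x + self.self_attn(h, h, h)[0]
        ctx = self.proj(frame_ctx)
        x = x + self.cross_attn(self.norm2(x), ctx, ctx)[0]  # inject frame semantics
        return x + self.mlp(self.norm3(x))

# Encode the two control frames with CLIP, then run one block over dummy latents.
processor = CLIPImageProcessor.from_pretrained("openai/clip-vit-large-patch14")
clip = CLIPVisionModel.from_pretrained("openai/clip-vit-large-patch14").eval()

frames = [Image.open("first.jpg"), Image.open("last.jpg")]  # placeholder inputs
pixels = processor(images=frames, return_tensors="pt").pixel_values  # (2, 3, 224, 224)
with torch.no_grad():
    tokens = clip(pixel_values=pixels).last_hidden_state  # (2, 257, 1024)
frame_ctx = tokens.reshape(1, -1, 1024)  # concatenate first + last frame tokens

block = FrameConditionedBlock()
latents = torch.randn(1, 512, 1024)  # stand-in for DiT video tokens
print(block(latents, frame_ctx).shape)  # torch.Size([1, 512, 1024])
```

In a real diffusion transformer, a stack of such blocks runs at every denoising step, so the frame semantics constrain the whole generated trajectory rather than only its endpoints.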
Wide Applications: Empowering Content Creation and Research
The open-sourcing of Wan2.1-FLF2V-14B offers vast application prospects across multiple fields. AIbase has summarized its main application scenarios:
Film and Advertising: Quickly generate high-quality transition videos, reducing post-production costs.
Game Development: Generate dynamic cutscenes for game environments, improving development efficiency.
Education and Research: Support researchers in exploring video generation technology and developing new AI applications.
Personalized Creation: Ordinary users can generate personalized short videos through simple input, enriching social media content.
Notably, the model supports Chinese prompts and performs especially well in Chinese-language scenarios, demonstrating its adaptability to multilingual environments.
Ease of Use: Adaptable to Consumer-Grade Hardware
Wan2.1-FLF2V-14B demonstrates strong accessibility on the hardware side. AIbase understands that although the FLF2V model itself has 14 billion parameters, the Wan2.1 family is optimized for consumer-grade GPUs such as the RTX 4090: the lightweight 1.3B variant runs in as little as 8.19 GB of VRAM and generates a 5-second 480p video in approximately 4 minutes, while 720p generation with the 14B model remains within a reasonable time on suitably equipped hardware. Furthermore, the model ships with detailed deployment instructions; users can get started with the following command:
```bash
python generate.py --task flf2v-14B --size 1280*720 \
    --ckpt_dir ./Wan2.1-FLF2V-14B \
    --first_frame examples/first.jpg \
    --last_frame examples/last.jpg \
    --prompt "A smooth transition from a sunny beach to a starry night"
```
The open-source community also provides a Gradio-based web UI, further reducing the barrier to entry for non-technical users.
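As a sense of how little glue such a UI needs, here is a minimal Gradio sketch that wraps the CLI above. It is illustrative only: the --save_file output flag is an assumption about generate.py's interface and should be checked against the repository (for example via python generate.py --help).

```python
# Minimal Gradio front end wrapping the generate.py CLI shown above.
# Illustrative wiring; verify generate.py's actual output flag.
import subprocess
import gradio as gr

def generate_video(first_frame: str, last_frame: str, prompt: str) -> str:
    out_path = "flf2v_output.mp4"
    subprocess.run(
        [
            "python", "generate.py",
            "--task", "flf2v-14B",
            "--size", "1280*720",
            "--ckpt_dir", "./Wan2.1-FLF2V-14B",
            "--first_frame", first_frame,
            "--last_frame", last_frame,
            "--prompt", prompt,
            "--save_file", out_path,  # assumed flag; check generate.py --help
        ],
        check=True,
    )
    return out_path

demo = gr.Interface(
    fn=generate_video,
    inputs=[
        gr.Image(type="filepath", label="First frame"),
        gr.Image(type="filepath", label="Last frame"),
        gr.Textbox(label="Prompt"),
    ],
    outputs=gr.Video(label="Generated video"),
    title="Wan2.1-FLF2V-14B demo",
)

if __name__ == "__main__":
    demo.launch()
```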
Community Feedback and Future Outlook
Since its release, Wan2.1-FLF2V-14B has generated enthusiastic responses in the open-source community. Developers highly praise its generation quality, hardware friendliness, and open-source strategy. AIbase has noted that the community has begun secondary development around the model, exploring more complex video editing functions, such as dynamic subtitle generation and multilingual dubbing. In the future, Alibaba plans to further optimize the model to support higher resolutions (such as 8K) and longer video generation, while also expanding its applications in areas such as video-to-audio (V2A).
Project Address: https://github.com/Wan-Video/Wan2.1