AI-powered video generation technology is advancing rapidly, and an open-source video model called Pusa has recently drawn significant industry attention. Fine-tuned from Mochi, a leading open-source video generation system, the model not only delivers solid results but, more importantly, fully open-sources the entire fine-tuning process, including the training tools and the dataset, at a training cost of roughly $100, opening up new possibilities for research and application in video generation.


Mochi-Based Fine-tuning, Showcasing Initial Video Generation Capabilities

Pusa-V0.5 is a preliminary version of the Pusa model, built on Mochi1-Preview, a leading open-source video generation system on the Artificial Analysis Leaderboard, as its base. Thanks to this fine-tuning, Pusa supports a range of video generation tasks, including text-to-video, image-to-video, frame interpolation, video transitions, seamless looping, and extended video generation. Although the generated videos are currently limited to a relatively low resolution (480p), the model shows promise in motion fidelity and prompt adherence.

Completely Open-Sourced Fine-tuning Workflow, Driving Community Collaboration

One of the most remarkable features of the Pusa project is that it is fully open-sourced. Developers gain access to the full code repository and detailed architecture specifications, as well as the complete training methodology. Researchers and developers can therefore study Pusa's fine-tuning process in depth, reproduce the experiments, and build their own innovations and improvements on top of it. This openness should substantially boost community collaboration and development.

Surprisingly Low Training Cost

Compared with large video models whose training often costs tens or even hundreds of thousands of dollars, Pusa's training cost stands out. Reportedly, Pusa was trained on just 16 H800 GPUs for approximately 500 iterations, consuming only about 100 H800 GPU hours at a total cost of approximately $100. Such a low cost opens the door for more research institutions and individual developers to participate in video model research and development. The project team also notes that efficiency could be improved further through single-node training and more advanced parallelization techniques.
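For reference, the reported figures are internally consistent under a simple pricing assumption. The back-of-the-envelope check below is illustrative only; the $1 per H800 GPU-hour rental rate is an assumption chosen to match the reported total, not a figure published by the project:

```python
# Back-of-the-envelope check of the reported Pusa training cost.
gpus = 16                       # reported H800 GPU count
gpu_hours_total = 100           # "0.1k H800 GPU hours" from the report
wall_clock_hours = gpu_hours_total / gpus   # ~6.25 hours of wall-clock time
cost_per_gpu_hour = 1.0         # assumed USD rental rate (not from the project)
total_cost = gpu_hours_total * cost_per_gpu_hour   # ~$100

print(f"wall-clock: {wall_clock_hours:.2f} h, estimated cost: ${total_cost:.0f}")
```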

Pusa adopts a novel diffusion paradigm based on frame-level noise control with vectorized timesteps, an approach first proposed in the FVDM paper, which brings considerable flexibility and scalability to video diffusion modeling. The adjustments made to the base model are also non-destructive: Pusa retains the original Mochi's text-to-video generation capability and requires only light fine-tuning.
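To make the idea concrete, the sketch below shows the general shape of frame-level noise control with vectorized timesteps: each frame of a clip receives its own noise level, and the denoiser is conditioned on a vector of timesteps rather than a single scalar. All names, shapes, and the noise schedule here are illustrative assumptions and are not taken from the Pusa or FVDM codebases:

```python
import torch

def add_frame_level_noise(video: torch.Tensor, num_timesteps: int = 1000):
    """Noise a video clip with an independent timestep per frame.

    video: tensor of shape (batch, frames, channels, height, width)
    Returns the noised video, the sampled noise, and the timestep vector.
    """
    b, f = video.shape[0], video.shape[1]
    # One independent timestep per frame -> shape (batch, frames),
    # instead of the usual single scalar timestep per sample.
    t = torch.randint(0, num_timesteps, (b, f), device=video.device)
    # Simple linear schedule, purely for illustration.
    alpha = 1.0 - t.float() / num_timesteps
    alpha = alpha.view(b, f, 1, 1, 1)  # broadcast over channels and spatial dims
    noise = torch.randn_like(video)
    noisy_video = alpha.sqrt() * video + (1.0 - alpha).sqrt() * noise
    return noisy_video, noise, t
```

Conditioning the denoiser on the whole timestep vector is what lets a single model cover tasks like image-to-video or frame interpolation, since some frames can be kept nearly clean while others are fully noised.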

Project: https://top.aibase.com/tool/pusa