Today, Step Star and Geely Automobile Group announced a collaboration to jointly open source two models from the Step series of multimodal large models: the Step-Video-T2V video generation model and the Step-Audio speech model.
The Step-Video-T2V video generation model leads globally in both parameter count and performance. It has 30 billion parameters and can directly generate high-quality 204-frame videos at 540P resolution, giving the generated content high information density and strong consistency. Evaluation results show that Step-Video-T2V excels in instruction adherence, motion smoothness, physical realism, and aesthetic quality, significantly surpassing the best existing open-source video generation models.
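To put those numbers in rough perspective, the short calculation below estimates the playback length and raw pixel volume of a 204-frame clip. The 960×540 frame size and the 24 fps playback rate are assumptions for illustration only; the announcement does not specify the exact output dimensions or frame rate.

```python
# Back-of-the-envelope numbers for a 204-frame clip at "540P".
# The frame size and frame rate below are assumptions for illustration;
# neither is stated in the announcement.
frames = 204
width, height = 960, 540          # assumed 540P dimensions
fps = 24                          # hypothetical playback rate

duration_s = frames / fps
raw_bytes = frames * width * height * 3   # uncompressed RGB, 1 byte per channel

print(f"~{duration_s:.1f} s of video")             # ~8.5 s
print(f"~{raw_bytes / 1e6:.0f} MB of raw pixels")  # ~317 MB
```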
Both models are now live on the Yuewen App, where developers can try them out and share feedback.
The Step-Video-T2V video generation model demonstrates exceptional generation capabilities for complex motion, appealing characters, and visual imagination. It accurately understands instructions and efficiently helps video creators bring their ideas to the screen. Whether it is the elegance of ballet, the intensity of karate, the fast exchanges of badminton, or the rapid flips of diving, Step-Video-T2V can generate realistic scenes that obey physical laws.
It also supports a variety of camera movements and scene transitions, and can produce visually striking shots with large camera motion. Generated characters appear more realistic and vivid, with rich detail and natural expressions.
Step-Audio GitHub: https://github.com/stepfun-ai/Step-Audio
Step-Audio Hugging Face: https://huggingface.co/collections/stepfun-ai/step-audio-67b33accf45735bb21131b0b
Step-Audio Technical Report: https://github.com/stepfun-ai/Step-Audio/blob/main/assets/Step-Audio.pdf
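As a minimal, hedged sketch of getting started with the Step-Audio release linked above: the GitHub URL is taken verbatim from the list, while the Hugging Face repository id `stepfun-ai/Step-Audio-Chat` is an assumption about one checkpoint inside the linked collection, so confirm the exact model names on the collection page before downloading.

```python
# Sketch: clone the Step-Audio code linked above and pull one checkpoint
# from the linked Hugging Face collection.
import subprocess
from huggingface_hub import snapshot_download

# Repository URL copied from the links above.
subprocess.run(
    ["git", "clone", "https://github.com/stepfun-ai/Step-Audio"],
    check=True,
)

# The checkpoint repo id is an assumption; browse the linked collection to
# confirm the actual model names.
ckpt_dir = snapshot_download(
    repo_id="stepfun-ai/Step-Audio-Chat",
    local_dir="./Step-Audio-Chat",
)
print(f"Checkpoint downloaded to {ckpt_dir}")
```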