Beijing TuSimple Future Technology Co., Ltd. officially released "Ruyi", its first image-to-video large model, on December 17, 2024, and open-sourced the Ruyi-Mini-7B version for download from the Hugging Face platform. TuSimple was founded in 2015 and is headquartered in San Diego, California, USA, focusing on applying AI technology across industries including animation, gaming, and transportation.

The Ruyi large model is designed to run on consumer-grade graphics cards, and ships with detailed deployment instructions and ComfyUI workflows so users can get started quickly. Having been trained extensively on anime and gaming scenarios, it delivers strong frame consistency, motion fluidity, color presentation, and composition, offering new possibilities for visual storytelling and making it well suited to ACG enthusiasts.


The Ruyi model supports multi-resolution, multi-duration generation: it handles resolutions from 384×384 to 1024×1024 at any aspect ratio, and generates videos up to 120 frames (about 5 seconds) long. It also supports first-frame and first-and-last-frame conditioned generation, motion amplitude control, and five types of camera control. Ruyi is based on the DiT architecture, composed of a Causal VAE module and a Diffusion Transformer, with a total parameter count of approximately 7.1B, trained on about 200M video clips.
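To make these controls concrete, here is a minimal sketch of what a first-frame-conditioned generation call might look like. The official inference entry points are distributed through TuSimple's repository and ComfyUI workflows; the `pipeline` object and the parameter names below (`start_img`, `motion`, `camera_direction`) are illustrative assumptions, not the confirmed API.

```python
# Hypothetical sketch of an image-to-video call with Ruyi-Mini-7B.
# The parameter names are assumptions for illustration; consult the
# official repository / ComfyUI workflow for the real interface.
from PIL import Image

def generate_video(pipeline, first_frame_path: str) -> list:
    """Generate up to ~5 s of video conditioned on one keyframe."""
    first_frame = Image.open(first_frame_path).convert("RGB")

    video_frames = pipeline(
        start_img=first_frame,      # first-frame conditioning
        width=1024, height=1024,    # any size from 384x384 up to 1024x1024
        num_frames=120,             # 120 frames = ~5 s at 24 fps
        motion="auto",              # hypothetical motion-amplitude setting
        camera_direction="static",  # one of the five camera-control modes
    )
    return video_frames
```

Note that 120 frames over 5 seconds implies output at 24 fps, which is why the frame count and duration limits coincide.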

Despite Ruyi's significant technical advances, some issues remain, such as hand distortion, facial detail collapse in multi-person scenes, and uncontrollable scene transitions. TuSimple is actively working to address these in future updates.

Looking ahead, TuSimple plans to continue focusing on scene-specific requirements, achieve breakthroughs in direct CUT generation, and offer two versions in the next release to meet the needs of different creators. The company is committed to using large models to reduce the development cycle and cost of anime and game content. Ruyi can already generate 5 seconds of content from an input keyframe, or generate the intermediate transition between two keyframes, shortening the production cycle (see the sketch below).
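As a companion to the earlier example, the sketch below shows how the two-keyframe transition workflow might be invoked: given a first and a last frame, the model fills in the intermediate frames. Again, the `start_img`/`end_img` parameter names are assumptions for illustration rather than the confirmed API.

```python
# Hypothetical sketch of two-keyframe interpolation with Ruyi.
# Parameter names are illustrative assumptions, not the confirmed API.
from PIL import Image

def interpolate_keyframes(pipeline, first_path: str, last_path: str) -> list:
    """Generate the transition frames between two keyframes."""
    first = Image.open(first_path).convert("RGB")
    last = Image.open(last_path).convert("RGB")

    # First-and-last-frame control: the model synthesizes the
    # in-between motion connecting the two supplied keyframes.
    return pipeline(
        start_img=first,
        end_img=last,
        num_frames=120,  # up to ~5 s of transition at 24 fps
    )
```

In an animation pipeline, this is the step that replaces hand-drawn in-betweening: artists supply only the key poses, and the model produces the connecting footage.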

Hugging Face link: https://huggingface.co/IamCreateAI/Ruyi-Mini-7B