The field of video generation has achieved a significant breakthrough: Genmo has released its latest video generation model, Mochi1, as an open-source project, setting a new standard in the industry. Mochi1 employs a novel Asymmetric Diffusion Transformer (AsymmDiT) architecture with 10 billion parameters, making it the largest openly released video generation model to date. Notably, it was trained from scratch with a simple, easily modifiable architecture, making it straightforward for the open-source community to build on.

The standout feature of Mochi1 is its exceptional motion quality and precise adherence to text prompts. It can generate smooth videos up to 5.4 seconds long at 30 frames per second, with impressive temporal coherence and realistic motion dynamics. Mochi1 can also simulate various physical phenomena, such as fluid dynamics and hair movement, producing naturally flowing, human-like motion that nearly matches real-life performances.

To facilitate developer usage, Genmo has also open-sourced its video VAE, which can compress videos to 1/128 of their original size, effectively reducing computational and memory demands. The AsymmDiT architecture processes user prompts and compressed video tokens efficiently through a multi-modal self-attention mechanism and learns separate MLP layers for each modality, further enhancing the model's efficiency and performance.
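The asymmetric design described above can be illustrated with a toy sketch: text and video tokens are mixed in a single joint self-attention pass, then routed through separate, modality-specific MLPs. This is a minimal numpy illustration with random weights, not Genmo's actual implementation; all dimensions, function names, and the single-head attention layout are assumptions for exposition.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def asymm_block(text_tokens, video_tokens, rng):
    """Toy sketch of one AsymmDiT-style block: the two modalities share a
    single joint self-attention pass, then flow through separate MLPs."""
    d = text_tokens.shape[-1]
    # Joint multi-modal self-attention over the concatenated sequence,
    # so prompt tokens and video tokens can attend to each other.
    x = np.concatenate([text_tokens, video_tokens], axis=0)
    Wq, Wk, Wv = (rng.standard_normal((d, d)) / np.sqrt(d) for _ in range(3))
    q, k, v = x @ Wq, x @ Wk, x @ Wv
    attn = softmax(q @ k.T / np.sqrt(d)) @ v
    # Split the sequence back into its two modalities.
    t_part, v_part = np.split(attn, [text_tokens.shape[0]], axis=0)
    # Separate MLP weights per modality: the "asymmetric" part.
    W_text = rng.standard_normal((d, d)) / np.sqrt(d)
    W_video = rng.standard_normal((d, d)) / np.sqrt(d)
    return np.maximum(t_part @ W_text, 0), np.maximum(v_part @ W_video, 0)

rng = np.random.default_rng(0)
text = rng.standard_normal((4, 16))    # 4 prompt tokens, width 16
video = rng.standard_normal((32, 16))  # 32 compressed video tokens
t_out, v_out = asymm_block(text, video, rng)
print(t_out.shape, v_out.shape)  # (4, 16) (32, 16)
```

The design point this illustrates is that attention (where cross-modal interaction happens) is shared, while the per-token feed-forward capacity can be allocated unevenly between the cheap text stream and the much larger video stream.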


The release of Mochi1 marks a significant step forward in the open-source video generation field. Genmo plans to release the full version of Mochi1 by the end of the year, including Mochi1 HD, which will support 720p video generation with enhanced fidelity and smoothness.

To allow more people to experience the powerful capabilities of Mochi1, Genmo has launched a free hosted playground where users can try it out at genmo.ai/play. The weights and architecture of Mochi1 are also publicly available on the HuggingFace platform for developers to download and use.

Genmo is composed of core members from projects like DDPM, DreamFusion, and Emu Video, with an advisory team including industry leaders such as Ion Stoica, Executive Chairman and Co-founder of Databricks and Anyscale; Pieter Abbeel, Co-founder of Covariant and early team member of OpenAI; and Joey Gonzalez, pioneer in language model systems and Co-founder of Turi. Genmo's mission is to unlock the right brain of artificial general intelligence, with Mochi1 being the first step towards building a world simulator capable of imagining anything, possible or impossible.

Genmo recently completed a Series A financing round led by NEA, raising $28.4 million, which will provide ample funding for its future research and development.

While Mochi1 has achieved remarkable results, it still has some limitations. For instance, the initial version can only generate 480p videos and may exhibit slight warping and distortions in extreme motion edge cases. Additionally, Mochi1 is primarily optimized for photorealistic styles, and its performance on animated content has yet to be improved.

Genmo plans to continue refining Mochi1 and encourages the community to fine-tune the model to suit different aesthetic preferences. They have also implemented robust safety protocols in the playground to ensure all video generation complies with ethical standards.

Model download: https://huggingface.co/genmo/mochi-1-preview

Online experience: https://www.genmo.ai/play

Official introduction: https://www.genmo.ai/blog