Scene diversity and novelty have long been a challenge in game development. Recently, the University of Hong Kong, in collaboration with Kuaishou Technology, developed GameFactory, a framework aimed at solving the scene generalization problem in game video generation. By building on a video diffusion model pre-trained on open-domain video data, the framework can generate entirely new and diverse game scenes.
As an advanced generative technology, video diffusion models have shown great potential in video generation and physical simulation in recent years. Such models can respond to user inputs, like keyboard and mouse actions, and generate the corresponding game visuals. However, scene generalization, the ability to create entirely new game scenes beyond the ones seen in training, remains a significant challenge in this field. Collecting large amounts of action-labeled video is the most direct way to tackle the problem, but it is time-consuming and labor-intensive, especially in open-domain scenarios.
GameFactory was designed precisely to address this challenge. By leveraging a pre-trained video diffusion model, it reduces reliance on game-specific datasets while still supporting the generation of diverse game scenes. To bridge the gap between open-domain prior knowledge and the limited game dataset, GameFactory further employs a three-stage training strategy.
In the first stage, LoRA (Low-Rank Adaptation) is used to fine-tune the pre-trained model to the specific game domain while keeping the original parameters intact. In the second stage, the pre-trained parameters are frozen and training focuses solely on the action control module, so that action control does not become entangled with game style. In the third stage, the LoRA weights are removed and the action control module parameters are retained, allowing the system to generate action-controlled game videos across diverse open-domain scenes.
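As a rough illustration of this decoupled strategy, here is a minimal PyTorch-style sketch. The class names (`Backbone`, `ActionControlModule`), dimensions, and learning rates are illustrative assumptions rather than the authors' actual code; the point is only how LoRA adapters, the frozen backbone, and the action module take turns being trainable across the three stages.

```python
import torch
import torch.nn as nn


class LoRALinear(nn.Module):
    """A frozen linear layer plus a trainable low-rank residual (LoRA)."""
    def __init__(self, base: nn.Linear, rank: int = 8):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False                  # pre-trained weights stay frozen
        self.lora_down = nn.Linear(base.in_features, rank, bias=False)
        self.lora_up = nn.Linear(rank, base.out_features, bias=False)
        nn.init.zeros_(self.lora_up.weight)          # adapter starts as a no-op

    def forward(self, x):
        return self.base(x) + self.lora_up(self.lora_down(x))


class Backbone(nn.Module):
    """Toy stand-in for the pre-trained open-domain video diffusion model."""
    def __init__(self, dim=64):
        super().__init__()
        self.block = LoRALinear(nn.Linear(dim, dim))

    def forward(self, x):
        return self.block(x)


class ActionControlModule(nn.Module):
    """Toy stand-in for the module that injects keyboard/mouse conditioning."""
    def __init__(self, dim=64, action_dim=8):
        super().__init__()
        self.embed = nn.Linear(action_dim, dim)

    def forward(self, features, actions):
        return features + self.embed(actions)


backbone, action_module = Backbone(), ActionControlModule()

# Stage 1: fine-tune only the LoRA adapters on game-domain video,
# adapting style while the original parameters stay untouched.
stage1_params = [p for n, p in backbone.named_parameters() if "lora_" in n]
opt1 = torch.optim.AdamW(stage1_params, lr=1e-4)

# Stage 2: freeze backbone and LoRA; train only the action-control module on
# action-labeled game video, so control is learned separately from style.
for p in backbone.parameters():
    p.requires_grad = False
opt2 = torch.optim.AdamW(action_module.parameters(), lr=1e-4)

# Stage 3 (inference): drop the LoRA contribution and keep the action-control
# module, so action-controlled generation transfers to open-domain scenes.
for m in backbone.modules():
    if isinstance(m, LoRALinear):
        nn.init.zeros_(m.lora_up.weight)             # zeroes out the LoRA path
```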
The researchers also evaluated different control mechanisms and found that cross-attention handles discrete control signals such as keyboard inputs better, while concatenation is more effective for continuous mouse movement signals. GameFactory additionally supports autoregressive action control, enabling the generation of interactive game videos of unlimited length. Finally, the team released GF-Minecraft, a high-quality action-labeled video dataset, for training and evaluating the framework.
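The two injection paths compared above (cross-attention for discrete key presses, concatenation for continuous mouse deltas) might look roughly like the following sketch. All names, tensor shapes, and dimensions here are assumptions for illustration, not the paper's implementation.

```python
import torch
import torch.nn as nn


class ActionInjection(nn.Module):
    """Injects discrete key presses via cross-attention and continuous
    mouse movement via feature concatenation."""
    def __init__(self, dim=64, num_keys=16, mouse_dim=2):
        super().__init__()
        self.key_embed = nn.Embedding(num_keys, dim)
        self.cross_attn = nn.MultiheadAttention(dim, num_heads=4, batch_first=True)
        self.mouse_proj = nn.Linear(dim + mouse_dim, dim)

    def forward(self, video_tokens, key_ids, mouse_delta):
        # video_tokens: (B, T, dim); key_ids: (B, K); mouse_delta: (B, T, 2)
        keys = self.key_embed(key_ids)                        # (B, K, dim)
        attended, _ = self.cross_attn(video_tokens, keys, keys)
        tokens = video_tokens + attended                      # cross-attention path
        tokens = self.mouse_proj(torch.cat([tokens, mouse_delta], dim=-1))
        return tokens                                         # concatenation path


layer = ActionInjection()
x = torch.randn(2, 10, 64)                 # latent video tokens
keys = torch.randint(0, 16, (2, 4))        # pressed-key indices (e.g. W/A/S/D)
mouse = torch.randn(2, 10, 2)              # per-frame mouse movement (dx, dy)
print(layer(x, keys, mouse).shape)         # torch.Size([2, 10, 64])
```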
Paper: https://arxiv.org/abs/2501.08325
Key Points:
🌟 The GameFactory framework was jointly developed by the University of Hong Kong and Kuaishou Technology to solve the scene generalization problem in game video generation.
🎮 This framework utilizes pre-trained video diffusion models to generate diverse game scenes and employs a three-stage training strategy to decouple game-style adaptation from action control.
📊 The researchers also released the action-labeled video dataset GF-Minecraft to support the training and evaluation of GameFactory.