GameGen-X
A diffusion model for generating and controlling open-world game videos.
Common Product, Programming, Game Generation, Interactive Control
GameGen-X is a diffusion model designed for generating and interactively controlling open-world game videos. It achieves high-quality, open-domain video generation by simulating a range of game-engine features, such as innovative characters, dynamic environments, complex actions, and diverse events. It also provides interactive control: given the current video segment, it can predict and alter future content, which simulates gameplay.
To realize this vision, we collected and built an open-world video game dataset (OGameData) from scratch. This dataset is the first and largest of its kind for open-world video generation and control, comprising over a million diverse game video clips from more than 150 games, each paired with an informative caption generated by GPT-4o.
GameGen-X was trained in two phases: foundation model pre-training and instruction tuning. In the first phase, the model was pre-trained on text-to-video generation and video continuation, which equips it to generate long sequences of high-quality, open-domain game video. To add interactive control, we developed InstructNet, which integrates game-related multimodal control signal experts. InstructNet lets the model adjust its latent representations according to user input, unifying character interaction and scene content control in video generation for the first time. During instruction tuning, only InstructNet was updated while the pre-trained foundation model remained frozen, so adding interactive control did not compromise the diversity and quality of the generated video content.
GameGen-X represents a significant leap in video game design using generative models, demonstrating the potential of these models as complementary tools to traditional rendering techniques, effectively combining creative generation with interactive abilities.
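The instruction tuning scheme described above (update only the control module, keep the pre-trained model frozen) can be illustrated with a toy sketch. This is not the InstructNet architecture: the scalar "foundation model", the two-parameter adapter, and every name below (`base_latent`, `instruct_adjust`, `instruction_tune`, `make_batch`) are hypothetical and chosen only to show the training pattern in plain Python.

```python
def base_latent(x, base_w):
    # Stand-in for the frozen foundation model: maps an input to a latent value.
    return base_w * x

def instruct_adjust(latent, control, adapter):
    # Hypothetical adapter: rescales and shifts the latent using a control signal.
    scale, shift = adapter
    return latent * (1.0 + scale * control) + shift * control

def make_batch(base_w, true_scale=0.5, true_shift=0.3):
    # Synthetic (input, control, target) triples; control alternates between 0 and 1.
    batch = []
    for i in range(16):
        x = i / 8 - 1.0
        control = float(i % 2)
        target = instruct_adjust(base_latent(x, base_w), control,
                                 (true_scale, true_shift))
        batch.append((x, control, target))
    return batch

def loss_and_grads(batch, base_w, adapter):
    # Mean squared error and its gradients with respect to the adapter only.
    # No gradient is computed for base_w, so the base model cannot change.
    loss = g_scale = g_shift = 0.0
    for x, control, target in batch:
        z = base_latent(x, base_w)
        err = instruct_adjust(z, control, adapter) - target
        loss += err * err
        g_scale += 2.0 * err * z * control
        g_shift += 2.0 * err * control
    n = len(batch)
    return loss / n, g_scale / n, g_shift / n

def instruction_tune(batch, base_w, adapter, lr=0.05, steps=200):
    # Plain gradient descent over the adapter parameters.
    history = []
    for _ in range(steps):
        loss, g_scale, g_shift = loss_and_grads(batch, base_w, adapter)
        history.append(loss)
        adapter = (adapter[0] - lr * g_scale, adapter[1] - lr * g_shift)
    return adapter, history

if __name__ == "__main__":
    base_w = 2.0  # "pre-trained" weight, never updated
    tuned, history = instruction_tune(make_batch(base_w), base_w, (0.0, 0.0))
    print("adapter:", tuned, "first loss:", history[0], "last loss:", history[-1])
```

Because the adapter starts at `(0.0, 0.0)`, the adjusted latent initially equals the base latent, so tuning begins from the unmodified behavior of the pre-trained model. Inputs with a zero control signal pass through unchanged even after tuning, which mirrors the goal of adding control without degrading the base model's output.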