Reinforcement learning has achieved many successes in recent years, but its low sample efficiency limits its application in the real world. World models, generative models of an environment's dynamics, offer a promising way to address this: they can serve as simulated environments in which reinforcement learning agents are trained with far fewer real interactions.
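To make the idea concrete, here is a minimal sketch of "training in imagination," where the learned world model stands in for the real environment. The `WorldModel`/`Agent` interfaces are hypothetical placeholders, not any particular system's API:

```python
# Minimal sketch: the learned world model replaces the real environment
# during policy updates. All names (world_model, agent, their methods)
# are hypothetical placeholders.

def imagination_rollout(world_model, agent, start_obs, horizon=15):
    """Roll out a trajectory entirely inside the world model."""
    obs, trajectory = start_obs, []
    for _ in range(horizon):
        action = agent.act(obs)                            # policy picks an action
        obs, reward, done = world_model.step(obs, action)  # model predicts the outcome
        trajectory.append((obs, action, reward, done))
        if done:
            break
    return trajectory
```

The agent's policy is then updated on these imagined trajectories, so each real environment step can be reused for many updates, which is where the sample-efficiency gain comes from.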

Currently, most world models simulate environmental dynamics over sequences of discrete latent variables. However, compressing observations into a compact discrete representation can discard visual details that are crucial for reinforcement learning.
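As an illustration of what this compression looks like, here is a minimal sketch of a categorical latent bottleneck in the style of the Dreamer line of work (32 variables with 32 classes each is Dreamer's configuration; the encoder producing the logits is omitted):

```python
import torch
import torch.nn.functional as F

# Sketch of a discrete latent bottleneck: continuous encoder features are
# quantized into one-hot categorical codes with a straight-through gradient.

def to_categorical_latent(encoder_logits):
    # encoder_logits: (batch, num_vars, num_classes), e.g., 32 x 32
    probs = F.softmax(encoder_logits, dim=-1)
    one_hot = F.one_hot(probs.argmax(-1), probs.shape[-1]).float()
    # Straight-through estimator: hard codes on the forward pass,
    # soft-probability gradients on the backward pass.
    return one_hot + probs - probs.detach()
```

A 32x32 categorical latent carries at most 32 * log2(32) = 160 bits per frame, so fine visual details (small sprites, projectiles) can be lost in the compression.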

Meanwhile, diffusion models have become the dominant approach in image generation, challenging traditional discrete latent variable modeling. Inspired by this, researchers proposed DIAMOND (DIffusion As a Model Of eNvironment Dreams), a reinforcement learning agent trained inside a diffusion world model. DIAMOND makes careful design choices to keep the diffusion model efficient and stable over long time horizons.
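Conceptually, such a world model generates each next frame by iterating a denoiser that is conditioned on recent frames and actions. The sketch below shows a few Euler denoising steps under an illustrative noise schedule; `denoiser` and its signature are assumptions for illustration, not the paper's exact interface:

```python
import torch

# Sketch of next-frame prediction with a conditional diffusion model:
# start from pure noise and take a few denoising steps, each conditioned
# on the recent frames and actions. The sigma schedule is illustrative.

@torch.no_grad()
def sample_next_frame(denoiser, past_frames, past_actions,
                      sigmas=(5.0, 2.0, 0.5, 0.0), shape=(1, 3, 64, 64)):
    x = torch.randn(shape) * sigmas[0]            # start from pure noise
    for sigma, sigma_next in zip(sigmas[:-1], sigmas[1:]):
        denoised = denoiser(x, sigma, past_frames, past_actions)
        d = (x - denoised) / sigma                # Euler step direction
        x = x + d * (sigma_next - sigma)          # step toward lower noise
    return x                                      # predicted next frame
```

The conditioning on past frames and actions is what turns a plain image generator into a model of environment dynamics.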


DIAMOND achieves a mean human-normalized score of 1.46 on the well-established Atari 100k benchmark, the best result to date for agents trained entirely within a world model. Because the model operates directly in image space, the diffusion world model can stand in for the environment as-is, making it easier to inspect and understand the behavior of both the world model and the agent. The researchers found that the performance gains on certain games stem from better modeling of critical visual details.
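For reference, the human-normalized score rescales raw game scores so that 0 corresponds to random play and 1 to a human reference player. The numbers in the example below are made up for illustration, not the paper's per-game results:

```python
# Human-normalized score (HNS) as used on the Atari 100k benchmark.

def human_normalized_score(agent, random, human):
    return (agent - random) / (human - random)

# e.g., an agent scoring 900 where random play gets 100 and a human gets 600:
# (900 - 100) / (600 - 100) = 1.6, i.e., above human level on that game.
```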

The success of DIAMOND is largely attributed to its choice of the EDM framework (from "Elucidating the Design Space of Diffusion-Based Generative Models", Karras et al., 2022). Compared to the traditional DDPM (Denoising Diffusion Probabilistic Models) formulation, EDM remains stable with far fewer denoising steps, avoiding the severe compounding errors that would otherwise accumulate over long autoregressive rollouts.
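The key ingredient is EDM's network preconditioning: the denoiser is a noise-dependent mix of a skip connection and the raw network output. A sketch following the coefficients from Karras et al. (2022), with sigma_data = 0.5 as in that paper; `raw_net` is a hypothetical placeholder:

```python
import math

# EDM preconditioning (Karras et al., 2022, Table 1).
# sigma_data is the assumed standard deviation of the training data.

def edm_precondition(raw_net, x, sigma, sigma_data=0.5, **cond):
    c_skip = sigma_data**2 / (sigma**2 + sigma_data**2)
    c_out = sigma * sigma_data / math.sqrt(sigma**2 + sigma_data**2)
    c_in = 1.0 / math.sqrt(sigma**2 + sigma_data**2)
    c_noise = math.log(sigma) / 4.0
    # The network predicts a residual; the skip connection carries the input.
    return c_skip * x + c_out * raw_net(c_in * x, c_noise, **cond)
```

Because c_skip approaches 1 and c_out approaches 0 as sigma goes to 0, the denoiser degrades gracefully toward the identity at low noise, which is why a handful of denoising steps suffice without errors compounding across a long rollout.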

Furthermore, DIAMOND's diffusion world model can act as an interactive neural game engine: trained on 87 hours of static Counter-Strike: Global Offensive gameplay, it yields a playable neural version of the Dust II map.
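At inference time, such a neural game engine is essentially a loop: read the player's input, sample the next frame from the diffusion model, display it. A hedged sketch with hypothetical interfaces throughout:

```python
# Sketch of the "neural game engine" loop. Every name here
# (world_model, get_player_input, render) is a hypothetical placeholder.

def run_neural_game(world_model, get_player_input, render, steps=1000):
    frames, actions = world_model.initial_context()   # seed with a few real frames
    for _ in range(steps):
        action = get_player_input()                   # live keyboard/mouse input
        next_frame = world_model.sample_next_frame(frames, actions, action)
        render(next_frame)                            # display to the player
        # slide the conditioning window forward by one step
        frames, actions = frames[1:] + [next_frame], actions[1:] + [action]
```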

Looking ahead, DIAMOND could be further improved by integrating more advanced memory mechanisms, such as autoregressive Transformers. Incorporating reward and termination prediction directly into the diffusion model is another promising direction to explore.

Paper link: https://arxiv.org/pdf/2405.12399