In the continuing advance of artificial intelligence, diffusion models are demonstrating increasingly strong reasoning capabilities, no longer merely trailing behind autoregressive models. Researchers from UCLA and Meta have introduced d1, a framework that combines supervised fine-tuning (SFT) with reinforcement learning (RL) to substantially improve the reasoning abilities of diffusion models, spanning both mathematical and logical reasoning.


The d1 framework uses a two-stage post-training strategy to improve the performance of masked diffusion large language models (dLLMs). In the first stage, the model is supervised fine-tuned on high-quality reasoning trajectories, acquiring foundational knowledge and logical reasoning skills. In the second stage, the researchers apply diffu-GRPO, a novel policy gradient method tailored to masked dLLMs, to further sharpen the model's reasoning.
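diffu-GRPO adapts the GRPO family of policy gradient methods to masked dLLMs. A core ingredient of GRPO-style training is a critic-free advantage: several completions are sampled per prompt, and each completion's reward is normalized against its own group. A minimal sketch of that computation (the function name and epsilon value are illustrative, not from the paper):

```python
import numpy as np

def group_relative_advantages(rewards):
    """GRPO-style advantage: normalize each sampled completion's reward
    by the mean and standard deviation of its own group, so no learned
    value function (critic) is needed."""
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + 1e-8)  # epsilon avoids divide-by-zero

# Example: four completions for one prompt, rewarded 1 for a correct
# final answer and 0 otherwise.
adv = group_relative_advantages([1.0, 0.0, 1.0, 0.0])
```

Completions scoring above the group mean receive positive advantages and are reinforced; those below are suppressed. The actual diffu-GRPO objective adds further machinery specific to masked diffusion decoding.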

Compared with prior work, d1 targets the challenges of reinforcement learning post-training for diffusion models. Autoregressive models can optimize directly on the log probability of a generated sequence, but for dLLMs that quantity is expensive to compute because generation proceeds through many iterative denoising steps. To overcome this, the research team developed an efficient log-probability estimator that scores each token independently, drastically cutting computation time and improving training efficiency.

In experiments, the researchers used LLaDA-8B-Instruct as the base model and compared d1-LLaDA against variants trained with SFT alone or diffu-GRPO alone. Across a range of mathematical and logical reasoning benchmarks, d1-LLaDA significantly outperformed both the base model and the single-method variants, indicating that the two training stages are complementary rather than redundant.

With the introduction of the d1 framework, the performance of diffusion models in reasoning tasks is poised for a significant leap, opening up vast avenues for future research. The researchers believe this innovative framework will propel the further development of language models, facilitating more complex reasoning and logical tasks.

Project Address: https://dllm-reasoning.github.io/