A research team from New York University, MIT, and Google has recently proposed a framework aimed at the bottleneck diffusion models face when scaling inference-time compute. The study goes beyond the traditional approach of simply increasing the number of denoising steps, opening a new avenue for enhancing the performance of generative models.
The framework operates along two main dimensions: first, using verifiers to provide feedback, and second, using search algorithms to find better noise candidates. The research team built on the pre-trained SiT-XL model at a resolution of 256×256, keeping the number of denoising steps fixed at 250 and devoting the additional computational budget entirely to search.
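As a rough illustration of this idea, the search can be as simple as drawing several initial noise candidates, denoising each with the same fixed schedule, and keeping whichever sample the verifier scores highest. The sketch below uses hypothetical `sample_fn` and `score_fn` placeholders standing in for the paper's SiT-XL sampler and verifier; it is a minimal illustration, not the authors' exact algorithm.

```python
import torch

def search_over_noise(sample_fn, score_fn, num_candidates: int = 16,
                      latent_shape=(4, 32, 32), device="cpu"):
    """Random search over initial noise: draw several noise candidates, denoise
    each with the same fixed schedule via `sample_fn`, and keep the candidate
    that the verifier `score_fn` rates highest."""
    best_image, best_score = None, float("-inf")
    for _ in range(num_candidates):
        noise = torch.randn(1, *latent_shape, device=device)
        image = sample_fn(noise)          # e.g. SiT-XL with 250 denoising steps
        score = score_fn(image)           # e.g. an IS / FID / ensemble verifier
        if score > best_score:
            best_image, best_score = image, score
    return best_image, best_score

# Toy usage with placeholder functions; in the paper's setting, `sample_fn` would
# wrap the pre-trained SiT-XL 256x256 sampler and `score_fn` an oracle verifier.
if __name__ == "__main__":
    fake_sampler = lambda z: z                      # stands in for the diffusion sampler
    fake_verifier = lambda img: img.mean().item()   # stands in for a real verifier
    best, score = search_over_noise(fake_sampler, fake_verifier)
    print(f"best verifier score among candidates: {score:.4f}")
```

Because the denoising budget per sample is held constant, any quality gain comes purely from spending extra compute on evaluating more candidates.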
In terms of the verification system, the research employed two oracle verifiers: Inception Score (IS) and Fréchet Inception Distance (FID). The IS verifier selects the samples with the highest classification probability under a pre-trained InceptionV3 model, while the FID verifier selects samples that minimize the divergence from pre-computed statistics of ImageNet Inception features.
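For intuition, an IS-style verifier can be approximated by scoring each candidate image with the maximum softmax probability from a pre-trained InceptionV3 classifier, as in this sketch using torchvision's pre-trained weights; the paper's exact preprocessing and scoring details may differ.

```python
import torch
from torchvision.models import inception_v3, Inception_V3_Weights

# Pre-trained InceptionV3 classifier from torchvision; the IS-style verifier below
# scores a candidate image by its highest class probability under this model.
weights = Inception_V3_Weights.IMAGENET1K_V1
inception = inception_v3(weights=weights).eval()
preprocess = weights.transforms()  # standard resize / crop / ImageNet normalization

@torch.no_grad()
def is_verifier_score(image: torch.Tensor) -> float:
    """image: (3, H, W) float tensor in [0, 1]; returns the max softmax probability,
    which a search loop can use as the verifier's feedback score."""
    x = preprocess(image).unsqueeze(0)
    logits = inception(x)
    return torch.softmax(logits, dim=-1).max().item()
```

A score like this plugs directly into the search loop sketched earlier as `score_fn`.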
Experimental results show that the framework performs well across multiple benchmarks. On DrawBench, evaluations by an LLM Grader confirmed that searching with the verifiers consistently improves sample quality. Notably, ImageReward and the Verifier Ensemble achieved gains across all metrics, owing to their nuanced evaluation capabilities and close alignment with human preferences.
This research not only validates the effectiveness of scaling inference-time compute through search but also reveals the inherent biases of different verifiers, pointing the way toward verification systems specialized for visual generation tasks. These findings are significant for improving the overall performance of AI generative models.