FIFO-Diffusion is a novel inference technique for text-conditional video generation built on pre-trained diffusion models. It can generate videos of arbitrary length without additional training by iteratively performing diagonal denoising, which simultaneously processes a series of consecutive frames held in a queue at increasing noise levels from head to tail. At each iteration, the method dequeues a fully denoised frame from the head and enqueues a new random-noise frame at the tail. Additionally, latent partitioning is introduced to reduce the training-inference gap, and lookahead denoising is used to exploit the benefit of forward referencing.
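The queue mechanics described above can be illustrated with a minimal toy sketch. It is not the authors' implementation: `toy_denoise_step`, `new_noise_frame`, and `fifo_generate` are hypothetical names, each frame is a short list of floats rather than a video latent, and the "denoiser" simply scales values toward zero instead of calling a diffusion model. The initial queue is also filled with plain random noise for simplicity, which is an assumption of this sketch. Only the first-in-first-out bookkeeping is meant to be representative: every frame in the queue advances by one noise level per iteration, the head leaves once it reaches level 0, and fresh noise enters at the tail.

```python
import random
from collections import deque


def toy_denoise_step(latent, level):
    """Hypothetical stand-in for one model step: level -> level - 1.

    A real system would call a pre-trained video diffusion model here.
    """
    return [x * (level - 1) / level for x in latent]


def new_noise_frame(dim, rng):
    """Draw a fresh Gaussian-noise frame (toy latent of size `dim`)."""
    return [rng.gauss(0.0, 1.0) for _ in range(dim)]


def fifo_generate(num_frames, queue_len=4, dim=3, seed=0):
    """Emit `num_frames` frames while keeping only `queue_len` in memory."""
    rng = random.Random(seed)
    # Head (left) is the cleanest frame at level 1; the tail (right) is
    # the noisiest at level `queue_len`. Assumption: warm start from noise.
    queue = deque(
        (new_noise_frame(dim, rng), level) for level in range(1, queue_len + 1)
    )
    outputs = []
    while len(outputs) < num_frames:
        # Diagonal denoising: one step for every frame, each at its own level.
        queue = deque(
            (toy_denoise_step(latent, level), level - 1) for latent, level in queue
        )
        # The head has now reached level 0, so it is dequeued as output.
        head_latent, head_level = queue.popleft()
        assert head_level == 0
        outputs.append(head_latent)
        # Enqueue a new random-noise frame at the tail, at the highest level.
        queue.append((new_noise_frame(dim, rng), queue_len))
    return outputs


frames = fifo_generate(num_frames=10)
print(len(frames))
```

Because the queue length never changes, the memory footprint in this sketch is independent of how many frames are requested, which mirrors the reason the method can run for an arbitrary number of iterations.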