DeepFloyd
A highly realistic text-to-image model
Tags: Common Product, Image, Text-to-image, Image synthesis
DeepFloyd is an open-source text-to-image model with a high degree of photorealism and language understanding. It consists of a frozen text encoder and three cascaded pixel diffusion modules: a base model that generates 64x64 pixel images from text prompts, and two super-resolution models that raise the resolution first to 256x256 pixels and then to 1024x1024 pixels. Every stage uses a frozen T5 transformer-based text encoder to extract text embeddings, which are fed into a UNet architecture enhanced with cross-attention and attention pooling. The authors report a zero-shot FID score of 6.66 on the COCO dataset, which they describe as better than the state-of-the-art models at the time of release. The work highlights the potential of larger UNet architectures in the first stage of cascaded diffusion models and points to a promising future for text-to-image synthesis.
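The cascade described above can be illustrated with a small toy sketch. It is not the model's real code: `encode_text`, `base_stage`, `super_resolution_stage`, and `generate` are hypothetical stand-ins, the "embedding" is just a hash of the prompt, and the super-resolution step is plain nearest-neighbour upsampling rather than diffusion. The only things it is meant to show are the data flow (one frozen text embedding reused by all three stages) and the resolution schedule of 64, 256, and 1024 pixels.

```python
import hashlib
from typing import List

Image = List[List[float]]  # toy single-channel image as nested lists


def encode_text(prompt: str, dim: int = 8) -> List[float]:
    """Hypothetical stand-in for the frozen T5 encoder.

    Returns a deterministic pseudo-embedding derived from a hash of the prompt.
    """
    digest = hashlib.sha256(prompt.encode("utf-8")).digest()
    return [byte / 255.0 for byte in digest[:dim]]


def base_stage(embedding: List[float], size: int = 64) -> Image:
    """Hypothetical stand-in for the base model: emits a size x size image."""
    n = len(embedding)
    return [[embedding[(x + y) % n] for x in range(size)] for y in range(size)]


def super_resolution_stage(image: Image, embedding: List[float], factor: int = 4) -> Image:
    """Hypothetical stand-in for a super-resolution stage.

    A real stage would be a diffusion model conditioned on both the low-res
    image and the text embedding. This toy only repeats each pixel `factor`
    times in both directions and ignores the embedding.
    """
    upscaled: Image = []
    for row in image:
        wide_row = [pixel for pixel in row for _ in range(factor)]
        upscaled.extend(list(wide_row) for _ in range(factor))
    return upscaled


def generate(prompt: str) -> List[Image]:
    """Run the three-stage cascade and return the image from every stage."""
    embedding = encode_text(prompt)  # computed once, shared by all stages
    stage1 = base_stage(embedding)                      # 64x64
    stage2 = super_resolution_stage(stage1, embedding)  # 256x256
    stage3 = super_resolution_stage(stage2, embedding)  # 1024x1024
    return [stage1, stage2, stage3]


if __name__ == "__main__":
    for index, image in enumerate(generate("a red fox in the snow"), start=1):
        print(f"stage {index}: {len(image[0])}x{len(image)}")
```

Each super-resolution stage here multiplies the side length by 4, which is what takes a 64x64 base image to 256x256 and then to 1024x1024.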
DeepFloyd Visits Over Time
Monthly Visits: 515,580,771
Bounce Rate: 37.20%
Pages per Visit: 5.8
Visit Duration: 00:06:42