Deep floyd

A highly realistic text-to-image model

Tags: Common Product, Image, Text-to-image, Image synthesis
Deep floyd (DeepFloyd IF) is an open-source text-to-image model with strong photorealism and language understanding. It consists of a frozen text encoder and three cascaded pixel diffusion modules: a base model that generates 64x64 pixel images from text prompts, followed by two super-resolution models that raise the resolution to 256x256 and then 1024x1024 pixels. Every stage uses the same frozen T5 transformer text encoder to extract text embeddings, which are fed into a UNet architecture enhanced with cross-attention and attention pooling. The model achieves a zero-shot FID score of 6.66 on the COCO dataset, which its authors report as surpassing the state-of-the-art models of the time. The work highlights the potential of larger UNet architectures in the first stage of cascaded diffusion models and points to a promising future for text-to-image synthesis.
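
The cascade described above can be sketched as a small piece of bookkeeping code. This is only an illustration of the resolution schedule, not the model's actual implementation: the `Stage` class, the `CASCADE` list, and both helper functions are hypothetical names introduced here, and only the three resolutions (64, 256, 1024) come from the description.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Stage:
    """One module of the cascade (hypothetical helper, not a library class)."""
    name: str
    out_size: int  # side length of the square output image, in pixels


# Resolutions come from the description; the stage names are illustrative.
CASCADE: List[Stage] = [
    Stage("base", 64),
    Stage("super-resolution 1", 256),
    Stage("super-resolution 2", 1024),
]


def upscale_factors(stages: List[Stage]) -> List[int]:
    """Side-length ratio between each stage and the stage before it."""
    return [nxt.out_size // prev.out_size for prev, nxt in zip(stages, stages[1:])]


def pixel_counts(stages: List[Stage]) -> List[int]:
    """Total number of pixels each stage has to produce."""
    return [stage.out_size ** 2 for stage in stages]


if __name__ == "__main__":
    for stage, pixels in zip(CASCADE, pixel_counts(CASCADE)):
        print(f"{stage.name}: {stage.out_size}x{stage.out_size} ({pixels} pixels)")
    print("upscale factors:", upscale_factors(CASCADE))
```

Each super-resolution step multiplies the side length by 4, so the final stage emits 256 times as many pixels as the base stage. That gap is one way to see why the text-conditioned generation happens at low resolution first, with detail added in later stages.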

Deep floyd Visit Over Time

Monthly Visits: 515,580,771
Bounce Rate: 37.20%
Pages per Visit: 5.8
Visit Duration: 00:06:42

Deep floyd Alternatives