Deep floyd
A highly realistic text-to-image model
•Common Product•Image•Text-to-image•Image synthesis
Deep floyd is an open-source text-to-image model that combines a high degree of photorealism with strong language understanding. It consists of a frozen text encoder and three cascaded pixel diffusion modules: a base model generates a 64x64 pixel image from the text prompt, and two super-resolution models then upscale it in turn to 256x256 and 1024x1024 pixels. Every stage uses the same frozen T5 transformer text encoder to extract text embeddings, which are fed into a UNet architecture enhanced with cross-attention and attention pooling. The model is reported to achieve a zero-shot FID score of 6.66 on the COCO dataset, which its authors describe as outperforming then-current state-of-the-art models. According to the authors, the work highlights the potential of larger UNet architectures in the first stage of cascaded diffusion models and points to a promising future for text-to-image synthesis.
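The data flow described above (one frozen text embedding shared by three stages at 64, 256, and 1024 pixels) can be illustrated with a toy sketch. This is not the model's actual code or API: the stage names, `encode_text`, `run_stage`, and `generate` are hypothetical placeholders, the "embedding" is a stand-in for T5 output, and the diffusion steps are replaced by a constant fill and nearest-neighbour upscaling so that the example runs with the standard library alone.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

Image = List[List[float]]  # toy single-channel image: rows of pixel values


@dataclass(frozen=True)
class Stage:
    name: str      # hypothetical label, not an official module name
    out_size: int  # the stage outputs an out_size x out_size image


# Resolutions come from the description above; everything else is illustrative.
CASCADE = (
    Stage("base", 64),
    Stage("super_resolution_1", 256),
    Stage("super_resolution_2", 1024),
)


def encode_text(prompt: str) -> Tuple[float, ...]:
    """Stand-in for the frozen T5 encoder (assumption: NOT a real embedding)."""
    return tuple(ord(ch) / 255.0 for ch in prompt)


def run_stage(stage: Stage, embedding: Tuple[float, ...],
              image: Optional[Image]) -> Image:
    """Stand-in for one diffusion stage, conditioned on the text embedding."""
    if image is None:
        # Base stage: a real model would denoise from noise; here we just
        # fill the image with a value derived from the embedding.
        value = sum(embedding) / max(len(embedding), 1)
        return [[value] * stage.out_size for _ in range(stage.out_size)]
    # Super-resolution stage: nearest-neighbour upscaling as a placeholder.
    factor = stage.out_size // len(image)
    return [
        [px for px in row for _ in range(factor)]
        for row in image
        for _ in range(factor)
    ]


def generate(prompt: str) -> Tuple[Image, List[int]]:
    embedding = encode_text(prompt)  # computed once, reused by every stage
    image: Optional[Image] = None
    sizes: List[int] = []
    for stage in CASCADE:
        image = run_stage(stage, embedding, image)
        sizes.append(len(image))
    return image, sizes


if __name__ == "__main__":
    final_image, sizes = generate("a red fox in the snow")
    print(sizes)  # side length of the image after each stage
```

The point of the sketch is the structure: the text is encoded once, the base stage starts from nothing but the embedding, and each later stage takes the previous stage's image and upscales it by a factor of four.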
Deep floyd Visit Over Time
Monthly Visits: 492,133,528
Bounce Rate: 36.20%
Pages per Visit: 6.1
Visit Duration: 00:06:33
Deep floyd Alternatives

Meissonic — High-resolution text-to-image synthesis model
•Text-to-Image Synthesis•High-Resolution
252

FLUX.1-dev — A text-to-image generation model with 12 billion parameters
•Image Generation•AI Art
606