GLIGEN (Grounded-Language-to-Image Generation) is an open-set grounded text-to-image generation model: in addition to a text prompt, it accepts grounding conditions such as bounding boxes and generates images that follow both. It achieves this by freezing the parameters of a pre-trained text-to-image diffusion model and inserting new trainable layers that inject the grounding information. This modular design keeps training efficient, since only the inserted layers are updated, and leaves inference flexible. GLIGEN supports open-world conditional image generation and generalizes well to new concepts and layouts.
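The idea of adding to a frozen model without disturbing it can be illustrated with a small sketch. It assumes a tanh gate on a learnable scalar that starts at zero, so the added branch contributes nothing at first and the frozen model's output is preserved. The function name `gated_residual`, the variable names, and the use of plain lists in place of feature tensors are all hypothetical and chosen for illustration; this is not GLIGEN's actual implementation.

```python
import math


def gated_residual(frozen_features, grounding_update, gamma):
    """Add a gated update to features produced by a frozen layer.

    Computes frozen_features + tanh(gamma) * grounding_update element-wise.
    With gamma == 0 the gate is exactly 0, so the frozen output passes
    through unchanged. (Hypothetical helper, for illustration only.)
    """
    gate = math.tanh(gamma)
    return [f + gate * u for f, u in zip(frozen_features, grounding_update)]


# Stand-in for features from a frozen, pre-trained layer (assumed values).
frozen_output = [0.5, -1.0, 2.0]
# Stand-in for the output of a newly inserted, trainable layer (assumed values).
grounding_update = [0.3, 0.3, -0.4]

# At initialization (gamma = 0) the new branch is switched off.
unchanged = gated_residual(frozen_output, grounding_update, gamma=0.0)

# Once gamma has moved away from zero, the grounding signal is mixed in.
adapted = gated_residual(frozen_output, grounding_update, gamma=1.0)

print(unchanged)
print(adapted)
```

Starting the gate at zero means training begins from the behavior of the pre-trained model, and the influence of the new layers grows only as the gate parameter is learned.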