In image generation, multi-layer techniques are changing how users interact with generative models by letting them isolate, select, and edit individual image layers. Microsoft researchers recently introduced the "Anonymous Region Transformer" (ART), which directly generates variable multi-layer transparent images from a global text prompt and an anonymous region layout.


ART's design is inspired by "schema theory." With an anonymous region layout, the generative model decides on its own which visual information aligns with which part of the text. This contrasts sharply with traditional semantic layouts, which typically require explicit correspondences between regions and text, and it gives ART's layout greater flexibility.
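To make the contrast concrete, here is a minimal sketch of how the two kinds of layout might be represented as data. The class and field names (`SemanticLayout`, `AnonymousLayout`, `prompt`, `boxes`) are hypothetical and chosen for illustration; they are not taken from ART's code.

```python
from dataclasses import dataclass
from typing import List, Tuple

Box = Tuple[int, int, int, int]  # (x0, y0, x1, y1) in pixels


@dataclass
class SemanticLayout:
    """Each region carries its own text: the correspondence is explicit."""
    regions: List[Tuple[Box, str]]


@dataclass
class AnonymousLayout:
    """Regions are unlabeled boxes; only one global prompt is given.

    Which part of the prompt each box depicts is left to the model.
    """
    prompt: str
    boxes: List[Box]


# Hypothetical example inputs.
semantic = SemanticLayout(regions=[
    ((0, 0, 512, 512), "a sunny beach"),
    ((100, 200, 300, 480), "a red umbrella"),
])
anonymous = AnonymousLayout(
    prompt="a red umbrella on a sunny beach",
    boxes=[(0, 0, 512, 512), (100, 200, 300, 480)],
)
```

In the anonymous form the user supplies only where layers go, not what each one contains.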

Notably, ART introduces a per-layer region cropping mechanism that selects only the visual information relevant to each anonymous region, significantly reducing the cost of attention computation. This makes generation over 12 times faster than with full-attention methods and also reduces conflicts between layers, allowing ART to generate images with more than 50 distinct layers.
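A rough way to see why cropping helps is to count tokens. The sketch below is illustrative arithmetic only, not ART's implementation: it assumes each layer is split into square patches, that boxes are aligned to the patch grid, and that attention cost grows with the square of the token count. The canvas size, patch size, and boxes are made-up values, and the resulting ratio is not the measured speedup reported for ART.

```python
from typing import List, Tuple

Box = Tuple[int, int, int, int]  # (x0, y0, x1, y1) in pixels, assumed patch-aligned


def tokens_in_box(box: Box, patch: int) -> int:
    """Number of patch tokens covering a box (ceil-divided on each axis)."""
    x0, y0, x1, y1 = box
    cols = -(-(x1 - x0) // patch)
    rows = -(-(y1 - y0) // patch)
    return cols * rows


def token_counts(canvas: Tuple[int, int], boxes: List[Box], patch: int) -> Tuple[int, int]:
    """Return (tokens without cropping, tokens with per-layer cropping)."""
    width, height = canvas
    per_layer_full = tokens_in_box((0, 0, width, height), patch)
    full = per_layer_full * len(boxes)          # every layer spans the whole canvas
    cropped = sum(tokens_in_box(b, patch) for b in boxes)  # only tokens inside each box
    return full, cropped


def attention_pairs(num_tokens: int) -> int:
    """Query-key pairs in full self-attention over num_tokens tokens."""
    return num_tokens * num_tokens


# Hypothetical layout: four 128x128 layers on a 512x512 canvas, 16-pixel patches.
canvas = (512, 512)
boxes = [(0, 0, 128, 128), (128, 128, 256, 256), (256, 0, 384, 128), (384, 384, 512, 512)]
full, cropped = token_counts(canvas, boxes, patch=16)
print("tokens without cropping:", full)
print("tokens with cropping:", cropped)
print("attention pair ratio:", attention_pairs(full) // attention_pairs(cropped))
```

Because the pair count is quadratic in the number of tokens, small layers on a large canvas save far more than their area alone would suggest; real speedups also depend on everything else in the model.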

Furthermore, ART proposes a high-quality multi-layer transparent image autoencoder that jointly encodes and decodes the transparency of a variable number of image layers. This design opens new possibilities for precise control and scalable layer generation in interactive content creation.
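For readers unfamiliar with layered transparent images, the sketch below shows how a stack of RGBA layers flattens into a single pixel using the standard "over" compositing operator. It illustrates the output format only and says nothing about how ART's autoencoder works internally; the function names are hypothetical.

```python
from typing import List, Tuple

RGBA = Tuple[float, float, float, float]  # straight (non-premultiplied) values in [0, 1]


def over(top: RGBA, bottom: RGBA) -> RGBA:
    """Composite one RGBA pixel over another with the standard 'over' operator."""
    tr, tg, tb, ta = top
    br, bg, bb, ba = bottom
    out_a = ta + ba * (1.0 - ta)
    if out_a == 0.0:
        return (0.0, 0.0, 0.0, 0.0)  # both pixels fully transparent

    def blend(t: float, b: float) -> float:
        return (t * ta + b * ba * (1.0 - ta)) / out_a

    return (blend(tr, br), blend(tg, bg), blend(tb, bb), out_a)


def composite(layers: List[RGBA]) -> RGBA:
    """Flatten the same pixel from several layers, listed bottom to top."""
    result: RGBA = (0.0, 0.0, 0.0, 0.0)
    for layer in layers:
        result = over(layer, result)
    return result


# Half-transparent white over opaque black gives mid-gray.
print(composite([(0.0, 0.0, 0.0, 1.0), (1.0, 1.0, 1.0, 0.5)]))
```

Keeping layers separate until this final step is what lets a user move, hide, or edit one layer without regenerating the rest.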

Project: https://art-msra.github.io/

Key Highlights:

🌟 ART directly generates multi-layered transparent images based on global text prompts and anonymous region layouts.

⚡️ A per-layer region cropping mechanism significantly improves image generation efficiency, making it over 12 times faster than full-attention methods.

💡 A novel high-quality autoencoder enables precise control and generation of multi-layered transparent images, advancing interactive content creation.