Text-driven style transfer is an important task in the field of image synthesis, aiming to merge the style of a reference image with the content described by a text prompt. Recent advances in text-to-image generation models have enabled finer-grained style transfer while maintaining high fidelity to the content. This technology holds great practical value in areas such as digital painting, advertising, and game design.
However, existing style transfer techniques still fall short in several ways; the main challenges include:
Style overfitting: Current models tend to replicate every element of the reference image, so the generated results hew too closely to the reference style image, limiting their aesthetic flexibility and adaptability.
Inaccurate text alignment: The model may prioritize the dominant colors or patterns of the reference image, even if these elements contradict the instructions in the text prompt.
Generation artifacts: Style transfer may introduce unwanted artifacts, such as repeated patterns (like checkerboard effects), disrupting the overall layout of the image.
To address these issues, researchers have proposed three complementary strategies:
AdaIN-based cross-modal fusion: Utilizing the **Adaptive Instance Normalization (AdaIN)** mechanism to inject style-image features into the text features, which are then fused with the image features. This adaptive fusion yields a more cohesive guiding feature that aligns style information more harmoniously with the textual instructions: AdaIN integrates style into content by adjusting content features to reflect style statistics, while preserving consistency between content and the text description (a minimal sketch of this fusion follows this list).
Style-based Classifier-Free Guidance (SCFG): Developing a style-guidance scheme that focuses on the target style and suppresses unneeded style features. A layout-controlled generative model (such as ControlNet) produces a "negative" image that shares the layout but lacks the target style. This negative image plays a role analogous to the "empty" prompt in classifier-free guidance for diffusion models, so the guidance direction concentrates entirely on the target style elements (see the guidance sketch after this list).
Using a teacher model for layout stabilization: Introducing a teacher model during the early stages of generation. The teacher is the original text-to-image model, run with the same text prompt in parallel with the stylized model, and it shares its spatial attention maps with the stylized model at each timestep. This keeps the spatial layout stable and consistent, effectively mitigating issues such as checkerboard artifacts, and it also yields consistent layouts for the same text prompt across different style reference images.
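The AdaIN operation itself is standard: it re-normalizes one set of features with the channel-wise statistics of another, AdaIN(x, y) = σ(y)·(x − μ(x))/σ(x) + μ(y). The sketch below shows what such a cross-modal fusion might look like in PyTorch; the tensor shapes and the exact point at which the fused embedding enters the model are assumptions for illustration, not the paper's implementation.

```python
import torch

def adain(content: torch.Tensor, style: torch.Tensor, eps: float = 1e-5) -> torch.Tensor:
    """Shift content features to match the channel-wise mean/std of style features.

    Both tensors are assumed to be shaped (batch, tokens, channels);
    statistics are taken over the token dimension.
    """
    c_mean, c_std = content.mean(dim=1, keepdim=True), content.std(dim=1, keepdim=True) + eps
    s_mean, s_std = style.mean(dim=1, keepdim=True), style.std(dim=1, keepdim=True) + eps
    return s_std * (content - c_mean) / c_std + s_mean

# Hypothetical fusion step: the text embedding is re-normalized with the style
# image's feature statistics before being handed to cross-attention, so the
# guidance signal carries style information without discarding the prompt.
text_emb = torch.randn(1, 77, 768)    # CLIP-style text embedding (assumed shape)
style_emb = torch.randn(1, 257, 768)  # style-image embedding (assumed shape)
fused_emb = adain(text_emb, style_emb)
```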
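For SCFG, the idea can be expressed with the usual classifier-free-guidance extrapolation, except that the "unconditional" branch is conditioned on the negative style image. The following is a minimal sketch under assumed names (`denoiser`, the embedding shapes, and the guidance weight `w` are placeholders, not the paper's API).

```python
import torch

def scfg_noise_estimate(denoiser, latents, t, text_emb, style_emb, neg_style_emb, w=5.0):
    """Style-based classifier-free guidance (sketch).

    The usual unconditional branch of CFG is replaced by a branch conditioned
    on a "negative" style image (e.g. produced with a layout-controlled model
    such as ControlNet so that it keeps the layout but lacks the target style).
    The extrapolation then points only along the desired style direction.
    """
    eps_style = denoiser(latents, t, text_emb, style_emb)    # target-style branch
    eps_neg = denoiser(latents, t, text_emb, neg_style_emb)  # negative-style branch
    return eps_neg + w * (eps_style - eps_neg)               # CFG-style extrapolation

# Dummy denoiser just to show the call pattern; a real pipeline would call its UNet here.
dummy = lambda latents, t, text, style: latents + 0.01 * style.mean()
eps = scfg_noise_estimate(dummy, torch.randn(1, 4, 64, 64), t=10,
                          text_emb=torch.randn(1, 77, 768),
                          style_emb=torch.randn(1, 257, 768),
                          neg_style_emb=torch.randn(1, 257, 768))
```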
Researchers have validated the effectiveness of these methods through extensive experiments. The results indicate that the method significantly improves the quality of style transfer in generated images while maintaining consistency with the text prompts. More importantly, the method can be integrated into existing style transfer frameworks without the need for fine-tuning.
Through experiments, the researchers found that instability in the cross-attention mechanism leads to artifacts, whereas the self-attention mechanism plays a crucial role in preserving the layout and spatial structure of the image, capturing high-level spatial relationships that stabilize the fundamental layout during generation. By selectively replacing certain self-attention maps in the stylized image, the spatial relationships of key features can be preserved, ensuring that the core layout remains consistent throughout the denoising process.
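One way to picture this attention sharing: the attention map is computed from the teacher's queries and keys (same prompt, base model), while the values come from the stylized run, so the teacher fixes *where* features go and the student decides *what* is drawn. The sketch below illustrates this under assumed tensor shapes; which layers and timesteps to replace is a design choice not specified here.

```python
import math
import torch
import torch.nn.functional as F

def teacher_guided_self_attention(q_teacher, k_teacher, v_student):
    """Reuse the teacher's self-attention map on the student's values (sketch).

    q_teacher, k_teacher come from a parallel denoising run of the base
    text-to-image model with the same prompt; v_student comes from the
    stylized run. Shapes are assumed to be (batch, heads, tokens, dim).
    """
    scale = 1.0 / math.sqrt(q_teacher.shape[-1])
    attn = F.softmax(q_teacher @ k_teacher.transpose(-2, -1) * scale, dim=-1)
    return attn @ v_student

# Example with a 32x32 latent grid flattened to 1024 tokens (assumed shapes).
q_t = torch.randn(1, 8, 1024, 64)  # teacher queries
k_t = torch.randn(1, 8, 1024, 64)  # teacher keys
v_s = torch.randn(1, 8, 1024, 64)  # student (stylized) values
out = teacher_guided_self_attention(q_t, k_t, v_s)

# In practice this replacement would be applied only to selected self-attention
# layers and only during the early denoising steps, when the layout is decided;
# later steps fall back to the student's own attention.
```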
Furthermore, Style-based Classifier-Free Guidance (SCFG) effectively addresses style ambiguity by selectively emphasizing the desired style elements while filtering out irrelevant or conflicting features. By generating negative style images with a layout-controlled model, the approach lets the model focus on transferring the intended style components and reduces the risk of overfitting to irrelevant ones.
Researchers also conducted ablation experiments to assess the impact of each component. The results show that both AdaIN-based cross-modal fusion and the teacher model significantly enhance the accuracy of text alignment, and they exhibit complementary effects.
In summary, the methods proposed in this study effectively mitigate the issues of style overfitting and layout instability present in existing text-driven style transfer technologies, enabling higher quality image generation and providing a versatile and powerful solution for text-to-image synthesis tasks.
Paper link: https://arxiv.org/pdf/2412.08503