Despite considerable interest and rapid progress in image generation and editing with diffusion models, a critical challenge persists: balancing style transfer against content preservation. Addressing this challenge is crucial to the overall success and usability of diffusion-based image editing tools, whether text-driven or image-driven. One way to tackle it is to identify regions of a given image that carry a high level of content, i.e., regions containing more information than the rest of the image, and to preserve those regions. To this end, we propose a method that anchors representative points in these regions for both the source image and the generated image. The self-attention mechanism deliberately selects queries and produces features at these anchor points, and we then apply contrastive learning in a self-supervised manner. This enables our method to generate an image that retains the important content of the source image while transferring the style. The proposed method requires neither additional fine-tuning nor auxiliary networks: it uses a conventional diffusion model without any fine-tuning for content preservation. Since fine-tuning would normally require an additional network, our approach speeds up inference compared to other diffusion methods. Our experiments demonstrate the superior performance of our approach, particularly in preserving image content during editing, and show a notable advantage over both other diffusion models and GAN-based models.
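The core idea above can be illustrated with a minimal sketch. The snippet below is not the paper's implementation; it assumes a specific, hypothetical anchor-selection criterion (mean absolute activation as a saliency proxy) and an InfoNCE-style contrastive loss that treats matching anchor locations in the source and generated feature maps as positive pairs:

```python
import numpy as np

def select_anchor_points(feature_map, num_anchors=8):
    """Pick spatial locations with the highest mean |activation| as
    content-rich anchor points (a hypothetical saliency criterion).
    feature_map: array of shape (C, H, W). Returns flat indices."""
    saliency = np.abs(feature_map).mean(axis=0).ravel()
    return np.argsort(saliency)[-num_anchors:]

def anchor_contrastive_loss(src_feats, gen_feats, anchors, temperature=0.1):
    """InfoNCE-style loss: the generated feature at each anchor should
    match the source feature at the same anchor (positive pair) and
    differ from the other anchors (negatives)."""
    # Gather (K, C) feature vectors at the anchor locations, L2-normalize
    src = src_feats.reshape(src_feats.shape[0], -1)[:, anchors].T
    gen = gen_feats.reshape(gen_feats.shape[0], -1)[:, anchors].T
    src = src / np.linalg.norm(src, axis=1, keepdims=True)
    gen = gen / np.linalg.norm(gen, axis=1, keepdims=True)
    logits = gen @ src.T / temperature            # (K, K) similarity matrix
    logits -= logits.max(axis=1, keepdims=True)   # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    # Positives lie on the diagonal; cross-entropy reduces to -log p[i, i]
    return -np.mean(np.diag(log_probs))

# Toy usage on random feature maps standing in for attention features
rng = np.random.default_rng(0)
src = rng.standard_normal((64, 16, 16))
gen = src + 0.05 * rng.standard_normal((64, 16, 16))  # mildly perturbed copy
anchors = select_anchor_points(src, num_anchors=8)
loss = anchor_contrastive_loss(src, gen, anchors)
```

In this sketch, the loss is small when the generated features at the anchors stay close to the source features, which is the self-supervised signal driving content preservation.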
Masud An-Nur Islam Fahim, Nazmus Saqib, Jani Boutellier
Yilong Liu, Hanyu Zheng, Shuojin Yang