High-fidelity image generation has seen remarkable advancements with the advent of diffusion models, yet achieving precise, fine-grained control over specific semantic attributes within the generated images remains a significant challenge. Current models often struggle with disentangling complex semantic factors, leading to limited controllability or degradation in output quality when detailed modifications are attempted. This paper introduces Semantic Latent Diffusion (SLD), a novel framework designed to enhance fine-grained semantic control in high-fidelity image synthesis. SLD integrates a semantically rich latent space with the denoising process of latent diffusion models. By explicitly encoding and manipulating semantic information, such as object presence, attributes, and spatial relationships, within the latent representation, SLD empowers users with granular control over image generation without sacrificing visual quality. We propose a multi-modal conditioning mechanism that leverages textual prompts, semantic masks, and object-level tags to guide the diffusion process. Our approach demonstrates superior performance in generating images with user-specified semantic details, exhibiting improved attribute accuracy and compositional fidelity compared to state-of-the-art methods. This work paves the way for more intuitive and powerful human-AI interaction in creative and practical image generation tasks.
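The multi-modal conditioning mechanism described above — textual prompts, semantic masks, and object-level tags jointly guiding the denoising process — might be wired together along the following lines. This is an illustrative sketch only, not the paper's implementation: the function name `fuse_conditioning`, the fixed embedding sizes, and the random projection standing in for a learned mask encoder are all assumptions.

```python
import numpy as np

def fuse_conditioning(text_emb, sem_mask, tag_embs):
    """Fuse text, mask, and tag signals into one conditioning vector.

    text_emb : (D,) prompt embedding
    sem_mask : (H, W) integer semantic mask (per-pixel region labels)
    tag_embs : (K, D) embeddings of K object-level tags
    Returns a (3*D,) conditioning vector for the denoiser.
    """
    D = text_emb.shape[0]
    # Pool the semantic mask into a class-frequency histogram, then
    # project it to D dims with a fixed random matrix (a stand-in for
    # a learned mask encoder; assumes at most 16 semantic classes).
    labels, counts = np.unique(sem_mask, return_counts=True)
    hist = np.zeros(16)
    hist[labels % 16] = counts / sem_mask.size
    rng = np.random.default_rng(0)
    W_mask = rng.standard_normal((16, D)) / np.sqrt(16)
    mask_emb = hist @ W_mask
    # Mean-pool the object-tag embeddings into a single vector.
    tag_emb = tag_embs.mean(axis=0)
    # Concatenate the three modalities; a cross-attention layer in the
    # denoiser would consume this as its conditioning input.
    return np.concatenate([text_emb, mask_emb, tag_emb])

cond = fuse_conditioning(
    text_emb=np.ones(8),
    sem_mask=np.zeros((4, 4), dtype=int),
    tag_embs=np.ones((2, 8)),
)
print(cond.shape)  # (24,)
```

In a real latent diffusion model each modality would pass through its own learned encoder before fusion; the concatenation here merely shows how the three control signals can coexist in one conditioning vector.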
Haoying Sun, Jianfei Zhang, Chen Li, Yuanxin Ouyang, Wenge Rong