Recent advances in text-to-image diffusion models have transformed digital content creation, yet a critical challenge persists: achieving precise, disentangled compositional semantic control. While these models excel at generating aesthetically pleasing images from broad textual prompts, they often struggle with fine-grained control over individual object attributes, spatial relationships, and the coherent integration of multiple semantic elements within a single scene. This paper introduces Disentangled Latent Diffusion (DLD), a framework that addresses these limitations by explicitly separating distinct semantic factors within the latent space of a diffusion model. Our approach integrates a specialized disentanglement module that encourages the formation of independent latent dimensions corresponding to object identity, attributes, pose, and spatial location. This disentangled representation is then harnessed by a hierarchical compositional control mechanism that lets users specify prompts at varying granularities, from global scene descriptions to precise manipulation of individual components. Through a multi-stage training strategy combining self-supervised disentanglement objectives with a novel compositional consistency loss, DLD substantially improves the model's ability to interpret and execute complex compositional instructions. Extensive quantitative and qualitative evaluations show that DLD achieves superior fidelity, semantic alignment, and, crucially, markedly finer-grained compositional control than state-of-the-art baselines. This work is a step toward more intuitive and controllable high-fidelity image synthesis, enabling advanced creative and professional applications.
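To make the architectural idea concrete, the following is a minimal PyTorch sketch of what a factor-partitioned latent and a compositional consistency penalty could look like. The abstract does not specify the implementation: the class and function names, the block sizes, the linear projection heads, and the L2 form of the consistency term are all illustrative assumptions, not the paper's actual design.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class DisentanglementModule(nn.Module):
    """Hypothetical sketch of a disentanglement module: the latent is
    re-encoded as a concatenation of independent blocks for object
    identity, attributes, pose, and spatial location. Block sizes and
    the per-factor linear heads are assumptions for illustration."""

    def __init__(self, latent_dim=512, block_dims=(192, 128, 96, 96)):
        super().__init__()
        assert sum(block_dims) == latent_dim
        # One projection head per semantic factor; each head reads the
        # full latent but writes only its own factor-specific block.
        self.heads = nn.ModuleList(
            nn.Linear(latent_dim, d) for d in block_dims
        )

    def forward(self, z):
        # Returns the re-encoded latent and its per-factor blocks.
        blocks = [head(z) for head in self.heads]
        return torch.cat(blocks, dim=-1), blocks


def compositional_consistency_loss(blocks_edit, blocks_ref, edited_idx):
    """Assumed stand-in for the compositional consistency loss: after
    editing one semantic factor (e.g. pose), every other factor block
    should remain close to its value in the reference latent."""
    loss = z = 0.0
    for i, (b_e, b_r) in enumerate(zip(blocks_edit, blocks_ref)):
        if i != edited_idx:
            loss = loss + F.mse_loss(b_e, b_r)
    return loss


if __name__ == "__main__":
    module = DisentanglementModule()
    z = torch.randn(4, 512)
    _, blocks_ref = module(z)                            # reference scene
    _, blocks_edit = module(z + 0.1 * torch.randn_like(z))  # perturbed scene
    # Penalize drift in all factors except pose (index 2 here, assumed).
    print(compositional_consistency_loss(blocks_edit, blocks_ref, edited_idx=2))
```

Under these assumptions, the penalty is what would let a user change one factor (say, an object's pose) while the training signal keeps identity, attributes, and location fixed; the actual loss and latent layout used by DLD may differ.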