Recent advances in text-to-image diffusion models have transformed digital content creation, yet a critical challenge persists: achieving precise, disentangled compositional semantic control. While these models excel at generating aesthetically pleasing images from broad textual prompts, they often struggle with fine-grained control over individual object attributes, spatial relationships, and the coherent integration of multiple semantic elements within a single scene. This paper introduces Disentangled Latent Diffusion (DLD), a framework that addresses these limitations by explicitly separating distinct semantic factors within the latent space of a diffusion model. Our approach integrates a specialized disentanglement module that encourages the formation of independent latent dimensions corresponding to object identity, attributes, pose, and spatial location. This disentangled representation is then harnessed by a hierarchical compositional control mechanism that lets users specify prompts at varying granularities, from global scene descriptions to precise manipulation of individual components. Through a multi-stage training strategy combining self-supervised disentanglement objectives with a novel compositional consistency loss, DLD substantially improves the model's ability to interpret and execute complex compositional instructions. Extensive quantitative and qualitative evaluations show that DLD achieves superior fidelity, semantic alignment, and, crucially, markedly finer-grained compositional control than state-of-the-art baselines. This work is a step toward more intuitive and controllable high-fidelity image synthesis, enabling advanced creative and professional applications.
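To make the architectural idea concrete, the following is a minimal PyTorch sketch of what a factor-partitioned latent and a compositional consistency penalty could look like. The abstract does not specify the implementation: the class and function names, the block sizes, the linear projection heads, and the L2 form of the consistency term are all illustrative assumptions, not the paper's actual design.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class DisentanglementModule(nn.Module):
    """Hypothetical sketch of a disentanglement module: the latent is
    re-encoded as a concatenation of independent blocks for object
    identity, attributes, pose, and spatial location. Block sizes and
    the per-factor linear heads are assumptions for illustration."""

    def __init__(self, latent_dim=512, block_dims=(192, 128, 96, 96)):
        super().__init__()
        assert sum(block_dims) == latent_dim
        # One projection head per semantic factor; each head reads the
        # full latent but writes only its own factor-specific block.
        self.heads = nn.ModuleList(
            nn.Linear(latent_dim, d) for d in block_dims
        )

    def forward(self, z):
        # Returns the re-encoded latent and its per-factor blocks.
        blocks = [head(z) for head in self.heads]
        return torch.cat(blocks, dim=-1), blocks


def compositional_consistency_loss(blocks_edit, blocks_ref, edited_idx):
    """Assumed stand-in for the compositional consistency loss: after
    editing one semantic factor (e.g. pose), every other factor block
    should remain close to its value in the reference latent."""
    loss = z = 0.0
    for i, (b_e, b_r) in enumerate(zip(blocks_edit, blocks_ref)):
        if i != edited_idx:
            loss = loss + F.mse_loss(b_e, b_r)
    return loss


if __name__ == "__main__":
    module = DisentanglementModule()
    z = torch.randn(4, 512)
    _, blocks_ref = module(z)                            # reference scene
    _, blocks_edit = module(z + 0.1 * torch.randn_like(z))  # perturbed scene
    # Penalize drift in all factors except pose (index 2 here, assumed).
    print(compositional_consistency_loss(blocks_edit, blocks_ref, edited_idx=2))
```

Under these assumptions, the penalty is what would let a user change one factor (say, an object's pose) while the training signal keeps identity, attributes, and location fixed; the actual loss and latent layout used by DLD may differ.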