Conditional Image Generation (CIG) is the task of using generative models to learn and sample image distributions that satisfy explicit condition variables, thereby synthesising high-quality images with specified semantic, structural, or stylistic attributes. In recent years, diffusion models have demonstrated significant advantages in conditional image generation, driving a paradigm shift from "random generation" to "controllable creation." This paper provides a systematic review of diffusion-based conditional image generation: it outlines the fundamental principles and methods of diffusion models, and surveys current mainstream development trends from several perspectives, including precise semantic control, spatial structure constraints, style variability, heterogeneous modality fusion, and dynamic temporal generation. It summarises the latest results on benchmark datasets such as MS-COCO, DrawBench, and T2I-CompBench, together with evaluation metrics such as FID and CLIP Score. It then discusses open challenges, including large-scale unified models, physical consistency, privacy protection, and edge deployment, and anticipates potential breakthroughs in content creation, autonomous driving, medical imaging, and virtual reality. This review aims to provide researchers with a comprehensive technical roadmap and to promote continued innovation in the theory and applications of conditional image generation.
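As background for the principles the survey builds on, the standard DDPM-style formulation can be sketched as follows; the notation is the conventional one from the diffusion literature, not notation taken from this paper. The forward process gradually corrupts an image $x_0$ with Gaussian noise according to a variance schedule $\beta_t$, and conditional sampling is commonly steered with classifier-free guidance:
$$
q(x_t \mid x_{t-1}) = \mathcal{N}\!\left(x_t;\ \sqrt{1-\beta_t}\,x_{t-1},\ \beta_t \mathbf{I}\right), \qquad
\tilde{\epsilon}_\theta(x_t, c) = \epsilon_\theta(x_t, \varnothing) + w\,\bigl(\epsilon_\theta(x_t, c) - \epsilon_\theta(x_t, \varnothing)\bigr),
$$
where $c$ is the condition variable (e.g., a text prompt), $\varnothing$ denotes the unconditional (null) input, and $w$ is the guidance scale trading off sample fidelity against diversity.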