This paper traces the evolution of text-to-image generation (TIG) techniques from Generative Adversarial Networks (GANs) to Diffusion Models (DMs). It first introduces GAN variants, including DCGAN, WGAN, MGGAN, and StyleGAN. While these popular GANs pioneered image synthesis through adversarial training of a generator against a discriminator, they suffered from training instability, mode collapse, and limited sample diversity. The paper then systematically introduces representative DMs, including DDPM, Guided Diffusion, GLIDE, Stable Diffusion, and Imagen, and shows how iterative denoising addresses these issues, achieving unprecedented image fidelity, semantic alignment, and generation stability. Quantitative comparisons on datasets such as COCO and CUB show that DMs consistently outperform GANs on metrics such as FID, IS, and CLIP score, although GANs retain shorter inference times. Nevertheless, critical challenges remain in generation efficiency, understanding of complex prompts, and safety controls. This paper analyses the likely causes of these problems and identifies key directions for future work.
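For reference, the two training mechanisms and the headline metric named above have standard formulations in the literature; the following is a brief sketch in conventional notation, which may differ from the notation used in the body of the paper.

Adversarial training (generator G against discriminator D):
\[ \min_G \max_D \; \mathbb{E}_{x \sim p_{\mathrm{data}}}\!\left[\log D(x)\right] + \mathbb{E}_{z \sim p_z}\!\left[\log\bigl(1 - D(G(z))\bigr)\right] \]

Iterative denoising (the simplified DDPM objective: predict the noise \(\epsilon\) mixed into a clean image \(x_0\) at timestep \(t\)):
\[ \mathcal{L}_{\mathrm{simple}} = \mathbb{E}_{t,\, x_0,\, \epsilon}\!\left[ \bigl\| \epsilon - \epsilon_\theta\!\bigl(\sqrt{\bar{\alpha}_t}\, x_0 + \sqrt{1 - \bar{\alpha}_t}\, \epsilon,\; t\bigr) \bigr\|^2 \right] \]

FID (Fréchet distance between Gaussian fits to the feature statistics of real images \(r\) and generated images \(g\)):
\[ \mathrm{FID} = \left\| \mu_r - \mu_g \right\|_2^2 + \operatorname{Tr}\!\bigl(\Sigma_r + \Sigma_g - 2\,(\Sigma_r \Sigma_g)^{1/2}\bigr) \]

Note that sampling from a DM requires many sequential evaluations of \(\epsilon_\theta\), whereas a GAN generates an image in a single forward pass of G, which accounts for the inference-time advantage of GANs noted above.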