JOURNAL ARTICLE

Fine-Grained Image Captioning by Ranking Diffusion Transformer

Jun Wan, Min Gan, Lefei Zhang, Jie Zhou, Jun Liu, Bo Du, C. L. Philip Chen

Year: 2025 Journal:   IEEE Transactions on Image Processing Vol: 34 Pages: 8332-8344   Publisher: Institute of Electrical and Electronics Engineers

Abstract

Image captioning models built on CLIP visual features have developed rapidly and achieved remarkable results. However, existing models still struggle to produce descriptive and discriminative captions because they insufficiently exploit fine-grained visual cues and fail to model complex vision-language alignment. To address these limitations, we propose a Ranking Diffusion Transformer (RDT), which integrates a Ranking Visual Encoder (RVE) and a Ranking Loss (RL) for fine-grained image captioning. The RVE introduces a novel ranking attention mechanism that effectively mines diverse and discriminative visual information from CLIP features. Meanwhile, the RL leverages the ranking of generated caption quality as a global semantic supervisory signal, thereby enhancing the diffusion process and strengthening vision-language semantic alignment. We show that by combining RVE and RL within RDT, and by gradually adding and removing noise in the diffusion process, more discriminative visual features are learned and precisely aligned with the language features. Experimental results on popular benchmark datasets demonstrate that the proposed RDT surpasses existing state-of-the-art image captioning models. The code is publicly available at: https://github.com/junwan2014/RDT.

Metrics

Cited By: 0
FWCI (Field Weighted Citation Impact): 0.00
Refs: 51

Related Documents

JOURNAL ARTICLE

FineFormer: Fine-Grained Adaptive Object Transformer for Image Captioning

Bo Wang, Zhao Zhang, Jicong Fan, Mingbo Zhao, Choujun Zhan, Mingliang Xu

Journal:   2022 IEEE International Conference on Data Mining (ICDM) Year: 2022 Pages: 508-517

JOURNAL ARTICLE

Fine-Grained Features for Image Captioning

Mengyue Shao, Jie Feng, Jie Wu, Haixiang Zhang, Yayu Zheng

Journal:   Computers, Materials & Continua Year: 2023 Vol: 75 (3) Pages: 4697-4712

JOURNAL ARTICLE

Fine-grained Image Captioning with CLIP Reward

Jaemin Cho, Seunghyun Yoon, Ajinkya Kale, Franck Dernoncourt, Trung Bui, Mohit Bansal

Journal:   Findings of the Association for Computational Linguistics: NAACL 2022 Year: 2022 Pages: 517-527

JOURNAL ARTICLE

MCoCa: Towards fine-grained multimodal control in image captioning

Shanshan Zhao, Teng Wang, Jinrui Zhang, Xiangchen Wang, Feng Zheng

Journal:   Pattern Recognition Year: 2025 Vol: 172 Pages: 112381-112381