Fine-Grained Image Captioning by Ranking Diffusion Transformer

Jun Wan; Min Gan; Lefei Zhang; Jie Zhou; Jun Liu; Bo Du; C. L. Philip Chen

doi:10.1109/tip.2025.3641303

JOURNAL ARTICLE

Year: 2025; Journal: IEEE Transactions on Image Processing; Vol: 34; Pages: 8332-8344; Publisher: Institute of Electrical and Electronics Engineers

Abstract

Image captioning models built on CLIP visual features have developed rapidly and achieved remarkable results. However, existing models still struggle to produce descriptive and discriminative captions because they insufficiently exploit fine-grained visual cues and fail to model complex vision-language alignment. To address these limitations, we propose a Ranking Diffusion Transformer (RDT), which integrates a Ranking Visual Encoder (RVE) and a Ranking Loss (RL) for fine-grained image captioning. The RVE introduces a novel ranking attention mechanism that effectively mines diverse and discriminative visual information from CLIP features. Meanwhile, the RL leverages the ranking of generated caption quality as a global semantic supervisory signal, thereby enhancing the diffusion process and strengthening vision-language semantic alignment. We show that by combining RVE and RL in the RDT, and by gradually adding and removing noise in the diffusion process, more discriminative visual features are learned and precisely aligned with the language features. Experimental results on popular benchmark datasets demonstrate that the proposed RDT surpasses existing state-of-the-art image captioning models. The code is publicly available at: https://github.com/junwan2014/RDT.
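The abstract does not give the form of the Ranking Loss, so the following is only a minimal sketch of the general idea of supervising with a quality ranking over candidate captions. It uses a generic pairwise margin (hinge) ranking loss, which is a standard technique and not necessarily the authors' formulation. The function name `pairwise_ranking_loss`, the inputs `pred_scores` and `quality_scores`, and the `margin` value are all hypothetical and do not come from the paper or its repository.

```python
from itertools import combinations


def pairwise_ranking_loss(pred_scores, quality_scores, margin=0.1):
    """Generic pairwise hinge ranking loss (illustrative; NOT the paper's RL).

    pred_scores:    model scores for candidate captions (hypothetical input).
    quality_scores: reference quality of each caption, e.g. from a caption
                    metric (hypothetical input); defines the target ranking.
    margin:         minimum score gap required between a better and a worse
                    caption (assumed value).
    """
    if len(pred_scores) != len(quality_scores):
        raise ValueError("pred_scores and quality_scores must have equal length")

    total, n_pairs = 0.0, 0
    for i, j in combinations(range(len(pred_scores)), 2):
        if quality_scores[i] == quality_scores[j]:
            continue  # ties carry no ranking information
        # Order the pair so that `hi` indexes the higher-quality caption.
        hi, lo = (i, j) if quality_scores[i] > quality_scores[j] else (j, i)
        # Penalize the pair unless the better caption outscores the worse
        # one by at least `margin`.
        total += max(0.0, margin - (pred_scores[hi] - pred_scores[lo]))
        n_pairs += 1
    return total / n_pairs if n_pairs else 0.0


# Predicted scores that agree with the quality ranking incur no loss;
# reversed scores are penalized.
agree = pairwise_ranking_loss([0.9, 0.5, 0.1], [1.2, 0.8, 0.3])
reversed_order = pairwise_ranking_loss([0.1, 0.5, 0.9], [1.2, 0.8, 0.3])
```

In this sketch the loss depends only on the relative order of the captions, which is one way a ranking can act as a global signal across a set of candidates rather than a per-token target.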

Related Documents

FineFormer: Fine-Grained Adaptive Object Transformer for Image Captioning

Fine-Grained Features for Image Captioning

Fine-grained Image Captioning with CLIP Reward

EFDiT: Efficient Fine-grained Image Generation Using Diffusion Transformer Models

MCoCa: Towards fine-grained multimodal control in image captioning