JOURNAL ARTICLE

Fine-Grained Image Captioning With Global-Local Discriminative Objective

Jie Wu, Tianshui Chen, Hefeng Wu, Zhi Yang, Guangchun Luo, Liang Lin

Year: 2020 Journal: IEEE Transactions on Multimedia Vol: 23 Pages: 2413-2427 Publisher: Institute of Electrical and Electronics Engineers

Abstract

Significant progress has been made in recent years in image captioning, an active topic in the fields of vision and language. However, existing methods tend to yield overly general captions that consist of the most frequent words and phrases, resulting in inaccurate and indistinguishable descriptions (see Fig. 1). This is primarily due to (i) the conservative nature of traditional training objectives, which drives the model to generate correct but hardly discriminative captions for similar images, and (ii) the uneven word distribution of the ground-truth captions, which encourages generating highly frequent words and phrases while suppressing the less frequent but more concrete ones. In this work, we propose a novel global-local discriminative objective that is formulated on top of a reference model to facilitate generating fine-grained descriptive captions. Specifically, from a global perspective, we design a global discriminative constraint that pushes the generated sentence to better distinguish the corresponding image from all others in the entire dataset. From a local perspective, we propose a local discriminative constraint that increases attention on the less frequent but more concrete words and phrases, thus facilitating the generation of captions that better describe the visual details of the given images. We evaluate the proposed method on the widely used MS-COCO dataset, where it outperforms the baseline methods by a sizable margin and performs competitively with existing leading approaches. We also conduct self-retrieval experiments to demonstrate the discriminability of the proposed method.
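
The abstract does not give the exact formulation of either constraint, so the following Python sketch is illustrative only and is not the authors' method. It assumes a generic inverse-frequency weight as a stand-in for emphasizing less frequent words, a generic hinge ranking loss over a caption-image similarity matrix as a stand-in for a global discriminative signal, and a standard Recall@K measure of the kind typically used in self-retrieval experiments. All function names, the margin value, and the toy data are hypothetical.

```python
import math
from collections import Counter


def inverse_frequency_weights(captions):
    """Map each word to log(total / count), so rarer words get larger weights.

    Assumption: a simple illustrative weighting, not the paper's local constraint.
    """
    counts = Counter(word for cap in captions for word in cap.lower().split())
    total = sum(counts.values())
    return {word: math.log(total / count) for word, count in counts.items()}


def hinge_ranking_loss(sim, margin=0.2):
    """Mean hinge loss over mismatched pairs of a square similarity matrix.

    sim[i][j] is a similarity score between caption i and image j, and the
    diagonal holds the matched pairs. Assumption: a generic ranking loss,
    not the paper's global constraint.
    """
    n = len(sim)
    loss = 0.0
    for i in range(n):
        for j in range(n):
            if i != j:
                # Penalize a mismatched image scoring within `margin` of the match.
                loss += max(0.0, margin - sim[i][i] + sim[i][j])
    return loss / (n * (n - 1))


def recall_at_k(sim, k):
    """Fraction of captions whose matched image ranks among the top k images."""
    hits = 0
    for i, row in enumerate(sim):
        ranked = sorted(range(len(row)), key=lambda j: row[j], reverse=True)
        if i in ranked[:k]:
            hits += 1
    return hits / len(sim)


# Toy data (hypothetical): three captions and a 3x3 caption-image similarity matrix.
toy_captions = [
    "a man riding a wave",
    "a man on a surfboard",
    "a man in a red wetsuit",
]
toy_sim = [
    [0.90, 0.20, 0.10],
    [0.30, 0.80, 0.75],
    [0.10, 0.60, 0.50],
]

weights = inverse_frequency_weights(toy_captions)
print(weights["a"] < weights["wetsuit"])  # frequent "a" is weighted below rare "wetsuit"
print(hinge_ranking_loss(toy_sim))
print(recall_at_k(toy_sim, 1), recall_at_k(toy_sim, 2))
```

In this toy setup, the third caption scores higher against the second image than against its own, so it contributes to the ranking loss and counts as a miss at K = 1. That is the kind of ambiguity between similar images that a discriminative objective aims to reduce.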

Keywords:
Closed captioning; Discriminative model; Computer science; Artificial intelligence; Margin (machine learning); Ground truth; Sentence; Word (group theory); Constraint (computer-aided design); Natural language processing; Perspective (graphical); Image (mathematics); Pattern recognition (psychology); Machine learning; Speech recognition; Linguistics; Mathematics

Metrics

Cited By: 83
FWCI (Field Weighted Citation Impact): 5.56
Refs: 109
Citation Normalized Percentile: 0.96

Topics

Multimodal Machine Learning Applications
Physical Sciences → Computer Science → Computer Vision and Pattern Recognition
Advanced Image and Video Retrieval Techniques
Physical Sciences → Computer Science → Computer Vision and Pattern Recognition
Domain Adaptation and Few-Shot Learning
Physical Sciences → Computer Science → Artificial Intelligence

Related Documents

JOURNAL ARTICLE

Fine-grained Image Captioning with CLIP Reward

Jaemin Cho, Seunghyun Yoon, Ajinkya Kale, Franck Dernoncourt, Trung Bui, Mohit Bansal

Journal: Findings of the Association for Computational Linguistics: NAACL 2022 Year: 2022 Pages: 517-527

JOURNAL ARTICLE

Fine-Grained Features for Image Captioning

Mengyue Shao, Jie Feng, Jie Wu, Haixiang Zhang, Yayu Zheng

Journal: Computers, Materials & Continua Year: 2023 Vol: 75 (3) Pages: 4697-4712

JOURNAL ARTICLE

Fine-Grained Image Captioning by Ranking Diffusion Transformer

Jun Wan, Min Gan, Lefei Zhang, Jie Zhou, Jun Liu, Bo Du, C. L. Philip Chen

Journal: IEEE Transactions on Image Processing Year: 2025 Vol: 34 Pages: 8332-8344