Fine-Grained Image Captioning With Global-Local Discriminative Objective

Jie Wu; Tianshui Chen; Hefeng Wu; Zhi Yang; Guangchun Luo; Liang Lin

doi:10.1109/tmm.2020.3011317


JOURNAL ARTICLE


Year: 2020
Journal: IEEE Transactions on Multimedia
Vol: 23
Pages: 2413-2427
Publisher: Institute of Electrical and Electronics Engineers



Abstract

Significant progress has been made in recent years in image captioning, an active topic in the fields of vision and language. However, existing methods tend to yield overly general captions that consist of some of the most frequent words/phrases, resulting in inaccurate and indistinguishable descriptions (see Fig. 1). This is primarily due to (i) the conservative nature of traditional training objectives, which drives the model to generate correct but hardly discriminative captions for similar images, and (ii) the uneven word distribution of the ground-truth captions, which encourages generating highly frequent words/phrases while suppressing the less frequent but more concrete ones. In this work, we propose a novel global-local discriminative objective that is formulated on top of a reference model to facilitate generating fine-grained descriptive captions. Specifically, from a global perspective, we design a novel global discriminative constraint that encourages the generated sentence to better discern the corresponding image from all others in the entire dataset. From a local perspective, we propose a local discriminative constraint that increases attention on the less frequent but more concrete words/phrases, thus facilitating the generation of captions that better describe the visual details of the given images. We evaluate the proposed method on the widely used MS-COCO dataset, where it outperforms the baseline methods by a sizable margin and achieves performance competitive with existing leading approaches. We also conduct self-retrieval experiments to demonstrate the discriminability of the proposed method.
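The abstract does not give the formulas behind the two constraints, so the following is only a rough sketch of the general idea and not the authors' formulation. It assumes a hinge-based ranking loss over a small batch as a stand-in for the global constraint (the abstract refers to the entire dataset) and inverse-frequency word weights as a stand-in for the local constraint. The function names, the `margin` value, and the similarity matrix `sim` are all hypothetical.

```python
import math
from collections import Counter


def global_ranking_loss(sim, margin=0.2):
    """Hinge ranking loss over a square similarity matrix (assumed form).

    sim[i][j] is a hypothetical similarity score between generated
    caption i and image j; matching pairs sit on the diagonal.
    """
    n = len(sim)
    total = 0.0
    for i in range(n):
        pos = sim[i][i]
        for j in range(n):
            if j == i:
                continue
            # Caption i should score higher on its own image than on image j.
            total += max(0.0, margin - pos + sim[i][j])
            # Image i should score higher with its own caption than with caption j.
            total += max(0.0, margin - pos + sim[j][i])
    return total / n


def word_weights(captions):
    """Assumed weighting: rarer words in the corpus get larger weights."""
    counts = Counter(word for cap in captions for word in cap.split())
    total = sum(counts.values())
    return {word: -math.log(c / total) for word, c in counts.items()}


def local_weighted_score(caption, weights):
    """Average word weight of a caption; unseen words contribute 0."""
    words = caption.split()
    return sum(weights.get(w, 0.0) for w in words) / max(len(words), 1)


if __name__ == "__main__":
    # Toy batch: captions match their own images well, so the loss is small.
    sim = [[0.9, 0.1], [0.2, 0.8]]
    print(global_ranking_loss(sim))

    weights = word_weights(["a dog", "a cat", "a red kite"])
    # A caption with rarer, more concrete words scores higher than a generic one.
    print(local_weighted_score("a red kite", weights))
    print(local_weighted_score("a a a", weights))
```

In this sketch a caption is penalized whenever it matches another image in the batch nearly as well as its own, and captions built from rarer words receive a higher local score. How the paper actually combines such terms with the reference model is not stated in the abstract.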

Keywords:

Closed captioning; Discriminative model; Computer science; Artificial intelligence; Margin (machine learning); Ground truth; Sentence; Word (group theory); Constraint (computer-aided design); Natural language processing; Perspective (graphical); Image (mathematics); Pattern recognition (psychology); Machine learning; Speech recognition; Linguistics; Mathematics

Metrics

Cited By

FWCI (Field Weighted Citation Impact): 5.56

Refs: 109

Citation Normalized Percentile: 0.96


Topics

Multimodal Machine Learning Applications

Physical Sciences → Computer Science → Computer Vision and Pattern Recognition

Advanced Image and Video Retrieval Techniques

Physical Sciences → Computer Science → Computer Vision and Pattern Recognition

Domain Adaptation and Few-Shot Learning

Physical Sciences → Computer Science → Artificial Intelligence


Related Documents

Fine-grained Image Captioning with CLIP Reward

Fine-Grained Features for Image Captioning

Concrete Image Captioning by Integrating Content Sensitive and Global Discriminative Objective

Fine-Grained Image Captioning by Ranking Diffusion Transformer

Image Difference Captioning With Instance-Level Fine-Grained Feature Representation