ArCo: Attention-reinforced transformer with contrastive learning for image captioning

Zhongan Wang; Shuai Shi; Zirong Zhai; Yingna Wu; Rui Yang

doi:10.1016/j.imavis.2022.104570

JOURNAL ARTICLE

Year: 2022; Journal: Image and Vision Computing; Vol: 128; Pages: 104570; Publisher: Elsevier BV

Abstract

Image captioning, in which a textual description of an image's content is generated, is a significant step toward automatic interaction between humans and computers. Recently, the transformer-based encoder–decoder paradigm has achieved strong results in image captioning. Such models are usually trained with a cross-entropy loss function. However, captions that differ in wording but share the same meaning can yield different losses. As a result, the generated descriptions tend to be uniform, which limits caption diversity. In this paper, we present ArCo, an attention-reinforced transformer with contrastive learning for image captioning. The architecture improves the image encoding stage by integrating a feature attention block (FAB) that exploits the relationships between image regions. During training, we optimize the model with a combination of cross-entropy loss and contrastive loss. We experimentally compare the performance of ArCo with other fully attentive models. We also validate the transformer baseline for image captioning with different pre-trained models. The proposed approach achieves new state-of-the-art performance on the offline ‘Karpathy’ test split and on the online test server.
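The abstract does not state the form of the contrastive term or how it is weighted against the cross-entropy term. The following is a minimal sketch of one way such a combined objective could look, assuming an InfoNCE-style contrastive loss over matched image and caption embeddings within a batch. The function names, the `temperature` value, and the `weight` parameter are all hypothetical and are not taken from the paper.

```python
import math


def cross_entropy(logits, target):
    """Cross-entropy for one prediction: -log softmax(logits)[target]."""
    m = max(logits)  # subtract the max for numerical stability
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_z - logits[target]


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def info_nce(image_embs, caption_embs, temperature=0.1):
    """Assumed contrastive term: image i should match caption i;
    the other captions in the batch act as negatives."""
    total = 0.0
    for i, img in enumerate(image_embs):
        sims = [cosine(img, cap) / temperature for cap in caption_embs]
        total += cross_entropy(sims, i)
    return total / len(image_embs)


def combined_loss(step_logits, step_targets, image_embs, caption_embs, weight=0.5):
    """Mean token-level cross-entropy plus a weighted contrastive term.
    `weight` is a hypothetical hyperparameter, not a value from the paper."""
    ce = sum(cross_entropy(l, t) for l, t in zip(step_logits, step_targets))
    ce /= len(step_targets)
    return ce + weight * info_nce(image_embs, caption_embs)
```

In this sketch the cross-entropy term scores each generated token against the reference caption, while the contrastive term depends only on how well each image embedding aligns with its own caption embedding relative to the others in the batch. That makes the second term insensitive to the exact wording of the caption, which is the kind of property the abstract's motivation points to.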

Keywords:

Closed captioning; Transformer; Computer science; Artificial intelligence; Encoder; Cross entropy; Image (mathematics); Natural language processing; Speech recognition; Pattern recognition (psychology); Engineering; Voltage

Metrics

FWCI (Field Weighted Citation Impact): 1.98

Citation Normalized Percentile: 0.85

Topics

Multimodal Machine Learning Applications

Physical Sciences → Computer Science → Computer Vision and Pattern Recognition

Advanced Image and Video Retrieval Techniques

Physical Sciences → Computer Science → Computer Vision and Pattern Recognition

Domain Adaptation and Few-Shot Learning

Physical Sciences → Computer Science → Artificial Intelligence

Related Documents

The CAA Captioner–Enhancing Image Captioning with Contrastive Learning and Attention on Attention Mechanism

Attention-Aligned Transformer for Image Captioning

Reinforced Transformer for Medical Image Captioning

Transformer with sparse self‐attention mechanism for image captioning

Enhancing image captioning with asynchronous dual attention vision transformer