JOURNAL ARTICLE

ArCo: Attention-reinforced transformer with contrastive learning for image captioning

Zhongan Wang, Shuai Shi, Zirong Zhai, Yingna Wu, Rui Yang

Year: 2022 Journal: Image and Vision Computing Vol: 128 Pages: 104570 Publisher: Elsevier BV

Abstract

Image captioning, in which a textual description of the content of an image is generated, is a significant step toward automatic interaction between humans and computers. Recently, the transformer-based encoder–decoder paradigm has achieved strong results in image captioning. Such models are usually trained with a cross-entropy loss function. However, different captions that express the same meaning for an image may yield different losses. As a result, the generated descriptions tend to be uniform, which limits the diversity of image captioning. In this paper, we present ArCo, an attention-reinforced transformer architecture for image captioning. The architecture improves the image encoding stage by integrating a feature attention block (FAB) that exploits the relationships between image regions. During the training phase, we trained the model with a combination of cross-entropy loss and contrastive loss. We experimentally explored the performance of ArCo and other fully attentive models. We also validated the transformer baseline for image captioning with different pre-trained models. Our proposed approach achieved a new state-of-the-art performance on the offline ‘Karpathy’ test split and the online test server.
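The training objective described in the abstract, a combination of cross-entropy loss and contrastive loss, can be illustrated with a minimal sketch. The abstract does not state the form of the contrastive term or how the two losses are weighted, so everything specific here is an assumption: the InfoNCE-style formulation over paired image and caption embeddings, the `weight` and `temperature` parameters, and all function names are hypothetical and are not taken from the paper.

```python
import math

def log_softmax(scores):
    # Numerically stable log-softmax over a list of floats.
    m = max(scores)
    log_z = m + math.log(sum(math.exp(s - m) for s in scores))
    return [s - log_z for s in scores]

def cross_entropy(token_logits, target_ids):
    # Mean negative log-likelihood of the reference caption tokens.
    total = 0.0
    for logits, target in zip(token_logits, target_ids):
        total -= log_softmax(logits)[target]
    return total / len(target_ids)

def cosine(u, v):
    # Cosine similarity; assumes neither vector is all zeros.
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def contrastive_loss(image_embs, caption_embs, temperature=0.1):
    # Assumed InfoNCE-style term: image i should score highest with
    # caption i and lower with every other caption in the batch.
    total = 0.0
    for i, img in enumerate(image_embs):
        sims = [cosine(img, cap) / temperature for cap in caption_embs]
        total -= log_softmax(sims)[i]
    return total / len(image_embs)

def combined_loss(token_logits, target_ids, image_embs, caption_embs,
                  weight=0.5, temperature=0.1):
    # Assumed weighted sum; the paper's actual weighting is not given here.
    ce = cross_entropy(token_logits, target_ids)
    con = contrastive_loss(image_embs, caption_embs, temperature)
    return ce + weight * con
```

Under these assumptions, the contrastive term is small when each image embedding is closest to its own caption embedding and grows when pairs are mismatched, while the cross-entropy term still scores the caption token by token.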

Keywords:
Closed captioning, Transformer, Computer science, Artificial intelligence, Encoder, Cross entropy, Image (mathematics), Natural language processing, Speech recognition, Pattern recognition (psychology), Engineering, Voltage

Metrics

Cited By: 16
FWCI (Field Weighted Citation Impact): 1.98
Refs: 52
Citation Normalized Percentile: 0.85

Topics

Multimodal Machine Learning Applications
Physical Sciences → Computer Science → Computer Vision and Pattern Recognition
Advanced Image and Video Retrieval Techniques
Physical Sciences → Computer Science → Computer Vision and Pattern Recognition
Domain Adaptation and Few-Shot Learning
Physical Sciences → Computer Science → Artificial Intelligence

Related Documents

BOOK-CHAPTER

The CAA Captioner–Enhancing Image Captioning with Contrastive Learning and Attention on Attention Mechanism

Zhao Cui

Smart Innovation, Systems and Technologies Year: 2024 Pages: 279-295

JOURNAL ARTICLE

Attention-Aligned Transformer for Image Captioning

Zhengcong Fei

Journal: Proceedings of the AAAI Conference on Artificial Intelligence Year: 2022 Vol: 36 (1) Pages: 607-615

BOOK-CHAPTER

Reinforced Transformer for Medical Image Captioning

Yuxuan Xiong, Bo Du, Pingkun Yan

Lecture Notes in Computer Science Year: 2019 Pages: 673-680

JOURNAL ARTICLE

Transformer with sparse self‐attention mechanism for image captioning

Duofeng Wang, Haifeng Hu, Dihu Chen

Journal: Electronics Letters Year: 2020 Vol: 56 (15) Pages: 764-766