JOURNAL ARTICLE

ReFormer: The Relational Transformer for Image Captioning

Xuewen Yang, Yingru Liu, Xin Wang

Year: 2022 Journal: Proceedings of the 30th ACM International Conference on Multimedia Pages: 5398-5406

Abstract

Image captioning has been shown to perform better when scene graphs are used to represent the relations between objects in the image. Current captioning encoders generally use a Graph Convolutional Network (GCN) to represent the relation information and merge it with the object region features via concatenation or convolution to produce the final input for sentence decoding. However, the GCN-based encoders in existing methods are less effective for captioning for two reasons. First, using image captioning as the objective (i.e., Maximum Likelihood Estimation) rather than a relation-centric loss cannot fully exploit the potential of the encoder. Second, using a pre-trained model instead of the encoder itself to extract the relationships is inflexible and does not contribute to the explainability of the model. To improve the quality of image captioning, we propose a novel architecture, ReFormer, a RElational transFORMER that generates features with relation information embedded and explicitly expresses the pair-wise relationships between objects in the image. ReFormer incorporates the objective of scene graph generation with that of image captioning using one modified Transformer model. This design allows ReFormer to generate not only better image captions, thanks to the strong relational image features it extracts, but also scene graphs that explicitly describe the pair-wise relationships. Experiments on publicly available datasets show that our model significantly outperforms state-of-the-art methods on image captioning and scene graph generation.
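The abstract contrasts a captioning-only objective (Maximum Likelihood Estimation) with a relation-centric loss. As a rough illustration only, the sketch below computes both kinds of loss as cross-entropies and combines them with a weighted sum. The weighted sum, the function names, and the `relation_weight` parameter are assumptions made for this example; the abstract does not state how ReFormer actually combines or schedules the two objectives.

```python
import math


def cross_entropy(logits, target):
    """Negative log-likelihood of class `target` under softmax(logits)."""
    m = max(logits)  # subtract the max for numerical stability
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_z - logits[target]


def mean_cross_entropy(logit_rows, targets):
    """Average cross-entropy over a list of predictions."""
    losses = [cross_entropy(row, t) for row, t in zip(logit_rows, targets)]
    return sum(losses) / len(losses)


def combined_loss(caption_logits, caption_tokens,
                  relation_logits, relation_labels,
                  relation_weight=1.0):
    """Hypothetical combination: caption NLL plus a weighted relation NLL.

    caption_logits:  one row of vocabulary scores per caption position
    relation_logits: one row of predicate scores per object pair
    relation_weight: assumed hyperparameter, not taken from the paper
    """
    caption_nll = mean_cross_entropy(caption_logits, caption_tokens)
    relation_nll = mean_cross_entropy(relation_logits, relation_labels)
    total = caption_nll + relation_weight * relation_nll
    return total, caption_nll, relation_nll


# Toy example: a 4-word vocabulary and 2 predicate classes, uniform scores.
total, cap, rel = combined_loss(
    caption_logits=[[0.0, 0.0, 0.0, 0.0]], caption_tokens=[1],
    relation_logits=[[0.0, 0.0]], relation_labels=[0],
    relation_weight=0.5,
)
```

With uniform scores each term reduces to the log of the number of classes, which makes the toy example easy to check by hand. Setting `relation_weight` to zero recovers the captioning-only objective that the abstract argues is insufficient for training the encoder.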

Keywords:
Closed captioning; Computer science; Transformer; Artificial intelligence; Scene graph; Encoder; Graph; Decoding methods; Sentence; Image (mathematics); Computer vision; Natural language processing; Theoretical computer science; Rendering (computer graphics); Algorithm; Voltage

Metrics

Cited By: 61
FWCI (Field Weighted Citation Impact): 4.07
Refs: 26
Citation Normalized Percentile: 0.95

Topics

Multimodal Machine Learning Applications
Physical Sciences → Computer Science → Computer Vision and Pattern Recognition
Advanced Image and Video Retrieval Techniques
Physical Sciences → Computer Science → Computer Vision and Pattern Recognition
Domain Adaptation and Few-Shot Learning
Physical Sciences → Computer Science → Artificial Intelligence

Related Documents

JOURNAL ARTICLE

Relational-Convergent Transformer for image captioning

Lizhi Chen, You-Fu Yang, Juntao Hu, Longyue Pan, Hao Zhai

Journal: Displays Year: 2023 Vol: 77 Pages: 102377-102377

JOURNAL ARTICLE

Relational Graph Reasoning Transformer for Image Captioning

Xinyu Xiao, Zixun Sun, Tingtian Li, Yipeng Yu

Journal: 2022 IEEE International Conference on Multimedia and Expo (ICME) Year: 2022

BOOK-CHAPTER

Image Captioning with Relational Knowledge

Huan Yang, Dandan Song, Lejian Liao

Lecture notes in computer science Year: 2018 Pages: 378-386

JOURNAL ARTICLE

Rotary transformer for image captioning

Yile Qiu, Li Zhu

Year: 2022 Pages: 1-1