JOURNAL ARTICLE

Graph Self-Attention Network for Image Captioning

Abstract

Most state-of-the-art image captioning methods depend heavily on an attention mechanism over object regions within the encoder-decoder framework. Existing attention models are generally based on simple addition or multiplication operations and may not fully capture the complex relationships between the visual features and the target words. In this paper, we propose a novel attention model, named graph self-attention (GSA), that combines graph networks and self-attention for image captioning. GSA constructs a star graph to dynamically assign weights to the detected object regions as the words are generated step by step. The central node is represented by the semantic feature, and the visual features of the object regions serve as edge nodes. By propagating messages between the central node and the edge nodes, GSA explicitly captures the relationships between the current target word and the image features. To generate conjunctions and attributives that are not directly related to visual information, GSA introduces self-attention so that such words can focus more on the semantic information. Moreover, the GSA model is generic and can be applied to other tasks that require attention over multiple features. Experiments show the effectiveness and potential of the proposed GSA.
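
The abstract does not give the GSA equations, so the following is only a minimal sketch of the general idea of attention over a star graph, not the authors' model. It assumes scaled dot-product scoring, a simple averaging step for the center-to-edge message, and a self-loop on the central node to stand in for the self-attention term; the function name `star_graph_attention` and all of these design choices are hypothetical.

```python
import math


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def star_graph_attention(center, regions):
    """One decoding step of a toy star-graph attention (illustrative only).

    center  : semantic feature for the current word (central node)
    regions : visual features of the detected object regions (edge nodes)
    Returns (weights, context). The last weight belongs to the central
    node's self-loop, i.e. how much the step relies on semantic information.
    """
    d = len(center)
    scale = math.sqrt(d)
    # Center -> edge message: mix each region with the center
    # (assumption: plain averaging, not a learned update).
    updated = [[(r_k + c_k) / 2.0 for r_k, c_k in zip(r, center)]
               for r in regions]
    # Score each updated edge node against the center, then add a
    # self-loop score so the center can attend to itself.
    scores = [dot(center, u) / scale for u in updated]
    scores.append(dot(center, center) / scale)
    weights = softmax(scores)
    # Edge -> center message: weighted sum over the regions and the center.
    nodes = regions + [center]
    context = [sum(w * n[k] for w, n in zip(weights, nodes))
               for k in range(d)]
    return weights, context


if __name__ == "__main__":
    weights, context = star_graph_attention(
        center=[1.0, 0.0],
        regions=[[1.0, 0.0], [0.0, 1.0]],
    )
    print("weights:", weights)
    print("context:", context)
```

In this toy setup, a word whose semantic feature is unrelated to every region would place most of its weight on the self-loop, which mirrors the abstract's motivation for letting conjunctions and attributives focus on semantic information.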

Keywords:
Closed captioning; Computer science; Focus (optics); Graph; Artificial intelligence; Encoder; Visualization; Attention network; Word (group theory); Feature (linguistics); Enhanced Data Rates for GSM Evolution; Object (grammar); Image (mathematics); Pattern recognition (psychology); Theoretical computer science; Natural language processing; Mathematics

Metrics

Cited By: 1
FWCI (Field Weighted Citation Impact): 0.10
Refs: 52
Citation Normalized Percentile: 0.44

Topics

Multimodal Machine Learning Applications
Physical Sciences →  Computer Science →  Computer Vision and Pattern Recognition
Advanced Image and Video Retrieval Techniques
Physical Sciences →  Computer Science →  Computer Vision and Pattern Recognition
Domain Adaptation and Few-Shot Learning
Physical Sciences →  Computer Science →  Artificial Intelligence
