JOURNAL ARTICLE

Boosted Transformer for Image Captioning

Jiangyun Li, Peng Yao, Longteng Guo, Weicun Zhang

Year: 2019 Journal: Applied Sciences Vol: 9 (16) Pages: 3260-3260 Publisher: Multidisciplinary Digital Publishing Institute

Abstract

Image captioning aims to generate a description of a given image. It typically uses a Convolutional Neural Network (CNN) as the encoder to extract visual features and a sequence model as the decoder to generate the description; among sequence models, the self-attention mechanism has made notable progress recently. However, this predominant encoder-decoder architecture still has open problems. On the encoder side, the extracted visual features lack semantic concepts and therefore do not make full use of the image information. On the decoder side, sequence self-attention relies only on word representations, so it lacks the guidance of visual information and is easily influenced by the language prior. In this paper, we propose a novel boosted transformer model that addresses these problems with two attention modules: “Concept-Guided Attention” (CGA) and “Vision-Guided Attention” (VGA). The model uses CGA in the encoder to obtain boosted visual features by integrating instance-level concepts into the visual features. In the decoder, we stack VGA, which uses visual information as a bridge to model internal relationships among the sequences and can serve as an auxiliary module to sequence self-attention. Quantitative and qualitative results on the Microsoft COCO dataset demonstrate that our model outperforms state-of-the-art approaches.
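The abstract does not give the equations for CGA, so the following is only a minimal sketch of one plausible reading: visual region features act as queries in standard scaled dot-product attention over concept embeddings, and the attended concepts are added back to the visual features. The function names (`cross_attention`, `concept_guided_features`), the residual connection, and the array sizes are all assumptions made for illustration, not the authors' formulation.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row maximum for numerical stability.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(queries, keys, values):
    """Scaled dot-product attention: each query row attends over all key rows."""
    d_k = queries.shape[-1]
    scores = queries @ keys.T / np.sqrt(d_k)
    weights = softmax(scores, axis=-1)
    return weights @ values, weights

def concept_guided_features(visual, concepts):
    # Hypothetical helper: visual regions query the concept embeddings,
    # and the result is added to the original features (assumed residual).
    attended, weights = cross_attention(visual, concepts, concepts)
    return visual + attended, weights

rng = np.random.default_rng(0)
visual = rng.normal(size=(36, 64))    # 36 region features of width 64 (illustrative)
concepts = rng.normal(size=(5, 64))   # 5 concept embeddings (illustrative)
boosted, weights = concept_guided_features(visual, concepts)
```

Here `boosted` keeps the shape of `visual`, and each row of `weights` is a distribution over the concepts. Under the same assumptions, a vision-guided variant for the decoder could reuse `cross_attention` with word representations as queries and visual features as keys and values.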

Keywords:
Computer science, Closed captioning, Encoder, Transformer, Video Graphics Array, Artificial intelligence, Convolutional neural network, Computer vision, Natural language processing, Image (mathematics), Speech recognition, Programming language, Software

Metrics

Cited By: 46
FWCI (Field Weighted Citation Impact): 1.92
Refs: 34
Citation Normalized Percentile: 0.89
Is in top 1%
Is in top 10%

Topics

Multimodal Machine Learning Applications
Physical Sciences →  Computer Science →  Computer Vision and Pattern Recognition
Advanced Image and Video Retrieval Techniques
Physical Sciences →  Computer Science →  Computer Vision and Pattern Recognition
Domain Adaptation and Few-Shot Learning
Physical Sciences →  Computer Science →  Artificial Intelligence

Related Documents

JOURNAL ARTICLE

Rotary transformer for image captioning

Yile Qiu, Li Zhu

Year: 2022 Pages: 1-1

JOURNAL ARTICLE

Image Captioning using Transformer Model

Anisha Adhikari, Mahigya Dahal, Rudra Nepal, Priya Shilpakar

Journal: Proceedings of International Conference on Innovation in Computing Science Engineering and Technology Year: 2025 Vol: 2 (1)

JOURNAL ARTICLE

Visual Image Captioning through Transformer

Muneeb Nabi, Rohit Pachauri, Shouaib Ahmad, Kanishk Varshney, Prachi Goel, Apurva Jain

Journal: International Journal for Research in Applied Science and Engineering Technology Year: 2023 Vol: 11 (12) Pages: 2047-2050

JOURNAL ARTICLE

S2 Transformer for Image Captioning

Pengpeng Zeng, Haonan Zhang, Jingkuan Song, Lianli Gao

Journal: Proceedings of the Thirty-First International Joint Conference on Artificial Intelligence Year: 2022 Pages: 1608-1614