JOURNAL ARTICLE

Entangled Transformer for Image Captioning

Abstract

In image captioning, typical attention mechanisms struggle to identify the corresponding visual signals, especially when predicting highly abstract words. This phenomenon is known as the semantic gap between vision and language. The problem can be overcome by providing semantic attributes that are homologous to language. Thanks to their inherent recurrent nature and gated operating mechanism, Recurrent Neural Networks (RNNs) and their variants are the dominant architectures in image captioning. However, when designing elaborate attention mechanisms to integrate visual inputs and semantic attributes, RNN-like variants become inflexible due to their complexity. In this paper, we investigate a Transformer-based sequence modeling framework, built only with attention layers and feedforward layers. To bridge the semantic gap, we introduce EnTangled Attention (ETA), which enables the Transformer to exploit semantic and visual information simultaneously. Furthermore, a Gated Bilateral Controller (GBC) is proposed to guide the interactions between the multimodal information. We name our model ETA-Transformer. Remarkably, ETA-Transformer achieves state-of-the-art performance on the MSCOCO image captioning dataset. The ablation studies validate the improvements of our proposed modules.
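
The abstract does not give the equations for ETA or GBC, so the following is only a rough illustration of the general idea it describes: attend to visual features and to semantic attributes with the same query, then mix the two results through a gate. The function names, the per-dimension gate weights `gate_w`, and the gate formula are all assumptions made for this sketch; it is not the paper's formulation.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(query, keys, values):
    """Scaled dot-product attention for a single query vector."""
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    weights = softmax(scores)
    dim = len(values[0])
    return [sum(w * v[i] for w, v in zip(weights, values)) for i in range(dim)]


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def gated_fusion(query, visual_feats, semantic_feats, gate_w):
    """Attend to each modality, then blend the two contexts.

    gate_w is a hypothetical learned weight vector (one scalar per
    dimension); a real model would likely use a full linear layer.
    """
    v_ctx = attend(query, visual_feats, visual_feats)
    s_ctx = attend(query, semantic_feats, semantic_feats)
    gates = [sigmoid(w * (v + s)) for w, v, s in zip(gate_w, v_ctx, s_ctx)]
    return [g * v + (1.0 - g) * s for g, v, s in zip(gates, v_ctx, s_ctx)]


# Toy inputs: two identical visual region vectors, one semantic attribute vector.
fused = gated_fusion(
    query=[1.0, 1.0],
    visual_feats=[[1.0, 0.0], [1.0, 0.0]],
    semantic_feats=[[0.0, 1.0]],
    gate_w=[0.0, 0.0],  # zero weights give a gate of 0.5 in every dimension
)
```

With zero gate weights the gate is 0.5 everywhere, so `fused` is the plain average of the visual and semantic contexts; non-zero weights would shift each dimension toward one modality or the other.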

Keywords:
Closed captioning; Computer science; Transformer; Semantic gap; Recurrent neural network; Exploit; Artificial intelligence; Natural language processing; Artificial neural network; Image (mathematics); Image retrieval; Engineering; Voltage

Metrics

Cited By: 378
FWCI (Field Weighted Citation Impact): 21.59
Refs: 73
Citation Normalized Percentile: 0.99
Is in top 1%
Is in top 10%

Topics

Multimodal Machine Learning Applications
Physical Sciences →  Computer Science →  Computer Vision and Pattern Recognition
Advanced Image and Video Retrieval Techniques
Physical Sciences →  Computer Science →  Computer Vision and Pattern Recognition
Human Pose and Action Recognition
Physical Sciences →  Computer Science →  Computer Vision and Pattern Recognition

Related Documents

JOURNAL ARTICLE

Rotary transformer for image captioning

Yile Qiu, Li Zhu

Year: 2022 Pages: 1-1
JOURNAL ARTICLE

Image Captioning using Transformer Model

Anisha Adhikari, Mahigya Dahal, Rudra Nepal, Priya Shilpakar

Journal:   Proceedings of International Conference on Innovation in Computing Science Engineering and Technology Year: 2025 Vol: 2 (1)
JOURNAL ARTICLE

Visual Image Captioning through Transformer

Muneeb Nabi, Rohit Pachauri, Shouaib Ahmad, Kanishk Varshney, Prachi Goel, Apurva Jain

Journal:   International Journal for Research in Applied Science and Engineering Technology Year: 2023 Vol: 11 (12) Pages: 2047-2050
JOURNAL ARTICLE

S2 Transformer for Image Captioning

Pengpeng Zeng, Haonan Zhang, Jingkuan Song, Lianli Gao

Journal:   Proceedings of the Thirty-First International Joint Conference on Artificial Intelligence Year: 2022 Pages: 1608-1614
JOURNAL ARTICLE

Boosted Transformer for Image Captioning

Jiangyun Li, Peng Yao, Longteng Guo, Weicun Zhang

Journal:   Applied Sciences Year: 2019 Vol: 9 (16) Pages: 3260-3260