JOURNAL ARTICLE

Enhancing image captioning with asynchronous dual attention vision transformer

Abstract

This paper proposes an Asynchronous Dual Attention (ADA) mechanism within a Vision Transformer (ViT) based framework to enhance image captioning. Traditional deep-learning models for image captioning often struggle with multimodal interactions and with capturing local-to-global visual context, including both prominent and subtle features. To address this, the proposed model applies attention at two separate stages. During image encoding, it integrates global self-attention (ViT-B/16) with a Joint Calibration Module to enhance the quality of the visual embeddings. During decoding, it combines dynamic step-wise attention (Bahdanau) with a Gated Recurrent Unit (GRU). Together these stages form an ADA pipeline that decouples the visual and linguistic pathways, allowing adaptive refinement of visual features and more precise alignment with the linguistic context. Unlike synchronous attention models, ADA enables dynamic image region selection and improved spatial reasoning through enhanced multimodal interaction, leading to more contextually coherent and informative captions for complex visual scenes. The proposed approach shows consistent improvement over state-of-the-art methods on benchmark datasets, achieving CIDEr scores of 0.946 and 1.364 and SPICE scores of 0.188 and 0.248 on the Flickr30k and MSCOCO datasets, respectively. Additionally, the framework incorporates Google’s text-to-speech synthesis to generate audio captions, enhancing accessibility for visually impaired users.
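The step-wise (Bahdanau) attention named in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the tiny dimensions, and the random weights are assumptions chosen for illustration, and the ViT encoder, the Joint Calibration Module, and the GRU update are all omitted. The sketch only shows how one decoding step could score each image-region feature against the current decoder hidden state and form a weighted context vector.

```python
import math
import random


def matvec(W, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def additive_attention(features, hidden, W_f, W_h, v):
    """Bahdanau-style additive attention for one decoding step.

    features: N region vectors of length D (stand-ins for patch embeddings)
    hidden:   decoder hidden state of length H (stand-in for the GRU state)
    W_f (A x D), W_h (A x H), v (length A): hypothetical learned parameters
    Returns (weights, context): N attention weights and a length-D context.
    """
    proj_h = matvec(W_h, hidden)  # shared across all regions
    scores = []
    for f in features:
        proj_f = matvec(W_f, f)
        # score_i = v^T tanh(W_f f_i + W_h h)
        scores.append(
            sum(vk * math.tanh(a + b) for vk, a, b in zip(v, proj_f, proj_h))
        )
    weights = softmax(scores)
    dim = len(features[0])
    # context = sum_i weight_i * f_i
    context = [sum(w * f[d] for w, f in zip(weights, features)) for d in range(dim)]
    return weights, context


def random_matrix(rows, cols, rng):
    """Illustrative random parameters; a trained model would learn these."""
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]


rng = random.Random(0)
N, D, H, A = 4, 3, 2, 5  # toy sizes, far smaller than real patch embeddings
features = random_matrix(N, D, rng)
hidden = [rng.uniform(-0.5, 0.5) for _ in range(H)]
W_f = random_matrix(A, D, rng)
W_h = random_matrix(A, H, rng)
v = [rng.uniform(-0.5, 0.5) for _ in range(A)]

weights, context = additive_attention(features, hidden, W_f, W_h, v)
```

In a full decoder, the context vector would typically be combined with the previous word embedding and fed to the recurrent unit, whose new hidden state is then used to recompute the weights at the next step. That recomputation is what makes the region selection change from word to word.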

Keywords:
Closed captioning; Asynchronous communication; Transformer; Computer science; Dual (grammatical number); Computer vision; Artificial intelligence; Image (mathematics); Electrical engineering; Engineering; Telecommunications; Linguistics; Voltage

Metrics

Cited By: 0
FWCI (Field Weighted Citation Impact): 0.00
Refs: 57
Citation Normalized Percentile: 0.24

Topics

Multimodal Machine Learning Applications
Advanced Image and Video Retrieval Techniques
Visual Attention and Saliency Detection
Field (all three topics): Physical Sciences → Computer Science → Computer Vision and Pattern Recognition

Related Documents

JOURNAL ARTICLE

Dual visual align-cross attention-based image captioning transformer

Yonggong Ren, Jinghan Zhang, Wenqiang Xu, Yuzhu Lin, Bo Fu, Dang N. H. Thanh

Journal: Multimedia Tools and Applications Year: 2024 Vol: 84 (12) Pages: 10645-10664

BOOK-CHAPTER

Dual Transformer with Gated-Attention Fusion for News Disaster Image Captioning

Yinghua Li, Yaping Zhu, Yana Zhang, Cheng Yang

Communications in Computer and Information Science Year: 2024 Pages: 193-207

JOURNAL ARTICLE

Attention-Aligned Transformer for Image Captioning

Zhengcong Fei

Journal: Proceedings of the AAAI Conference on Artificial Intelligence Year: 2022 Vol: 36 (1) Pages: 607-615