Biswajit Patra, Dakshina Ranjan Kisku
This paper proposes a novel approach to enhance image captioning by leveraging an Asynchronous Dual Attention (ADA) mechanism within a Vision Transformer (ViT)-based framework. Traditional deep-learning models for image captioning often struggle with multimodal interactions and with capturing local-to-global visual contexts, including both prominent and subtle features. To address this, the proposed model integrates global self-attention (ViT-B/16) with a Joint Calibration Module during image encoding to enhance the quality of visual embeddings, and combines dynamic step-wise attention (Bahdanau) with a Gated Recurrent Unit (GRU) during decoding. This forms an ADA pipeline that decouples the visual and linguistic pathways, allowing adaptive refinement of visual features and more precise alignment with linguistic context. Unlike synchronous attention models, ADA enables dynamic image-region selection and improved spatial reasoning through enhanced multimodal interaction, leading to more contextually coherent and informative captions for complex visual scenes. The proposed approach demonstrates consistent improvement over state-of-the-art methods on benchmark datasets, achieving CIDEr scores of 0.946 and 1.364 and SPICE scores of 0.188 and 0.248 on the Flickr30k and MSCOCO datasets, respectively. Additionally, the framework incorporates Google's text-to-speech synthesis to generate audio captions, enhancing accessibility for visually impaired users.
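For intuition, the following is a minimal sketch (not the authors' code) of the decoder-side step described in the abstract: Bahdanau-style additive attention over ViT patch embeddings feeding a step-wise GRU. Tensor shapes, layer sizes, and class names such as GRUCaptionDecoder are illustrative assumptions; the Joint Calibration Module and the full ADA training pipeline are not reproduced here.

import torch
import torch.nn as nn


class BahdanauAttention(nn.Module):
    """Additive attention: scores each visual token against the current
    decoder hidden state and returns a weighted context vector."""

    def __init__(self, feat_dim: int, hidden_dim: int, attn_dim: int = 256):
        super().__init__()
        self.w_feat = nn.Linear(feat_dim, attn_dim)
        self.w_hidden = nn.Linear(hidden_dim, attn_dim)
        self.v = nn.Linear(attn_dim, 1)

    def forward(self, feats: torch.Tensor, hidden: torch.Tensor):
        # feats: (B, N, feat_dim) patch embeddings, hidden: (B, hidden_dim)
        scores = self.v(torch.tanh(self.w_feat(feats) + self.w_hidden(hidden).unsqueeze(1)))
        alpha = torch.softmax(scores, dim=1)   # (B, N, 1) attention weights
        context = (alpha * feats).sum(dim=1)   # (B, feat_dim) weighted context
        return context, alpha.squeeze(-1)


class GRUCaptionDecoder(nn.Module):
    """Step-wise decoding path: at each step the attention module re-selects
    image regions before the GRU cell updates its state and emits a word."""

    def __init__(self, vocab_size: int, feat_dim: int = 768,
                 embed_dim: int = 300, hidden_dim: int = 512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.attention = BahdanauAttention(feat_dim, hidden_dim)
        self.gru = nn.GRUCell(embed_dim + feat_dim, hidden_dim)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, feats: torch.Tensor, tokens: torch.Tensor):
        # feats: (B, N, feat_dim) ViT patch embeddings; tokens: (B, T) word ids
        B, T = tokens.shape
        hidden = feats.new_zeros(B, self.gru.hidden_size)
        logits = []
        for t in range(T):
            context, _ = self.attention(feats, hidden)  # dynamic region selection
            step_in = torch.cat([self.embed(tokens[:, t]), context], dim=-1)
            hidden = self.gru(step_in, hidden)
            logits.append(self.out(hidden))
        return torch.stack(logits, dim=1)               # (B, T, vocab_size)


# Example forward pass with placeholder ViT-B/16 patch features (14 x 14 = 196 tokens).
feats = torch.randn(2, 196, 768)
tokens = torch.randint(0, 1000, (2, 12))
decoder = GRUCaptionDecoder(vocab_size=1000)
print(decoder(feats, tokens).shape)  # torch.Size([2, 12, 1000])

In this sketch the encoder (ViT-B/16 with the Joint Calibration Module) is assumed to have already produced the patch embeddings, so the two pathways remain decoupled as in the described ADA pipeline.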