Biswajit Patra, Dakshina Ranjan Kisku
This paper proposes a novel approach to enhance image captioning by leveraging an Asynchronous Dual Attention (ADA) mechanism within a Vision Transformer (ViT)-based framework. Traditional deep-learning models for image captioning often struggle with multimodal interactions and with capturing local-to-global visual contexts, including both prominent and subtle features. To address this, the proposed model integrates global self-attention (ViT-B/16) with a Joint Calibration Module during image encoding to enhance the quality of visual embeddings, and combines dynamic step-wise attention (Bahdanau) with a Gated Recurrent Unit (GRU) during decoding. This forms an ADA pipeline that decouples the visual and linguistic pathways, allowing adaptive refinement of visual features and more precise alignment with linguistic context. Unlike synchronous attention models, ADA enables dynamic image-region selection and improved spatial reasoning through enhanced multimodal interaction, leading to more contextually coherent and informative captions for complex visual scenes. The proposed approach demonstrates consistent improvement over state-of-the-art methods on benchmark datasets, achieving CIDEr scores of 0.946 and 1.364 and SPICE scores of 0.188 and 0.248 on the Flickr30k and MSCOCO datasets, respectively. Additionally, the framework incorporates Google's text-to-speech synthesis to generate audio captions, enhancing accessibility for visually impaired users.
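For intuition, the following is a minimal sketch (not the authors' code) of the decoder-side step described in the abstract: Bahdanau-style additive attention over ViT patch embeddings feeding a step-wise GRU. Tensor shapes, layer sizes, and class names such as GRUCaptionDecoder are illustrative assumptions; the Joint Calibration Module and the full ADA training pipeline are not reproduced here.

import torch
import torch.nn as nn


class BahdanauAttention(nn.Module):
    """Additive attention: scores each visual token against the current
    decoder hidden state and returns a weighted context vector."""

    def __init__(self, feat_dim: int, hidden_dim: int, attn_dim: int = 256):
        super().__init__()
        self.w_feat = nn.Linear(feat_dim, attn_dim)
        self.w_hidden = nn.Linear(hidden_dim, attn_dim)
        self.v = nn.Linear(attn_dim, 1)

    def forward(self, feats: torch.Tensor, hidden: torch.Tensor):
        # feats: (B, N, feat_dim) patch embeddings, hidden: (B, hidden_dim)
        scores = self.v(torch.tanh(self.w_feat(feats) + self.w_hidden(hidden).unsqueeze(1)))
        alpha = torch.softmax(scores, dim=1)   # (B, N, 1) attention weights
        context = (alpha * feats).sum(dim=1)   # (B, feat_dim) weighted context
        return context, alpha.squeeze(-1)


class GRUCaptionDecoder(nn.Module):
    """Step-wise decoding path: at each step the attention module re-selects
    image regions before the GRU cell updates its state and emits a word."""

    def __init__(self, vocab_size: int, feat_dim: int = 768,
                 embed_dim: int = 300, hidden_dim: int = 512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.attention = BahdanauAttention(feat_dim, hidden_dim)
        self.gru = nn.GRUCell(embed_dim + feat_dim, hidden_dim)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, feats: torch.Tensor, tokens: torch.Tensor):
        # feats: (B, N, feat_dim) ViT patch embeddings; tokens: (B, T) word ids
        B, T = tokens.shape
        hidden = feats.new_zeros(B, self.gru.hidden_size)
        logits = []
        for t in range(T):
            context, _ = self.attention(feats, hidden)  # dynamic region selection
            step_in = torch.cat([self.embed(tokens[:, t]), context], dim=-1)
            hidden = self.gru(step_in, hidden)
            logits.append(self.out(hidden))
        return torch.stack(logits, dim=1)               # (B, T, vocab_size)


# Example forward pass with placeholder ViT-B/16 patch features (14 x 14 = 196 tokens).
feats = torch.randn(2, 196, 768)
tokens = torch.randint(0, 1000, (2, 12))
decoder = GRUCaptionDecoder(vocab_size=1000)
print(decoder(feats, tokens).shape)  # torch.Size([2, 12, 1000])

In this sketch the encoder (ViT-B/16 with the Joint Calibration Module) is assumed to have already produced the patch embeddings, so the two pathways remain decoupled as in the described ADA pipeline.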