JOURNAL ARTICLE

Multimodal Fusion of Transformer with Attention Mechanism for Improved Contextual Image Captioning

Abstract

Image captioning provides automatic annotations or labels for images, describing the objects, scenes, or activities they contain. This is useful in applications that require large-scale image analysis, such as image recognition, recommendation systems, and content classification. Traditional methods for generating image descriptions rely on rule-based systems, which are time-consuming and fail to capture the contextual details of complex images. The proposed method combines two state-of-the-art components: an attention mechanism and a Transformer. The attention mechanism supports multimodal understanding by bridging the domains of vision and language processing, and it leverages pretraining on large-scale datasets to capture semantic relationships between images and text. By combining the multimodal understanding of the attention mechanism with the sequence-modelling capabilities of the Transformer, superior performance on the image captioning task can be achieved. Adding more Transformer layers allows the model to better capture complicated contextual dependencies in the input captions. The system's performance is assessed using BLEU-1, BLEU-2, BLEU-3, and BLEU-4, along with other objective evaluation criteria.
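
Since the abstract names BLEU-1 through BLEU-4 as evaluation criteria, a minimal sketch of how such scores are commonly computed may help the reader. This is not the authors' evaluation code: it assumes whitespace-tokenized captions, scores a single sentence rather than a corpus, and applies no smoothing, so any n-gram order with zero matches drives the score to 0. The function names (`ngrams`, `modified_precision`, `brevity_penalty`, `bleu`) are hypothetical and chosen only for illustration.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Count every n-gram (as a tuple) in a list of tokens."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def modified_precision(candidate, references, n):
    """Clipped n-gram precision: each candidate n-gram count is capped by
    the largest count of that n-gram in any single reference."""
    cand_counts = ngrams(candidate, n)
    if not cand_counts:
        return 0.0
    max_ref = Counter()
    for ref in references:
        for gram, count in ngrams(ref, n).items():
            max_ref[gram] = max(max_ref[gram], count)
    clipped = sum(min(count, max_ref[gram]) for gram, count in cand_counts.items())
    return clipped / sum(cand_counts.values())


def brevity_penalty(candidate, references):
    """Penalize candidates shorter than the closest reference length."""
    c = len(candidate)
    if c == 0:
        return 0.0
    # Closest reference length; ties are broken toward the shorter reference.
    r = min((abs(len(ref) - c), len(ref)) for ref in references)[1]
    return 1.0 if c > r else math.exp(1 - r / c)


def bleu(candidate, references, max_n):
    """Sentence-level BLEU-max_n with uniform weights and no smoothing
    (an assumption of this sketch; toolkits differ on smoothing)."""
    precisions = [modified_precision(candidate, references, n)
                  for n in range(1, max_n + 1)]
    if min(precisions) == 0:
        return 0.0
    log_avg = sum(math.log(p) for p in precisions) / max_n
    return brevity_penalty(candidate, references) * math.exp(log_avg)


if __name__ == "__main__":
    candidate = "a dog runs on the grass".split()
    references = ["a dog is running on the grass".split()]
    for n in range(1, 5):
        print(f"BLEU-{n}: {bleu(candidate, references, n):.4f}")
```

Reported BLEU figures are usually corpus-level and often smoothed, so scores from this sketch are not directly comparable with published numbers.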

Keywords:
Closed captioning; Computer science; Transformer; Mechanism (biology); Artificial intelligence; Fusion; Computer vision; Natural language processing; Image (mathematics); Linguistics; Engineering

Metrics

Cited By: 0
FWCI (Field Weighted Citation Impact): 0.00
Refs: 15
Citation Normalized Percentile: 0.21

Topics

Multimodal Machine Learning Applications (Physical Sciences → Computer Science → Computer Vision and Pattern Recognition)
Advanced Image and Video Retrieval Techniques (Physical Sciences → Computer Science → Computer Vision and Pattern Recognition)
Visual Attention and Saliency Detection (Physical Sciences → Computer Science → Computer Vision and Pattern Recognition)

Related Documents

JOURNAL ARTICLE

Multimodal learning with feature fusion transformer for image captioning

Wenqing Zhu, Feiniu Yuan

Journal: Displays, Year: 2025, Vol: 90, Pages: 103126-103126
JOURNAL ARTICLE

Transformer with sparse self‐attention mechanism for image captioning

Duofeng Wang, Haifeng Hu, Dihu Chen

Journal: Electronics Letters, Year: 2020, Vol: 56 (15), Pages: 764-766
BOOK-CHAPTER

Dual Transformer with Gated-Attention Fusion for News Disaster Image Captioning

Yinghua Li, Yaping Zhu, Yana Zhang, Cheng Yang

Communications in Computer and Information Science, Year: 2024, Pages: 193-207
JOURNAL ARTICLE

Cross on Cross Attention: Deep Fusion Transformer for Image Captioning

Jing Zhang, Yingshuai Xie, Weichao Ding, Zhe Wang

Journal: IEEE Transactions on Circuits and Systems for Video Technology, Year: 2023, Vol: 33 (8), Pages: 4257-4268