Image captioning automatically generates annotations or labels for images, describing the objects, scenes, or activities they contain. This is beneficial in applications that require large-scale image analysis, such as image recognition, recommendation systems, and content classification. Traditional methods for generating image descriptions rely on rule-based systems, which are time-consuming and cannot capture the contextual details of complex images. The proposed method combines the power of two state-of-the-art components: the attention mechanism and the Transformer. The attention mechanism is a powerful tool for multimodal understanding that bridges vision and language processing, leveraging pretraining on large-scale datasets to capture semantic relationships between images and text. By combining the multimodal understanding of the attention mechanism with the sequence-modelling capabilities of the Transformer, superior performance on the image captioning task can be achieved. Adding more Transformer layers enables the model to better capture complicated contextual dependencies in input captions. The proposed system's performance is assessed using BLEU-1, BLEU-2, BLEU-3, BLEU-4, and other objective evaluation criteria.
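The core operation behind the attention mechanism mentioned above is scaled dot-product attention, in which each query produces a softmax-weighted average over a set of value vectors. The following is a minimal pure-Python sketch for illustration only; the function names and list-of-lists representation are our own assumptions, not the authors' implementation.

```python
import math

def softmax(xs):
    # Numerically stable softmax over a list of scores.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def scaled_dot_product_attention(Q, K, V):
    """Illustrative sketch (not the paper's code).
    Q, K, V are lists of vectors (lists of floats).
    Each query attends over the keys; the output is a
    softmax-weighted average of the value vectors."""
    d_k = len(K[0])
    out = []
    for q in Q:
        # Dot-product similarity of the query with every key,
        # scaled by sqrt(d_k) as in the Transformer.
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k)
                  for k in K]
        weights = softmax(scores)
        # Weighted average of the value vectors.
        out.append([sum(w * v[j] for w, v in zip(weights, V))
                    for j in range(len(V[0]))])
    return out
```

When all keys are identical, every value receives equal weight and the output is simply the mean of the value vectors, which makes the weighting behaviour easy to verify by hand.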