JOURNAL ARTICLE

Visual Image Captioning through Transformer

Muneeb Nabi, Rohit Pachauri, Shouaib Ahmad, Kanishk Varshney, Prachi Goel, Apurva Jain

Year: 2023 | Journal: International Journal for Research in Applied Science and Engineering Technology | Vol: 11 (12) | Pages: 2047-2050 | Publisher: International Journal for Research in Applied Science and Engineering Technology (IJRASET)

Abstract

The convergence of computer vision and natural language processing in artificial intelligence has attracted significant interest in recent years, largely propelled by advances in deep learning. One notable application born of this synergy is the automatic description of images in English. Image captioning is the computer's ability to interpret the visual information in an image and translate it into one or more descriptive phrases. Generating meaningful descriptions requires understanding the state, properties, and relationships of the depicted objects, which demands a grasp of high-level image semantics.

Automatically captioning images is a complex task that intertwines image analysis with text generation. Central to the process is attention: deciding what to describe and in what order. While transformer architectures have proven successful in text analysis and machine translation, adapting them to image captioning presents unique challenges, because the semantic units of an image (typically regions identified by an object-detection model) are structured differently from those of a sentence (individual words). Little effort has been devoted to tailoring transformer architectures to the structural characteristics of images.

In this study, we introduce the Image Transformer, a novel architecture comprising a modified encoding transformer and an implicit decoding transformer. Our approach expands the inner architecture of the original transformer layer to better accommodate the structural nuances of images. Using only region features as input, our model achieves state-of-the-art performance on the MSCOCO dataset. This work employs a CNN-Transformer architecture to detect objects within images and convey the resulting information as text.

The envisioned application of this method extends to aiding individuals with visual impairments, using text-to-speech messages to facilitate their access to information and nurture their cognitive abilities. The paper explores the fundamental concepts of image captioning and its standard pipeline, introducing a generative CNN-Transformer model as a significant advancement in the field.
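The encoder described in the abstract attends over region features extracted by an object detector. As a minimal illustrative sketch (not the paper's actual Image Transformer implementation), the core operation is scaled dot-product attention applied to a set of region feature vectors; the region count and feature dimension below are arbitrary assumptions:

```python
# Minimal sketch of scaled dot-product attention over detected-region
# features, the core operation a transformer encoder applies to an image.
# Illustrative only; not the paper's Image Transformer implementation.
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Q, K, V: (n_regions, d) arrays. Returns attended features and weights."""
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)                   # pairwise region affinities
    scores -= scores.max(axis=-1, keepdims=True)    # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)  # row-wise softmax
    return weights @ V, weights

# Example: 5 detected regions, each a 16-dimensional feature vector
# (both numbers are arbitrary for illustration).
rng = np.random.default_rng(0)
regions = rng.standard_normal((5, 16))
out, attn = scaled_dot_product_attention(regions, regions, regions)
print(out.shape, attn.shape)  # (5, 16) (5, 5)
```

In self-attention, the same region features serve as queries, keys, and values, so each output row is a mixture of all region features weighted by their pairwise affinities; a full encoder would add multiple heads, learned projections, residual connections, and layer normalization.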

Keywords:
Closed captioning, Computer science, Transformer, Artificial intelligence, Natural language processing, Machine translation, Computer vision, Speech recognition, Image (mathematics), Engineering, Voltage

Metrics

Cited by: 1
FWCI (Field-Weighted Citation Impact): 0.18
References: 7
Citation Normalized Percentile: 0.47


Topics

Multimodal Machine Learning Applications
Physical Sciences →  Computer Science →  Computer Vision and Pattern Recognition
Topic Modeling
Physical Sciences →  Computer Science →  Artificial Intelligence
Advanced Image and Video Retrieval Techniques
Physical Sciences →  Computer Science →  Computer Vision and Pattern Recognition

Related Documents

JOURNAL ARTICLE
Visual contextual relationship augmented transformer for image captioning
Qiang Su, Junbo Hu, Zhixin Li
Journal: Applied Intelligence | Year: 2024 | Vol: 54 (6) | Pages: 4794-4813

JOURNAL ARTICLE
Dual-visual collaborative enhanced transformer for image captioning
Zhenping Mou, Tianqi Song, Luo Hong
Journal: Multimedia Systems | Year: 2025 | Vol: 31 (3)

JOURNAL ARTICLE
Visual spatial relationship sensitive transformer for image captioning
Xianghua Piao, Dong Jin, Min Jung Kwon, Yeong Hyeon Gu
Journal: Scientific Reports | Year: 2025 | Vol: 15 (1) | Pages: 44581-44581

JOURNAL ARTICLE
Indoor Visual Understanding through Image Captioning
Dhomas Hatta Fudholi, Royan Abida N. Nayoan
Journal: ASEAN Engineering Journal | Year: 2024 | Vol: 14 (1) | Pages: 137-144

JOURNAL ARTICLE
Image Captioning Method Based on Transformer Visual Features Fusion
Xuebing Bai, Jin Che, Jinman Wu, Yumin Chen
Journal: DOAJ (DOAJ: Directory of Open Access Journals) | Year: 2024