JOURNAL ARTICLE

Visual Image Captioning through Transformer

Muneeb Nabi, Rohit Pachauri, Shouaib Ahmad, Kanishk Varshney, Prachi Goel, Apurva Jain

Year: 2023 | Journal: International Journal for Research in Applied Science and Engineering Technology | Vol: 11 (12) | Pages: 2047-2050 | Publisher: International Journal for Research in Applied Science and Engineering Technology (IJRASET)

Abstract

The convergence of computer vision and natural language processing in artificial intelligence has attracted significant interest in recent years, largely propelled by advances in deep learning. One notable application born of this synergy is the automatic description of images in English. Image captioning is the computer's ability to interpret the visual information in an image and translate it into one or more descriptive phrases. Generating meaningful descriptions requires understanding the state, properties, and relationships of the depicted objects, which demands a grasp of high-level image semantics.

Automatically captioning images is a complex task that intertwines image analysis with text generation. Central to the process is attention: deciding what to describe and in what order. While transformer architectures have proven successful in text analysis and machine translation, adapting them to image captioning presents unique challenges, because the semantic units of an image (typically regions identified by an object-detection model) are structured differently from those of a sentence (individual words). Little effort has been devoted to tailoring transformer architectures to the structural characteristics of images.

In this study, we introduce the Image Transformer, a novel architecture comprising a modified encoding transformer and an implicit decoding transformer. Our approach expands the inner architecture of the original transformer layer to better accommodate the structural nuances of images. Using only region features as input, our model achieves state-of-the-art performance on the MSCOCO dataset. This work employs a CNN-Transformer architecture to detect objects within images and convey the resulting information as text.

The envisioned application of this method extends to aiding individuals with visual impairments, using text-to-speech messages to facilitate their access to information and nurture their cognitive abilities. The paper explores the fundamental concepts of image captioning and its standard pipeline, introducing a generative CNN-Transformer model as a significant advancement in the field.
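The encoder described in the abstract attends over region features extracted by an object detector. As a minimal illustrative sketch (not the paper's actual Image Transformer implementation), the core operation is scaled dot-product attention applied to a set of region feature vectors; the region count and feature dimension below are arbitrary assumptions:

```python
# Minimal sketch of scaled dot-product attention over detected-region
# features, the core operation a transformer encoder applies to an image.
# Illustrative only; not the paper's Image Transformer implementation.
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Q, K, V: (n_regions, d) arrays. Returns attended features and weights."""
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)                   # pairwise region affinities
    scores -= scores.max(axis=-1, keepdims=True)    # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)  # row-wise softmax
    return weights @ V, weights

# Example: 5 detected regions, each a 16-dimensional feature vector
# (both numbers are arbitrary for illustration).
rng = np.random.default_rng(0)
regions = rng.standard_normal((5, 16))
out, attn = scaled_dot_product_attention(regions, regions, regions)
print(out.shape, attn.shape)  # (5, 16) (5, 5)
```

In self-attention, the same region features serve as queries, keys, and values, so each output row is a mixture of all region features weighted by their pairwise affinities; a full encoder would add multiple heads, learned projections, residual connections, and layer normalization.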

Keywords:
Closed captioning, Computer science, Transformer, Artificial intelligence, Natural language processing, Machine translation, Computer vision, Speech recognition, Image (mathematics), Engineering, Voltage

Metrics

Cited by: 1
FWCI (Field-Weighted Citation Impact): 0.18
References: 7
Citation Normalized Percentile: 0.47


Topics

Multimodal Machine Learning Applications
Physical Sciences →  Computer Science →  Computer Vision and Pattern Recognition
Topic Modeling
Physical Sciences →  Computer Science →  Artificial Intelligence
Advanced Image and Video Retrieval Techniques
Physical Sciences →  Computer Science →  Computer Vision and Pattern Recognition

Related Documents

JOURNAL ARTICLE
Visual contextual relationship augmented transformer for image captioning
Qiang Su, Junbo Hu, Zhixin Li
Journal: Applied Intelligence | Year: 2024 | Vol: 54 (6) | Pages: 4794-4813

JOURNAL ARTICLE
Dual-visual collaborative enhanced transformer for image captioning
Zhenping Mou, Tianqi Song, Luo Hong
Journal: Multimedia Systems | Year: 2025 | Vol: 31 (3)

JOURNAL ARTICLE
Visual spatial relationship sensitive transformer for image captioning
Xianghua Piao, Dong Jin, Min Jung Kwon, Yeong Hyeon Gu
Journal: Scientific Reports | Year: 2025 | Vol: 15 (1) | Pages: 44581-44581

JOURNAL ARTICLE
Indoor Visual Understanding through Image Captioning
Dhomas Hatta Fudholi, Royan Abida N. Nayoan
Journal: ASEAN Engineering Journal | Year: 2024 | Vol: 14 (1) | Pages: 137-144

JOURNAL ARTICLE
Image Captioning Method Based on Transformer Visual Features Fusion
Xuebing Bai, Jin Che, Jinman Wu, Yumin Chen
Journal: DOAJ (DOAJ: Directory of Open Access Journals) | Year: 2024