Abstract

An image caption generator is a system that uses artificial intelligence and computer vision to analyze an image and produce a brief written description that accurately reflects its content. The elements within the image are recognized and interpreted using deep learning techniques. Training a model on datasets so that it assigns English-language labels or descriptors to an image is known as image tagging; tagging makes an image and its description easier to search for and retrieve later. We propose a model that uses an encoder-decoder architecture to generate appropriate, grammatically correct captions for images. The model combines methods from image analysis and natural language processing and generation to examine and describe pictures, with the goal of producing captions that precisely convey image content. The encoder is VGG16, a Convolutional Neural Network (CNN) that has demonstrated exceptional performance in image recognition; it extracts the important features from the image. These features are fed into a Long Short-Term Memory (LSTM) network, a type of recurrent neural network, which generates a textual description of the image one word at a time. The model is trained on the Flickr8k Captions dataset, a set of labelled images with corresponding captions that serve as ground truth. After training, the model generates descriptions for a set of test images, and these generated captions are compared with the reference captions in the test dataset.
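The word-by-word generation step can be illustrated with a minimal greedy decoding sketch. It is not the authors' implementation: the boundary tokens `startseq`/`endseq`, the function `greedy_caption`, and the lookup-table stand-in `toy_next_word_probs` are all hypothetical names introduced here. In a real system the stand-in would be replaced by the trained decoder, which scores the next word given the VGG16 image features and the words generated so far.

```python
from typing import Callable, Dict, List, Sequence

# Hypothetical boundary tokens; the abstract does not name the ones actually used.
START, END = "startseq", "endseq"

NextWordFn = Callable[[Sequence[float], List[str]], Dict[str, float]]


def greedy_caption(image_features: Sequence[float],
                   next_word_probs: NextWordFn,
                   max_len: int = 20) -> List[str]:
    """Build a caption one word at a time, always taking the most likely word."""
    words = [START]
    for _ in range(max_len):
        probs = next_word_probs(image_features, words)
        word = max(probs, key=probs.get)  # greedy choice
        if word == END:
            break
        words.append(word)
    return words[1:]  # drop the start token


def toy_next_word_probs(image_features: Sequence[float],
                        words: List[str]) -> Dict[str, float]:
    """Stand-in for a trained decoder: a fixed table keyed on the previous word.

    A real decoder would also condition on image_features; this toy ignores them.
    """
    table = {
        "startseq": {"a": 0.9, "dog": 0.1},
        "a": {"dog": 0.8, "endseq": 0.2},
        "dog": {"runs": 0.7, "endseq": 0.3},
        "runs": {"endseq": 1.0},
    }
    return table[words[-1]]


if __name__ == "__main__":
    caption = greedy_caption([0.1, 0.2], toy_next_word_probs)
    print(" ".join(caption))  # prints "a dog runs"
```

Greedy selection is only one possible decoding strategy; the abstract does not say whether the authors use it or an alternative such as beam search.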
The comparison between generated and reference captions uses the BLEU score, which measures n-gram overlap between a generated caption and its references. The effectiveness of the model is judged by this score.
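To make the metric concrete, the following is a small pure-Python sketch of sentence-level BLEU: clipped (modified) n-gram precision for n = 1 to 4, combined by a geometric mean and multiplied by a brevity penalty. The abstract does not state which BLEU variant, n-gram order, or smoothing the authors used, so those choices here are assumptions, and the function names are illustrative only.

```python
import math
from collections import Counter
from typing import List, Sequence


def ngrams(tokens: Sequence[str], n: int) -> Counter:
    """Count the n-grams of a token sequence."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def modified_precision(candidate: Sequence[str],
                       references: List[Sequence[str]], n: int) -> float:
    """Clipped n-gram precision: each candidate n-gram is credited at most
    as many times as it appears in any single reference."""
    cand = ngrams(candidate, n)
    if not cand:
        return 0.0
    max_ref: Counter = Counter()
    for ref in references:
        for gram, count in ngrams(ref, n).items():
            max_ref[gram] = max(max_ref[gram], count)
    clipped = sum(min(count, max_ref[gram]) for gram, count in cand.items())
    return clipped / sum(cand.values())


def brevity_penalty(candidate: Sequence[str],
                    references: List[Sequence[str]]) -> float:
    """Penalize candidates shorter than the closest reference length."""
    c = len(candidate)
    if c == 0:
        return 0.0
    # Closest reference length; ties go to the shorter reference.
    r = min((len(ref) for ref in references), key=lambda length: (abs(length - c), length))
    return 1.0 if c > r else math.exp(1 - r / c)


def bleu(candidate: Sequence[str], references: List[Sequence[str]],
         max_n: int = 4) -> float:
    """Unsmoothed sentence-level BLEU with uniform n-gram weights."""
    precisions = [modified_precision(candidate, references, n)
                  for n in range(1, max_n + 1)]
    if min(precisions) == 0.0:
        return 0.0  # without smoothing, any zero precision gives a zero score
    log_avg = sum(math.log(p) for p in precisions) / max_n
    return brevity_penalty(candidate, references) * math.exp(log_avg)


if __name__ == "__main__":
    refs = [["a", "dog", "runs", "on", "the", "grass"]]
    print(bleu(["a", "dog", "runs", "on", "grass"], refs, max_n=2))
```

Because short captions often share no 4-gram with their references, unsmoothed sentence-level scores are frequently zero; evaluations commonly report corpus-level BLEU or use a smoothed variant instead.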

Keywords:
Computer science; Closed captioning; Artificial intelligence; Convolutional neural network; Deep learning; Image (mathematics); Pattern recognition (psychology); Word (group theory); Process (computing); Natural language processing; Computer vision

Metrics

Cited By: 8
FWCI (Field Weighted Citation Impact): 1.46
Refs: 14
Citation Normalized Percentile: 0.79

Topics

Multimodal Machine Learning Applications (Physical Sciences → Computer Science → Computer Vision and Pattern Recognition)
Human Pose and Action Recognition (Physical Sciences → Computer Science → Computer Vision and Pattern Recognition)
Video Analysis and Summarization (Physical Sciences → Computer Science → Computer Vision and Pattern Recognition)

© 2026 ScienceGate Book Chapters — All rights reserved.