Image Captioning - A Deep Learning Approach Using CNN and LSTM Network

Preeti Voditel; Aparna Gurjar; Aakansha Pandey; Akrati Jain; Nandita Sharma; Nisha Dubey

doi:10.1109/icpcsn58827.2023.00062

JOURNAL ARTICLE

Year: 2023

Abstract

An image caption generator is a system that uses artificial intelligence and computer vision to analyze an image and generate a written description, or caption, for it. The caption is a brief description that accurately reflects the content of the image. The elements within the image are recognized and interpreted using deep learning techniques. Training a model on a dataset to assign English-language labels or descriptors to an image is known as image tagging. Tagging helps identify an image and its description for easier search and retrieval later. In this research, we propose a model that uses an encoder-decoder architecture to generate appropriate and grammatically correct captions for images. The model combines methods from image analysis and natural language processing and generation to examine and describe pictures. The encoder is VGG16, a Convolutional Neural Network (CNN) that has demonstrated exceptional performance in image recognition; it extracts the important features from the image. These features are fed into an LSTM (Long Short-Term Memory) network, a type of recurrent neural network, which generates a textual description of the image one word at a time. The model is trained on the Flickr8k Captions dataset, a set of labelled images with corresponding captions that supply the ground truth. After training, the model generates descriptions for a set of test images, and these generated captions are compared with the actual captions in the test dataset.
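The word-by-word generation step described above can be illustrated with a minimal sketch. It assumes greedy selection of the most likely next word, which the abstract does not confirm (beam search is a common alternative). The names `greedy_caption`, `next_word_probs`, and `toy_decoder`, and the `startseq`/`endseq` marker tokens, are hypothetical; `toy_decoder` is only a stand-in for a trained LSTM decoder conditioned on VGG16 features.

```python
from typing import Callable, Dict, List, Sequence

# Hypothetical marker tokens; the abstract does not name the ones the authors used.
START, END = "startseq", "endseq"


def greedy_caption(
    image_features: Sequence[float],
    next_word_probs: Callable[[Sequence[float], List[str]], Dict[str, float]],
    max_len: int = 20,
) -> List[str]:
    """Build a caption one word at a time, always taking the likeliest next word."""
    words = [START]
    for _ in range(max_len):
        # The decoder sees the image features plus every word generated so far.
        probs = next_word_probs(image_features, words)
        word = max(probs, key=probs.get)
        if word == END:
            break
        words.append(word)
    return words[1:]  # drop the start marker


def toy_decoder(features: Sequence[float], words_so_far: List[str]) -> Dict[str, float]:
    """Stand-in for a trained decoder: replays a fixed caption and ignores the features."""
    script = ["a", "dog", "runs", END]
    target = script[min(len(words_so_far) - 1, len(script) - 1)]
    return {w: (0.9 if w == target else 0.1 / 3) for w in script}
```

Calling `greedy_caption([0.0], toy_decoder)` walks the loop until the stand-in decoder emits the end marker. In a real system the callable would wrap the trained network's softmax output over the vocabulary.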
The comparison between generated and actual captions uses the BLEU score, a metric based on the n-gram overlap between a generated caption and its reference captions. The effectiveness of the model is judged by this score.
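As a rough illustration of how a BLEU score is computed, the sketch below implements sentence-level BLEU with clipped n-gram precision, uniform weights, and a brevity penalty. It is not the authors' evaluation code: the abstract does not say which BLEU variant (for example BLEU-1 through BLEU-4), smoothing method, or library was used, so the function names and the `max_n` default here are assumptions.

```python
import math
from collections import Counter
from typing import List, Sequence


def ngrams(tokens: Sequence[str], n: int) -> Counter:
    """Count every contiguous n-gram in a token sequence."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def bleu(candidate: List[str], references: List[List[str]], max_n: int = 4) -> float:
    """Sentence-level BLEU with uniform n-gram weights and no smoothing."""
    if not candidate or not references:
        return 0.0
    log_precisions = []
    for n in range(1, max_n + 1):
        cand_counts = ngrams(candidate, n)
        # Clip each n-gram count by its maximum count in any single reference.
        max_ref = Counter()
        for ref in references:
            for gram, count in ngrams(ref, n).items():
                max_ref[gram] = max(max_ref[gram], count)
        clipped = sum(min(c, max_ref[g]) for g, c in cand_counts.items())
        total = sum(cand_counts.values())
        if clipped == 0 or total == 0:
            return 0.0  # without smoothing, any zero precision gives BLEU = 0
        log_precisions.append(math.log(clipped / total))
    # Brevity penalty uses the reference length closest to the candidate length.
    c = len(candidate)
    r = min((len(ref) for ref in references), key=lambda length: (abs(length - c), length))
    bp = 1.0 if c > r else math.exp(1 - r / c)
    return bp * math.exp(sum(log_precisions) / max_n)
```

A generated caption is passed as a token list together with the list of tokenized reference captions for the same image, for example `bleu("a dog runs on grass".split(), [ref1, ref2])`. Because no smoothing is applied, short captions with no matching 4-gram score zero unless `max_n` is lowered.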

Keywords:

Computer science; Closed captioning; Artificial intelligence; Convolutional neural network; Deep learning; Image (mathematics); Pattern recognition (psychology); Word (group theory); Process (computing); Natural language processing; Computer vision

Metrics

FWCI (Field Weighted Citation Impact): 1.46

Citation Normalized Percentile: 0.79

Topics

Multimodal Machine Learning Applications

Physical Sciences → Computer Science → Computer Vision and Pattern Recognition

Human Pose and Action Recognition

Physical Sciences → Computer Science → Computer Vision and Pattern Recognition

Video Analysis and Summarization

Physical Sciences → Computer Science → Computer Vision and Pattern Recognition
