Generating sentence descriptions for images is an area of research combining computer vision and natural language processing. More recently, it has been driven by encoder-decoder deep learning approaches, in which visual features learned with a convolutional neural network (CNN) encoder are passed to a long short-term memory (LSTM) decoder for language generation. One major challenge in this approach is bridging the modality gap between the image and text data to improve the semantic correctness of the generated sentences. While researchers have explored different features to achieve this, scene features have been largely underutilised; where they have been used, they have been deployed with unidirectional LSTM decoders, which retain only past information and therefore perform poorly on long sequences. This research adopts a novel approach that leverages scene information together with a bidirectional LSTM decoder to produce more semantically correct image descriptions. Pretrained CNNs, InceptionV3 and Places365, are employed for object and scene image feature extraction respectively, before a bidirectional LSTM decoder is employed for language generation. The approach is validated through experiments on the Flickr8k benchmark dataset, and the results show improved performance compared with other encoder-decoder methods that use only global image features, outlining the complementary advantages of scene information and bidirectional LSTMs for image captioning tasks.
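To make the described architecture concrete, the following is a minimal sketch (not the authors' code) of fusing object and scene CNN features and decoding captions with a bidirectional LSTM in PyTorch. The feature dimensions, vocabulary size, and fusion by concatenation are assumptions for illustration; in the paper the object and scene features come from InceptionV3 and a Places365-pretrained CNN respectively.

```python
# Hedged sketch: fuse object + scene features, decode with a bidirectional LSTM.
# Dimensions and the concatenation-based fusion are illustrative assumptions.
import torch
import torch.nn as nn

class SceneAwareCaptioner(nn.Module):
    def __init__(self, obj_feat_dim=2048, scene_feat_dim=512,
                 embed_dim=256, hidden_dim=256, vocab_size=5000):
        super().__init__()
        # Project concatenated object + scene features into the word-embedding space
        self.visual_proj = nn.Linear(obj_feat_dim + scene_feat_dim, embed_dim)
        self.embedding = nn.Embedding(vocab_size, embed_dim)
        # Bidirectional LSTM decoder: reads the sequence forwards and backwards
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True,
                            bidirectional=True)
        self.fc = nn.Linear(2 * hidden_dim, vocab_size)

    def forward(self, obj_feats, scene_feats, captions):
        # obj_feats: (B, 2048) from an InceptionV3-style encoder
        # scene_feats: (B, 512) from a Places365-pretrained encoder
        visual = self.visual_proj(torch.cat([obj_feats, scene_feats], dim=1))
        words = self.embedding(captions)                         # (B, T, E)
        # Prepend the fused visual embedding as the first decoder input
        inputs = torch.cat([visual.unsqueeze(1), words], dim=1)  # (B, T+1, E)
        out, _ = self.lstm(inputs)                               # (B, T+1, 2H)
        return self.fc(out)                                      # (B, T+1, V)

# Usage with random tensors standing in for extracted image features
model = SceneAwareCaptioner()
obj = torch.randn(4, 2048)
scene = torch.randn(4, 512)
caps = torch.randint(0, 5000, (4, 12))
logits = model(obj, scene, caps)
print(logits.shape)  # torch.Size([4, 13, 5000])
```

In this sketch the two feature vectors are simply concatenated before projection; other fusion strategies (e.g. separate injection of scene features) would also fit the description in the abstract.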