JOURNAL ARTICLE

Bidirectional LSTM approach to image captioning with scene features

Abstract

Generating sentence descriptions for images is a research area that combines computer vision and natural language processing. Recently, it has been driven by encoder-decoder deep learning approaches in which visual features learned by a convolutional neural network (CNN) encoder are passed to a long short-term memory (LSTM) decoder for language generation. One major challenge in this approach is bridging the modality gap between image and text data to improve the semantic correctness of the generated sentences. While researchers have explored different features to achieve this, scene exploratory features have been largely underutilised and, where utilised, have been deployed with unidirectional LSTM decoders, which retain only past information and therefore produce poor results for long sequences. This research adopts a novel approach that combines scene information with a bidirectional LSTM decoder to achieve more semantically correct image descriptions. The pretrained CNNs Inceptionv3 and Places365 are employed for object and scene feature extraction respectively, and a bidirectional LSTM decoder is then employed for language translation. The approach is validated through experiments on the Flickr8k benchmark dataset, and the results show improved performance compared with other encoder-decoder methods that use only global image features, highlighting the complementary advantages of scene information and bidirectional LSTMs for image captioning tasks.
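The pipeline described in the abstract can be illustrated with a minimal, dependency-free sketch. It is not the authors' implementation. The feature vectors are random stand-ins for Inceptionv3 and Places365 outputs, the weights are untrained, and all function names and sizes are hypothetical. Two further assumptions are not stated in the abstract: that object and scene features are fused by concatenation, and that the fused vector is appended to each word embedding before the recurrent layer.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def make_lstm(input_size, hidden_size, rng):
    """Random (untrained) weights for one LSTM direction; 4 gates over [x; h]."""
    width = input_size + hidden_size
    rows = 4 * hidden_size
    return {
        "hidden_size": hidden_size,
        "W": [[rng.uniform(-0.1, 0.1) for _ in range(width)] for _ in range(rows)],
        "b": [0.0] * rows,
    }


def lstm_step(params, x, h, c):
    """One LSTM time step: returns the new hidden state and cell state."""
    n = params["hidden_size"]
    xh = x + h  # list concatenation of input and previous hidden state
    z = [sum(w * v for w, v in zip(row, xh)) + b
         for row, b in zip(params["W"], params["b"])]
    i = [sigmoid(v) for v in z[0:n]]            # input gate
    f = [sigmoid(v) for v in z[n:2 * n]]        # forget gate
    o = [sigmoid(v) for v in z[2 * n:3 * n]]    # output gate
    g = [math.tanh(v) for v in z[3 * n:4 * n]]  # candidate cell values
    c_new = [fk * ck + ik * gk for fk, ck, ik, gk in zip(f, c, i, g)]
    h_new = [ok * math.tanh(ck) for ok, ck in zip(o, c_new)]
    return h_new, c_new


def run_lstm(params, sequence):
    """Run one direction over the sequence, collecting every hidden state."""
    n = params["hidden_size"]
    h, c = [0.0] * n, [0.0] * n
    outputs = []
    for x in sequence:
        h, c = lstm_step(params, x, h, c)
        outputs.append(h)
    return outputs


def bidirectional_lstm(fwd, bwd, sequence):
    """Concatenate forward states with backward states at each time step."""
    forward = run_lstm(fwd, sequence)
    backward = run_lstm(bwd, sequence[::-1])[::-1]  # re-align to original order
    return [f + b for f, b in zip(forward, backward)]


def fuse_features(object_features, scene_features):
    """Assumed fusion: plain concatenation of object and scene vectors."""
    return object_features + scene_features


rng = random.Random(0)
object_features = [rng.random() for _ in range(6)]  # stand-in for Inceptionv3 output
scene_features = [rng.random() for _ in range(4)]   # stand-in for Places365 output
image_vector = fuse_features(object_features, scene_features)

# Three caption tokens with hypothetical 5-dimensional embeddings.
word_embeddings = [[rng.random() for _ in range(5)] for _ in range(3)]
inputs = [emb + image_vector for emb in word_embeddings]

fwd = make_lstm(len(inputs[0]), 8, rng)
bwd = make_lstm(len(inputs[0]), 8, rng)
states = bidirectional_lstm(fwd, bwd, inputs)
print(len(states), len(states[0]))  # 3 16
```

Each time step yields a 16-dimensional state: 8 values summarising the tokens before it and 8 summarising the tokens after it. This is the property the abstract contrasts with unidirectional decoders, which retain only past information.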

Keywords:
Closed captioning, Computer science, Encoder, Artificial intelligence, Sentence, Convolutional neural network, Computer vision, Image (mathematics), Object (grammar), Speech recognition, Pattern recognition (psychology)

Metrics

Cited By: 6
FWCI (Field Weighted Citation Impact): 0.51
Refs: 20
Citation Normalized Percentile: 0.63

Topics

Multimodal Machine Learning Applications
Physical Sciences →  Computer Science →  Computer Vision and Pattern Recognition
Advanced Image and Video Retrieval Techniques
Physical Sciences →  Computer Science →  Computer Vision and Pattern Recognition
Human Pose and Action Recognition
Physical Sciences →  Computer Science →  Computer Vision and Pattern Recognition

Related Documents

JOURNAL ARTICLE

Image captioning using bidirectional LSTM neural network

Farnaz Hoseini, Anaram Yaghoobi Notash

Journal: Discover Artificial Intelligence Year: 2025 Vol: 5 (1)
JOURNAL ARTICLE

Image Captioning using Hybrid LSTM-RNN with Deep Features

Kalpana Deorukhkar, Satish Ket

Journal: Sensing and Imaging Year: 2022 Vol: 23 (1)
BOOK-CHAPTER

Topic Guided Image Captioning with Scene and Spatial Features

Usman Zia, Muhammad Mohsin Riaz, Abdul Ghafoor

Lecture notes in networks and systems Year: 2022 Pages: 180-191
JOURNAL ARTICLE

Image Captioning using Hybrid of VGG16 and Bidirectional LSTM Model

Yufis Azhar, M. Randy Anugerah, Muhammad Al Reza Fahlopy, Alfin Yusriansyah

Journal: Kinetik: Game Technology, Information System, Computer Network, Computing, Electronics, and Control Year: 2022