Generating sentence descriptions for images is an area of research combining computer vision and natural language processing. More recently, it has been driven by encoder-decoder deep learning approaches, in which visual features learned with a convolutional neural network (CNN) encoder are passed to a long short-term memory (LSTM) decoder for language generation. One major challenge in this approach is bridging the modality gap between the image and text data to improve the semantic correctness of the generated sentences. While researchers have explored different features to achieve this, scene features have been largely underutilised; where they have been used, they have been deployed with unidirectional LSTM decoders, which retain only past information and therefore perform poorly on long sequences. This research adopts a novel approach that leverages scene information together with a bidirectional LSTM decoder to produce more semantically correct image descriptions. Pretrained CNNs, InceptionV3 and Places365, are employed for object and scene image feature extraction respectively, before a bidirectional LSTM decoder is employed for language generation. The approach is validated through experiments on the Flickr8k benchmark dataset, and the results show improved performance compared with other encoder-decoder methods that use only global image features, outlining the complementary advantages of scene information and bidirectional LSTMs for image captioning tasks.
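To make the described architecture concrete, the following is a minimal sketch (not the authors' code) of fusing object and scene CNN features and decoding captions with a bidirectional LSTM in PyTorch. The feature dimensions, vocabulary size, and fusion by concatenation are assumptions for illustration; in the paper the object and scene features come from InceptionV3 and a Places365-pretrained CNN respectively.

```python
# Hedged sketch: fuse object + scene features, decode with a bidirectional LSTM.
# Dimensions and the concatenation-based fusion are illustrative assumptions.
import torch
import torch.nn as nn

class SceneAwareCaptioner(nn.Module):
    def __init__(self, obj_feat_dim=2048, scene_feat_dim=512,
                 embed_dim=256, hidden_dim=256, vocab_size=5000):
        super().__init__()
        # Project concatenated object + scene features into the word-embedding space
        self.visual_proj = nn.Linear(obj_feat_dim + scene_feat_dim, embed_dim)
        self.embedding = nn.Embedding(vocab_size, embed_dim)
        # Bidirectional LSTM decoder: reads the sequence forwards and backwards
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True,
                            bidirectional=True)
        self.fc = nn.Linear(2 * hidden_dim, vocab_size)

    def forward(self, obj_feats, scene_feats, captions):
        # obj_feats: (B, 2048) from an InceptionV3-style encoder
        # scene_feats: (B, 512) from a Places365-pretrained encoder
        visual = self.visual_proj(torch.cat([obj_feats, scene_feats], dim=1))
        words = self.embedding(captions)                         # (B, T, E)
        # Prepend the fused visual embedding as the first decoder input
        inputs = torch.cat([visual.unsqueeze(1), words], dim=1)  # (B, T+1, E)
        out, _ = self.lstm(inputs)                               # (B, T+1, 2H)
        return self.fc(out)                                      # (B, T+1, V)

# Usage with random tensors standing in for extracted image features
model = SceneAwareCaptioner()
obj = torch.randn(4, 2048)
scene = torch.randn(4, 512)
caps = torch.randint(0, 5000, (4, 12))
logits = model(obj, scene, caps)
print(logits.shape)  # torch.Size([4, 13, 5000])
```

In this sketch the two feature vectors are simply concatenated before projection; other fusion strategies (e.g. separate injection of scene features) would also fit the description in the abstract.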