The automatic generation of image descriptions sits at the intersection of computer vision and natural language processing. Image captioning requires a semantic understanding of an image and the ability to produce grammatically well-formed descriptions. It is a complex problem because it often demands information that is not directly visible in the scene, together with the reasoning and background knowledge needed to interpret the objects an image contains. In this study, we developed a multilayer Convolutional Neural Network (CNN) to produce words describing an image, and a Long Short-Term Memory (LSTM) network to assemble those words into coherent sentences. To generate an accurate description, the CNN first compares the target image against a large set of training samples; here we used the Flickr8k dataset. We evaluated the generated captions with the Bilingual Evaluation Understudy (BLEU) metric, which was originally designed to score machine-translated text against human reference translations. We also used two pre-trained models (VGG16 and XceptionV3) for a comparative study.
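To make the evaluation metric concrete, the sketch below is a minimal, self-contained sentence-level BLEU computation: clipped n-gram precisions up to 4-grams combined by a geometric mean, times a brevity penalty. It is illustrative only; the abstract does not specify the exact BLEU variant used, and real evaluations typically use smoothed, corpus-level implementations such as NLTK's `corpus_bleu` or sacreBLEU. The example captions and the `bleu` helper are hypothetical.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """All contiguous n-grams of a token list, as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def bleu(candidate, reference, max_n=4):
    """Unsmoothed sentence-level BLEU of a candidate caption
    against a single reference caption (both lists of tokens)."""
    log_precisions = []
    for n in range(1, max_n + 1):
        cand_counts = Counter(ngrams(candidate, n))
        ref_counts = Counter(ngrams(reference, n))
        # Clipped counts: a candidate n-gram is credited at most as
        # often as it occurs in the reference.
        overlap = sum(min(c, ref_counts[g]) for g, c in cand_counts.items())
        total = sum(cand_counts.values())
        if total == 0 or overlap == 0:
            return 0.0  # unsmoothed BLEU is zero if any precision is zero
        log_precisions.append(math.log(overlap / total))
    # Brevity penalty: penalize captions shorter than the reference.
    if len(candidate) >= len(reference):
        bp = 1.0
    else:
        bp = math.exp(1 - len(reference) / len(candidate))
    return bp * math.exp(sum(log_precisions) / max_n)

reference = "a brown dog runs across the green field".split()
candidate = "a brown dog runs across the field".split()
print(round(bleu(candidate, reference), 3))
```

A perfect match scores 1.0, while missing or reordered words lower the clipped precisions; averaging such scores over every image in the test split gives the figure usually reported for a captioning model.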
Junaid Ahmad Wani, Sahilpreet Singh