Image captioning generates textual descriptions of visual content and can substantially extend the capabilities of conversational AI systems. This work integrates image captioning into ChatGPT to strengthen its ability to understand and respond to visual information. The study examines several deep learning models for image captioning, including ResNet50, LSTM, DenseNet121, MobileNet, and MobileNetV2. In particular, it investigates an encoder-decoder architecture in which a Convolutional Neural Network (ResNet) serves as the encoder and a Recurrent Neural Network with LSTM units serves as the decoder. This combination draws on both the vocabulary and the extracted image features to produce precise and meaningful descriptions of visual content. The study also introduces an approach that identifies at least two salient features in a given image and relates them in a single coherent caption that expresses the relationship between them. This capability refines image captioning and enables ChatGPT to interpret complex visual contexts in conversational settings. The results offer insight into how AI systems can be augmented to understand and interact with visual information more effectively across domains, advancing the integration of visual context into conversational AI.
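As a rough illustration of the encoder-decoder flow described above, the following sketch shows how an image feature vector can initialize an LSTM decoder that emits a caption word by word with greedy selection. It is a toy under stated assumptions, not the implementation used in this work: the weights are random and untrained, the short `features` list stands in for the pooled output of a ResNet encoder, and the names `ToyCaptioner`, `VOCAB`, `HIDDEN`, and `FEAT` are hypothetical.

```python
import math
import random

# Hypothetical toy vocabulary; a real system builds this from the caption corpus.
VOCAB = ["<start>", "<end>", "a", "dog", "chases", "ball"]
HIDDEN = 8   # toy hidden size; practical models use hundreds of units
FEAT = 16    # stand-in for the much larger pooled CNN feature vector


def rand_matrix(rows, cols, rng):
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]


def matvec(m, v):
    return [sum(w * x for w, x in zip(row, v)) for row in m]


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


class ToyCaptioner:
    """Untrained encoder-decoder skeleton: image features -> LSTM -> words."""

    def __init__(self, seed=0):
        rng = random.Random(seed)
        self.embed = rand_matrix(len(VOCAB), HIDDEN, rng)   # word embeddings
        self.w_init = rand_matrix(HIDDEN, FEAT, rng)        # features -> initial hidden state
        # One weight matrix per LSTM gate, applied to [input; previous hidden].
        self.gates = {name: rand_matrix(HIDDEN, 2 * HIDDEN, rng) for name in "ifog"}
        self.w_out = rand_matrix(len(VOCAB), HIDDEN, rng)   # hidden -> vocabulary scores

    def lstm_step(self, x, h, c):
        z = x + h  # concatenate input embedding and previous hidden state
        i = [sigmoid(a) for a in matvec(self.gates["i"], z)]    # input gate
        f = [sigmoid(a) for a in matvec(self.gates["f"], z)]    # forget gate
        o = [sigmoid(a) for a in matvec(self.gates["o"], z)]    # output gate
        g = [math.tanh(a) for a in matvec(self.gates["g"], z)]  # candidate cell state
        c = [fj * cj + ij * gj for fj, cj, ij, gj in zip(f, c, i, g)]
        h = [oj * math.tanh(cj) for oj, cj in zip(o, c)]
        return h, c

    def caption(self, features, max_len=10):
        # The image features condition the decoder through its initial hidden state.
        h = [math.tanh(a) for a in matvec(self.w_init, features)]
        c = [0.0] * HIDDEN
        token = VOCAB.index("<start>")
        words = []
        for _ in range(max_len):
            h, c = self.lstm_step(self.embed[token], h, c)
            scores = matvec(self.w_out, h)
            token = max(range(len(VOCAB)), key=scores.__getitem__)  # greedy choice
            if VOCAB[token] == "<end>":
                break
            words.append(VOCAB[token])
        return words


# Placeholder features; in practice these would come from the CNN encoder.
feature_rng = random.Random(1)
features = [feature_rng.uniform(0.0, 1.0) for _ in range(FEAT)]
print(ToyCaptioner(seed=0).caption(features))
```

Because the weights are random, the printed words are arbitrary; the sketch is meant only to show the data flow. Conditioning the decoder through its initial hidden state is one common design choice; feeding the image features as the first input token is another.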
Wei Zhang, Wenbo Nie, Xinle Li, Yao Yu
Chiranji Lal Chowdhary, Aman Goyal, Bhavesh Kumar Vasnani