Anwar ul Haque, Sayeed Ghani, Muhammad Arif Saeed
The human ability to detect, understand, and contextualize objects in the real world has long been a goal for computer scientists seeking to replicate this capability in machines. Image captioning that accounts for both content and context is a significant research problem. In this work, we develop a storytelling system that captions images while accounting for content, context, syntax, and knowledge. Our methodology combines Capsule Networks for image encoding, Knowledge Graphs for content and context awareness, and Transformer Neural Networks for decoding. During feature extraction, spatial, geometrical, and orientational details are captured by the Capsule Networks. The corpus is passed through the Knowledge Graph to enrich it with content, context, and semantics. The decoding phase combines the Knowledge Graph and the Transformer Neural Network for knowledge-driven captioning. Our model is trained on MSCOCO, Flickr 8k, and Flickr 32k, and tested on MSCOCO, Flickr 8k, Flickr 32k, and Google Images. The results show good content and context understanding, with BLEU-4: 49.97, METEOR: 39.14, CIDEr: 136.53, and ROUGE: 74.61. The placement of adverbs and adjectives within the generated sentences, driven by the objects’ geometrical and semantic relationships, is a notable strength of the model. The primary outcome of our research is the autonomous generation of story-type captions for real-world images.
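The encoder-decoder pipeline described above (capsule-network image encoder, knowledge-graph enrichment of the text input, transformer decoder) can be illustrated with a minimal PyTorch sketch. This is not the authors' implementation: the layer sizes, the toy `KG` dictionary, the `kg_fact_tokens` helper, and the tiny vocabulary are illustrative assumptions, and capsule dynamic routing is omitted for brevity.

```python
import torch
import torch.nn as nn

def squash(s, dim=-1, eps=1e-8):
    """Capsule squashing: keeps each vector's orientation, maps its length into (0, 1)."""
    norm2 = (s ** 2).sum(dim=dim, keepdim=True)
    return (norm2 / (1.0 + norm2)) * s / torch.sqrt(norm2 + eps)

class CapsuleImageEncoder(nn.Module):
    """Encodes an image into a small set of capsule vectors (illustrative sizes)."""
    def __init__(self, caps_dim=16, n_caps=32):
        super().__init__()
        self.conv = nn.Conv2d(3, 64, kernel_size=9, stride=2)
        self.primary = nn.Conv2d(64, n_caps * caps_dim, kernel_size=9, stride=2)
        self.caps_dim, self.n_caps = caps_dim, n_caps

    def forward(self, images):                                  # (B, 3, H, W)
        x = torch.relu(self.conv(images))
        x = self.primary(x)                                     # (B, n_caps*caps_dim, h, w)
        b = x.shape[0]
        x = x.view(b, self.n_caps, self.caps_dim, -1).mean(-1)  # pool over space
        return squash(x)                                        # (B, n_caps, caps_dim)

# Toy knowledge graph (assumed): detected object label -> (relation, object) facts.
KG = {"dog": [("chases", "ball")], "person": [("rides", "bicycle")]}

def kg_fact_tokens(label, word_to_id):
    """Flatten the facts for one detected object into token ids that can be
    prepended to the decoder input (a stand-in for KG-based enrichment)."""
    ids = []
    for rel, obj in KG.get(label, []):
        ids += [word_to_id[w] for w in (label, rel, obj)]
    return ids

class KGTransformerCaptioner(nn.Module):
    """Capsule encoder + transformer decoder that cross-attends to the capsule features."""
    def __init__(self, vocab_size, d_model=128, caps_dim=16):
        super().__init__()
        self.encoder = CapsuleImageEncoder(caps_dim=caps_dim)
        self.caps_proj = nn.Linear(caps_dim, d_model)           # capsules -> decoder memory
        self.embed = nn.Embedding(vocab_size, d_model)
        layer = nn.TransformerDecoderLayer(d_model, nhead=4, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=2)
        self.lm_head = nn.Linear(d_model, vocab_size)

    def forward(self, images, token_ids):                       # token_ids: (B, T)
        memory = self.caps_proj(self.encoder(images))           # (B, n_caps, d_model)
        tgt = self.embed(token_ids)
        T = token_ids.size(1)
        causal = torch.triu(torch.full((T, T), float("-inf")), diagonal=1)
        out = self.decoder(tgt, memory, tgt_mask=causal)
        return self.lm_head(out)                                # next-token logits

# Smoke test with random data and a tiny assumed vocabulary.
word_to_id = {w: i for i, w in enumerate(["<bos>", "dog", "chases", "ball"])}
prefix = [word_to_id["<bos>"]] + kg_fact_tokens("dog", word_to_id)
model = KGTransformerCaptioner(vocab_size=1000)
logits = model(torch.randn(1, 3, 128, 128), torch.tensor([prefix]))
print(logits.shape)                                             # torch.Size([1, 4, 1000])
```

In this sketch the knowledge-graph facts are injected simply as extra prefix tokens for the decoder; the paper's actual mechanism for combining the Knowledge Graph with the Transformer may differ.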