Nguyễn Văn Thịnh, Lang Tran, Thanh The Van
Image captioning is an important task that bridges computer vision and natural language processing. However, methods based on long short-term memory (LSTM) networks and traditional attention mechanisms are limited in modeling complex relationships and in their capacity for parallelization. Moreover, accurately describing objects absent from the training set poses a significant challenge. This study proposes a novel image captioning model that uses a Transformer with cross-attention mechanisms combined with semantic knowledge from ConceptNet to address these issues. The model adopts an encoder-decoder framework: the encoder extracts object-region features and constructs a relational graph to represent the image, while the decoder integrates visual and semantic features through cross-attention to generate precise and diverse captions. Integrating ConceptNet knowledge improves accuracy, particularly for objects not present in the training set. Experimental results on the MS COCO benchmark dataset demonstrate that the model outperforms recent state-of-the-art approaches. Furthermore, this study's method of integrating semantic knowledge can be readily applied to other image captioning models.
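The decoder's fusion step described above can be sketched minimally: caption-token states act as queries that attend separately over visual region features and over semantic concept embeddings, and the two context vectors are combined. This is a hedged illustration, not the paper's implementation; the dimensions, the ConceptNet embedding matrix, and the additive fusion are all assumptions made for the sketch.

```python
import numpy as np

def softmax(x, axis=-1):
    # numerically stable softmax along the given axis
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(queries, keys, values):
    # scaled dot-product attention: queries attend over keys/values
    d = queries.shape[-1]
    scores = queries @ keys.T / np.sqrt(d)
    return softmax(scores, axis=-1) @ values

rng = np.random.default_rng(0)
d = 16
text_states  = rng.normal(size=(5, d))   # decoder states for 5 caption tokens
region_feats = rng.normal(size=(36, d))  # visual object-region features (assumed 36 regions)
concept_embs = rng.normal(size=(10, d))  # hypothetical ConceptNet concept embeddings

# two cross-attention passes (visual, then semantic), fused additively
visual_ctx   = cross_attention(text_states, region_feats, region_feats)
semantic_ctx = cross_attention(text_states, concept_embs, concept_embs)
fused = text_states + visual_ctx + semantic_ctx
print(fused.shape)  # (5, 16)
```

In a real model the queries, keys, and values would pass through learned projections and multiple heads; the sketch keeps only the attention arithmetic to show how visual and semantic sources feed the same decoder step.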