Image captioning has become an important technology for intelligent robots to understand image content. Extracting image information effectively is key to generating accurate and reliable captions. In this paper, we propose a dual self-attention network (DSAN) for image captioning. Specifically, we design a Dual Self-Attention Module (DSAM), embedded in an encoder-decoder architecture, that captures contextual information in the image and adaptively integrates local features with their global dependencies. By modeling rich contextual dependencies over local features, the DSAM significantly improves caption quality. Experimental results on the MS COCO dataset show that the proposed DSAN outperforms existing methods.
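The abstract does not spell out DSAM's internal structure. A common way to realize "dual" self-attention over convolutional features (popularized by dual-attention networks such as DANet) is to combine a position-attention branch, where every spatial location attends to all others, with a channel-attention branch, where every channel attends to all others, and fuse both with the input. The sketch below illustrates that general pattern under those assumptions; the function names and the residual fusion weights are hypothetical, not the authors' exact design.

```python
import numpy as np

def softmax(x, axis=-1):
    # numerically stable softmax along the given axis
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def position_attention(feats):
    # feats: (C, N) feature map with N = H*W flattened spatial positions.
    # Each position attends to every other position (global spatial context).
    energy = feats.T @ feats            # (N, N) position affinity
    attn = softmax(energy, axis=-1)     # row-normalized attention weights
    return feats @ attn.T               # (C, N) context-reweighted features

def channel_attention(feats):
    # Each channel attends to every other channel (inter-channel dependencies).
    energy = feats @ feats.T            # (C, C) channel affinity
    attn = softmax(energy, axis=-1)
    return attn @ feats                 # (C, N)

def dual_self_attention(feats, alpha=1.0, beta=1.0):
    # Hypothetical fusion: residual sum of the two attention branches.
    return feats + alpha * position_attention(feats) + beta * channel_attention(feats)

# Toy feature map: 4 channels over a 3x3 grid flattened to 9 positions.
x = np.random.rand(4, 9)
y = dual_self_attention(x)
print(y.shape)  # (4, 9)
```

In a full captioning pipeline, features like `y` would be reshaped back to (C, H, W) and passed to the decoder, so the language model conditions on globally contextualized rather than purely local features.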
Boyang Wan, Wenhui Jiang, Yuming Fang, Wenying Wen, Hantao Liu