Qimin Cheng, Yuzhuo Zhou, Haiyan Huang, Zhongyuan Wang
Dear editor, Cross-modal retrieval in remote sensing (RS) data has attracted increasing interest owing to its flexible input and efficient querying. In this letter, we aim to establish the semantic relationship between RS images and their description sentences. Specifically, we propose a multi-attention fusion and fine-grained alignment network, termed MAFA-Net, for bidirectional cross-modal image-sentence retrieval in RS. Multiple attention mechanisms are fused to enhance the discriminative ability of visual features for RS images with complex scenes, while a fine-grained alignment strategy is introduced to learn the hidden connections between RS observations and sentences. To validate the capability of MAFA-Net, we leverage four captioning benchmark datasets with paired RS images and descriptions, i.e., UCM-Captions, Sydney-Captions, RSICD and NWPU-Captions. Experimental results on the four datasets demonstrate that MAFA-Net yields better performance than current state-of-the-art approaches.
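To make the bidirectional retrieval setting concrete, the following is a minimal, generic sketch of cross-modal image-sentence matching by cosine similarity in a shared embedding space. It illustrates only the retrieval step common to this family of methods, not MAFA-Net itself; the embeddings, function names, and toy data are illustrative assumptions.

```python
import numpy as np

def l2_normalize(x, eps=1e-12):
    # Scale each row to unit length so the dot product equals cosine similarity.
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)

def bidirectional_retrieve(image_emb, text_emb):
    # Cosine similarity matrix: rows index images, columns index sentences.
    sim = l2_normalize(image_emb) @ l2_normalize(text_emb).T
    # Image-to-sentence retrieval: best-matching sentence per image.
    img2txt = sim.argmax(axis=1)
    # Sentence-to-image retrieval: best-matching image per sentence.
    txt2img = sim.argmax(axis=0)
    return sim, img2txt, txt2img

# Toy example: 3 image/sentence pairs embedded in a shared 4-D space,
# where matching pairs are nearly collinear (identity plus small noise).
rng = np.random.default_rng(0)
images = np.eye(3, 4) + 0.01 * rng.standard_normal((3, 4))
texts = np.eye(3, 4) + 0.01 * rng.standard_normal((3, 4))
sim, img2txt, txt2img = bidirectional_retrieve(images, texts)
```

In practice the image and sentence embeddings would come from learned encoders (e.g., attention-enhanced visual features on one side and sentence features on the other), and retrieval quality is reported with recall-at-K on the ranked similarity scores.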