With the increasing popularity of remote sensing technology, some emergency scenarios, such as earthquake rescue, require rapid retrieval of remote sensing images. Because voice input is highly efficient, researchers have focused on cross-modal remote sensing image-voice retrieval methods. However, these methods have two major drawbacks: speech input lacks discriminative power, and intra-modal semantic information is underused. To address these drawbacks, we propose a novel cross-modal feature fusion retrieval model. Our model learns a better-optimized cross-modal common feature space than previous models and thus improves retrieval performance. First, our model augments the audio features with additional textual keyword information for remote sensing image retrieval. Second, it introduces inter-modality adversarial learning and intra-modality semantic discrimination into the remote sensing image-voice retrieval task. We conducted experiments on two datasets modified from the UCM-Captions dataset and the Remote Sensing Image Caption Dataset. The experimental results show that our model outperforms state-of-the-art models on this task.
Yaxiong Chen, Xiaoqiang Lu, Shuai Wang