In this work, we address a cross-modal retrieval problem in remote sensing (RS) data. Cross-modal retrieval is more challenging than conventional uni-modal retrieval because it requires learning two completely different data representations and mapping them onto a shared feature space. For this purpose, we use a photo-sketch RS database. We exploit the modality that carries richer spatial information (the sketch) to guide the extraction of features from the other modality (the photo) through cross-attention networks. These sketch-attended photo features are more robust and yield better retrieval results. We validate our proposal by performing experiments on the benchmark Earth on Canvas dataset and show a boost in overall performance compared to the existing literature. Besides, we also present Grad-CAM visualizations of the trained model to highlight the framework's efficacy.
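The abstract does not give implementation details, but the core idea (photo features aggregated under attention weights derived from the sketch) can be illustrated with a minimal PyTorch sketch. Everything below is a hypothetical reading, not the authors' code: the class name SketchPhotoCrossAttention, the choice of embed_dim=256, num_heads=4, the residual-plus-norm layout, and the mean-pooled retrieval embedding are all assumptions for illustration.

import torch
import torch.nn as nn

class SketchPhotoCrossAttention(nn.Module):
    # One plausible form of the sketch-to-photo cross-attention the abstract
    # describes: sketch tokens act as queries over photo tokens, so the
    # output is a sketch-weighted aggregation of photo features.
    def __init__(self, embed_dim=256, num_heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(embed_dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(embed_dim)

    def forward(self, sketch_feats, photo_feats):
        # sketch_feats: (batch, n_sketch_tokens, embed_dim)  -- queries
        # photo_feats:  (batch, n_photo_tokens,  embed_dim)  -- keys/values
        attended, _ = self.attn(query=sketch_feats, key=photo_feats,
                                value=photo_feats)
        # Standard residual on the query stream, followed by layer norm.
        fused = self.norm(sketch_feats + attended)
        # Pool tokens into a single vector for a shared retrieval space.
        return fused.mean(dim=1)

# Toy usage: 8 sketch/photo pairs, 49 spatial tokens each (e.g., a 7x7 CNN map).
sketch = torch.randn(8, 49, 256)
photo = torch.randn(8, 49, 256)
embedding = SketchPhotoCrossAttention()(sketch, photo)  # shape: (8, 256)

In a retrieval setup of this kind, such embeddings for both modalities would typically be trained with a contrastive or triplet objective so that matching photo-sketch pairs lie close in the shared space; the abstract does not specify which loss is used.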