JOURNAL ARTICLE

Cross-Modal Remote Sensing Image–Audio Retrieval With Adaptive Learning for Aligning Correlation

Jinghao Huang, Yaxiong Chen, Shengwu Xiong, Xiaoqiang Lu

Year: 2024
Journal: IEEE Transactions on Geoscience and Remote Sensing
Vol: 62
Pages: 1-13
Publisher: Institute of Electrical and Electronics Engineers

Abstract

An important challenge that existing work has yet to address is that audio representations vary relatively little compared with the rich content of remote sensing images, so details in the images are easily overlooked. This information imbalance between modalities makes it difficult to maintain consistent representations. To address it, we propose a novel cross-modal remote sensing image–audio (RSIA) retrieval method called Adaptive Learning for Aligning Correlation (ALAC). ALAC integrates region-level learning into image annotation through a region-enhanced learning attention module. By collaboratively suppressing features at different region levels, ALAC provides a more comprehensive visual feature representation. In addition, we propose a novel adaptive knowledge transfer strategy that uses aligned feature vectors to guide the learning of the frontend network. This allows the model to adaptively acquire alignment information during learning, thereby facilitating better alignment between the two modalities. Finally, to better exploit the mutual information between modalities, we introduce a plug-and-play result rerank module. This module optimizes the similarity matrix by using retrieval mutual information between modalities as weights, significantly improving retrieval accuracy. Experimental results on four RSIA datasets demonstrate that ALAC outperforms other methods in retrieval performance. Compared with state-of-the-art methods, ALAC achieves improvements of 1.49%, 2.25%, 4.24%, and 1.33%, respectively. The code is available at https://github.com/huangjh98/ALAC.
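
The abstract does not give the formula behind the rerank module, so the following is only a rough illustration of the general idea of reweighting a similarity matrix with retrieval evidence from both directions. It substitutes a simple dual-softmax weighting for whatever ALAC actually uses; the function names (`softmax`, `rerank`, `best_match`), the toy matrix, and the assumption of non-negative similarity scores are all hypothetical and not taken from the paper or its repository.

```python
import math


def softmax(values):
    """Numerically stable softmax over a list of floats."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def rerank(sim):
    """Reweight an image-by-audio similarity matrix.

    Assumption (not from the paper): each score is multiplied by the
    image->audio weight (softmax over its row) and the audio->image
    weight (softmax over its column). Scores are assumed non-negative.
    """
    n_img, n_aud = len(sim), len(sim[0])
    # Image -> audio weights: normalize each row.
    i2a = [softmax(row) for row in sim]
    # Audio -> image weights: normalize each column.
    a2i = [softmax([sim[i][j] for i in range(n_img)]) for j in range(n_aud)]
    return [
        [sim[i][j] * i2a[i][j] * a2i[j][i] for j in range(n_aud)]
        for i in range(n_img)
    ]


def best_match(row):
    """Index of the highest-scoring audio clip for one image."""
    return max(range(len(row)), key=lambda j: row[j])


# Toy scores: rows are images, columns are audio clips.
sim = [
    [0.9, 0.8],  # image 0 likes both clips
    [0.9, 0.1],  # image 1 clearly prefers clip 0
]
reranked = rerank(sim)
print([best_match(r) for r in sim])       # ranking before reweighting
print([best_match(r) for r in reranked])  # ranking after reweighting
```

In this toy matrix, audio clip 0 is claimed equally by both images while clip 1 is matched almost exclusively by image 0, so the reweighting moves image 0's top match from clip 0 to clip 1 and leaves image 1's top match unchanged. Because such a step operates only on the final similarity matrix, it can be attached to any retrieval model without retraining, which is consistent with the "plug-and-play" description above.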

Keywords:
Computer science; Modalities; Feature (linguistics); Modality (human–computer interaction); Process (computing); Similarity (geometry); Mutual information; Feature learning; Artificial intelligence; Modal; Representation (politics); Pattern recognition (psychology); Image (mathematics)

Metrics

Cited By: 2
FWCI (Field Weighted Citation Impact): 1.43
Refs: 54
Citation Normalized Percentile: 0.69

Topics

Music and Audio Processing
Physical Sciences →  Computer Science →  Signal Processing
Speech and Audio Processing
Physical Sciences →  Computer Science →  Signal Processing
Video Analysis and Summarization
Physical Sciences →  Computer Science →  Computer Vision and Pattern Recognition
