Hypersphere-Based Remote Sensing Cross-Modal Text–Image Retrieval via Curriculum Learning

W Zhang; Jihao Li; Shuoke Li; Jialiang Chen; Wenkai Zhang; Xin Gao; Xian Sun

doi:10.1109/tgrs.2023.3318227

JOURNAL ARTICLE

Year: 2023 Journal: IEEE Transactions on Geoscience and Remote Sensing Vol: 61 Pages: 1-15 Publisher: Institute of Electrical and Electronics Engineers

Abstract

Remote sensing cross-modal text-image retrieval (RSCTIR) is a flexible, human-centered approach to retrieving rich information across modalities, and it has attracted considerable attention in recent years. It remains challenging because current methods usually ignore the varying difficulty of different sample pairs, which stems from the large distribution differences among images and the high similarity among texts in the remote sensing (RS) field. In this paper, we therefore propose a hypersphere-based visual semantic alignment (HVSA) network trained via curriculum learning. Specifically, we first design an adaptive alignment strategy based on curriculum learning that aligns RS image-text pairs from easy to hard. Sample pairs of different difficulty are treated differently, which yields a better embedding representation when the features are projected onto the unit hypersphere. Then, to measure the robustness of cross-modal feature alignment on the unit hypersphere, we introduce a feature uniformity strategy, which reduces mismatches and improves generalization. Finally, we design a key-entity attention (KEA) mechanism to alleviate the information imbalance between modalities; KEA extracts information about the key entity that is aligned with the textual information. Despite its conciseness, our framework achieves state-of-the-art performance on the classical RSCTIR datasets while offering faster inference. The summed recall of HVSA on RSICD and RSITMD is 120.97 and 198.94, which is 2.50 and 10.49 points ahead of the current best methods, respectively. Extensive experiments demonstrate the competitiveness of our method. The code has been released at https://github.com/ZhangWeihang99/HVSA.
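The abstract names three ingredients that act on the unit hypersphere: projecting features onto it, aligning matched image-text pairs from easy to hard, and encouraging feature uniformity. The sketch below is only an illustration of those general ideas in plain Python and is not the authors' implementation (see the released repository for that). The function names, the exponential weighting schedule in `curriculum_weights`, the `progress` parameter, and the Gaussian-potential form of `uniformity_loss` are all assumptions introduced here.

```python
import math

def l2_normalize(v):
    """Scale a non-zero vector to unit length, placing it on the unit hypersphere."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def sq_dist(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def curriculum_weights(difficulties, progress):
    """Hypothetical easy-to-hard schedule (an assumption, not the paper's rule).

    progress runs from 0.0 (start of training) to 1.0 (end). Early on, harder
    pairs (larger difficulty) get smaller weights; at progress 1.0 every pair
    has weight 1.0.
    """
    return [math.exp(-(1.0 - progress) * d) for d in difficulties]

def alignment_loss(images, texts, weights):
    """Weighted mean squared distance between matched image/text embeddings."""
    dists = [sq_dist(i, t) for i, t in zip(images, texts)]
    return sum(w * d for w, d in zip(weights, dists)) / sum(weights)

def uniformity_loss(points, t=2.0):
    """Log of the mean Gaussian potential over distinct pairs.

    The value is 0.0 when all points coincide and becomes more negative as
    the points spread out over the sphere.
    """
    pairs = [(a, b) for i, a in enumerate(points) for b in points[i + 1:]]
    return math.log(sum(math.exp(-t * sq_dist(a, b)) for a, b in pairs) / len(pairs))

# Toy 2-D embeddings; row k of images is matched with row k of texts.
images = [l2_normalize(v) for v in ([1.0, 0.2], [0.1, 1.0], [-1.0, 0.3])]
texts = [l2_normalize(v) for v in ([1.0, 0.0], [0.5, 1.0], [0.2, -1.0])]

# Here the current pair distance stands in for "difficulty" (an assumption).
difficulty = [sq_dist(i, t) for i, t in zip(images, texts)]
early = alignment_loss(images, texts, curriculum_weights(difficulty, 0.0))
late = alignment_loss(images, texts, curriculum_weights(difficulty, 1.0))
spread = uniformity_loss(images + texts)
```

In this toy setup `early` is never larger than `late`, because the early weights shrink the contribution of the hardest pair. A training loop would presumably combine the alignment and uniformity terms, but how HVSA weights and schedules them is not stated in the abstract.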

Keywords:

Hypersphere; Computer science; Artificial intelligence; Feature learning; Inference; Pattern recognition (psychology); Robustness (evolution); Feature extraction; Feature (linguistics); Embedding; MNIST database; Machine learning; Deep learning

Metrics

FWCI (Field Weighted Citation Impact): 6.73

Citation Normalized Percentile: 0.96

Topics

Multimodal Machine Learning Applications

Physical Sciences → Computer Science → Computer Vision and Pattern Recognition

Advanced Image and Video Retrieval Techniques

Physical Sciences → Computer Science → Computer Vision and Pattern Recognition

Domain Adaptation and Few-Shot Learning

Physical Sciences → Computer Science → Artificial Intelligence

Related Documents

Masking-Based Cross-Modal Remote Sensing Image–Text Retrieval via Dynamic Contrastive Learning

CECMR: Calibrated Evidential Learning For Cross Modal Remote Sensing Image-Text Retrieval

Remote Sensing Cross-Modal Text-Image Retrieval Based on Global and Local Information

Remote Sensing Cross-Modal Text-Image Retrieval Based on Attention Correction and Filtering

A Review of Cross-Modal Image–Text Retrieval in Remote Sensing