Yabing Wang, Jianfeng Dong, Tianxiang Liang, Minsong Zhang, Rui Cai, Xun Wang
Despite the recent developments in the field of cross-modal retrieval, there has been less research focusing on low-resource languages due to the lack of manually annotated datasets. In this paper, we propose a noise-robust cross-lingual cross-modal retrieval method for low-resource languages. To this end, we use Machine Translation (MT) to construct pseudo-parallel sentence pairs for low-resource languages. However, as MT is not perfect, it tends to introduce noise during translation, corrupting the textual embeddings and thereby compromising retrieval performance. To alleviate this, we introduce a multi-view self-distillation method for learning noise-robust target-language representations, which employs a cross-attention module to generate soft pseudo-targets that provide direct supervision from both a similarity-based view and a feature-based view. In addition, inspired by back-translation in unsupervised MT, we minimize the semantic discrepancy between original sentences and their back-translated counterparts to further improve the noise robustness of the textual encoder. Extensive experiments on three video-text and image-text cross-modal retrieval benchmarks across different languages show that our method significantly improves overall performance without using extra human-labeled data. Moreover, when equipped with a pre-trained visual encoder from a recent vision-and-language pre-training framework, i.e., CLIP, our model achieves a significant performance gain, showing that our method is compatible with popular pre-training models. Code and data are available at https://github.com/HuiGuanLab/nrccr.
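To make the two noise-robustness components concrete, the following is a minimal PyTorch sketch of the loss terms described above: self-distillation from soft pseudo-targets in a similarity-based view (KL divergence over in-batch text-visual similarities) and a feature-based view (MSE toward the cross-attention teacher features), plus a back-translation consistency term. The tensor names and the exact loss forms here are illustrative assumptions, not the authors' implementation; see the linked repository for the actual code.

import torch
import torch.nn.functional as F

def similarity_distillation(sim_teacher, sim_student, tau=0.05):
    # Similarity-based view: match the student's similarity distribution
    # over in-batch candidates to soft pseudo-targets from the teacher.
    soft_targets = F.softmax(sim_teacher / tau, dim=-1).detach()
    log_probs = F.log_softmax(sim_student / tau, dim=-1)
    return F.kl_div(log_probs, soft_targets, reduction="batchmean")

def feature_distillation(feat_teacher, feat_student):
    # Feature-based view: pull student sentence features toward the
    # (detached) features of the cross-attention teacher.
    return F.mse_loss(feat_student, feat_teacher.detach())

def back_translation_consistency(emb_orig, emb_backtrans):
    # Minimize the semantic discrepancy between an original sentence and
    # its back-translated counterpart (cosine distance used here as one
    # plausible choice of discrepancy measure).
    return (1 - F.cosine_similarity(emb_orig, emb_backtrans, dim=-1)).mean()

if __name__ == "__main__":
    torch.manual_seed(0)
    B, D = 4, 8
    # Hypothetical stand-ins for encoder outputs; the real model encodes
    # videos/images and source/target-language sentences.
    feat_teacher = torch.randn(B, D)                   # cross-attention (teacher) view
    feat_student = torch.randn(B, D, requires_grad=True)
    visual = torch.randn(B, D)
    sim_teacher = feat_teacher @ visual.t()            # teacher text-visual similarities
    sim_student = feat_student @ visual.t()            # student text-visual similarities
    emb_orig, emb_bt = torch.randn(B, D), torch.randn(B, D)
    loss = (similarity_distillation(sim_teacher, sim_student)
            + feature_distillation(feat_teacher, feat_student)
            + back_translation_consistency(emb_orig, emb_bt))
    loss.backward()
    print(float(loss))

In practice these terms would be weighted and added to the main cross-modal contrastive retrieval objective; detaching the teacher-side tensors keeps the pseudo-targets fixed so that only the student branch is updated by the distillation losses.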