JOURNAL ARTICLE

Cross-Lingual Cross-Modal Retrieval with Noise-Robust Learning

Yabing Wang, Jianfeng Dong, Tianxiang Liang, Minsong Zhang, Rui Cai, Xun Wang

Year: 2022 Published in: Proceedings of the 30th ACM International Conference on Multimedia Pages: 422-433

Abstract

Despite recent developments in the field of cross-modal retrieval, there has been less research focusing on low-resource languages due to the lack of manually annotated datasets. In this paper, we propose a noise-robust cross-lingual cross-modal retrieval method for low-resource languages. To this end, we use Machine Translation (MT) to construct pseudo-parallel sentence pairs for low-resource languages. However, as MT is not perfect, it tends to introduce noise during translation, corrupting the textual embeddings and thereby compromising retrieval performance. To alleviate this, we introduce a multi-view self-distillation method to learn noise-robust target-language representations, which employs a cross-attention module to generate soft pseudo-targets that provide direct supervision from a similarity-based view and a feature-based view. Besides, inspired by back-translation in unsupervised MT, we minimize the semantic discrepancies between original sentences and back-translated sentences to further improve the noise robustness of the textual encoder. Extensive experiments are conducted on three video-text and image-text cross-modal retrieval benchmarks across different languages, and the results demonstrate that our method significantly improves the overall performance without using extra human-labeled data. In addition, equipped with a pre-trained visual encoder from a recent vision-and-language pre-training framework, i.e., CLIP, our model achieves a significant performance gain, showing that our method is compatible with popular pre-training models. Code and data are available at https://github.com/HuiGuanLab/nrccr.
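The back-translation consistency idea described in the abstract (minimizing the semantic discrepancy between a sentence and its back-translation) can be sketched as a simple embedding-space loss. This is an illustrative sketch only: the function name, list-of-vectors representation, and the choice of cosine distance as the discrepancy measure are assumptions, not details taken from the paper.

```python
import math

def back_translation_consistency_loss(z_orig, z_back):
    """Mean cosine distance between paired sentence embeddings.

    z_orig, z_back: lists of equal-length embedding vectors for the
    original sentences and their back-translations. Minimizing this
    quantity pulls each pair together, encouraging the textual encoder
    to be robust to noise introduced by machine translation.
    """
    def cosine(u, v):
        dot = sum(a * b for a, b in zip(u, v))
        norm_u = math.sqrt(sum(a * a for a in u))
        norm_v = math.sqrt(sum(b * b for b in v))
        return dot / (norm_u * norm_v)

    # 1 - cos(u, v) is 0 for identical directions, up to 2 for opposite ones.
    return sum(1.0 - cosine(u, v) for u, v in zip(z_orig, z_back)) / len(z_orig)
```

In practice such a term would be added to the main retrieval objective and computed on encoder outputs rather than raw vectors; any distance that vanishes for semantically identical pairs (e.g. MSE) could play the same role.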

Keywords:
Computer science, machine translation, robustness, encoder, artificial intelligence, natural language processing, sentence, noise, cross-modal, speech recognition, pattern recognition, image

Metrics

Cited by: 23
FWCI (Field-Weighted Citation Impact): 1.59
References: 102
Citation Normalized Percentile: 0.88

Topics

Multimodal Machine Learning Applications (Physical Sciences → Computer Science → Computer Vision and Pattern Recognition)
Advanced Image and Video Retrieval Techniques (Physical Sciences → Computer Science → Computer Vision and Pattern Recognition)
Domain Adaptation and Few-Shot Learning (Physical Sciences → Computer Science → Artificial Intelligence)