Yabing Wang, Jianfeng Dong, Tianxiang Liang, Minsong Zhang, Rui Cai, Xun Wang
Despite the recent developments in the field of cross-modal retrieval, there has been less research focusing on low-resource languages due to the lack of manually annotated datasets. In this paper, we propose a noise-robust cross-lingual cross-modal retrieval method for low-resource languages. To this end, we use Machine Translation (MT) to construct pseudo-parallel sentence pairs for low-resource languages. However, as MT is not perfect, it tends to introduce noise during translation, corrupting the textual embeddings and thereby compromising retrieval performance. To alleviate this, we introduce a multi-view self-distillation method for learning noise-robust target-language representations, which employs a cross-attention module to generate soft pseudo-targets that provide direct supervision from both a similarity-based view and a feature-based view. In addition, inspired by back-translation in unsupervised MT, we minimize the semantic discrepancy between original sentences and their back-translated counterparts to further improve the noise robustness of the textual encoder. Extensive experiments on three video-text and image-text cross-modal retrieval benchmarks across different languages show that our method significantly improves overall performance without using extra human-labeled data. Moreover, when equipped with a pre-trained visual encoder from a recent vision-and-language pre-training framework, i.e., CLIP, our model achieves a significant performance gain, showing that our method is compatible with popular pre-training models. Code and data are available at https://github.com/HuiGuanLab/nrccr.
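To make the two noise-robustness components concrete, the following is a minimal PyTorch sketch of the loss terms described above: self-distillation from soft pseudo-targets in a similarity-based view (KL divergence over in-batch text-visual similarities) and a feature-based view (MSE toward the cross-attention teacher features), plus a back-translation consistency term. The tensor names and the exact loss forms here are illustrative assumptions, not the authors' implementation; see the linked repository for the actual code.

import torch
import torch.nn.functional as F

def similarity_distillation(sim_teacher, sim_student, tau=0.05):
    # Similarity-based view: match the student's similarity distribution
    # over in-batch candidates to soft pseudo-targets from the teacher.
    soft_targets = F.softmax(sim_teacher / tau, dim=-1).detach()
    log_probs = F.log_softmax(sim_student / tau, dim=-1)
    return F.kl_div(log_probs, soft_targets, reduction="batchmean")

def feature_distillation(feat_teacher, feat_student):
    # Feature-based view: pull student sentence features toward the
    # (detached) features of the cross-attention teacher.
    return F.mse_loss(feat_student, feat_teacher.detach())

def back_translation_consistency(emb_orig, emb_backtrans):
    # Minimize the semantic discrepancy between an original sentence and
    # its back-translated counterpart (cosine distance used here as one
    # plausible choice of discrepancy measure).
    return (1 - F.cosine_similarity(emb_orig, emb_backtrans, dim=-1)).mean()

if __name__ == "__main__":
    torch.manual_seed(0)
    B, D = 4, 8
    # Hypothetical stand-ins for encoder outputs; the real model encodes
    # videos/images and source/target-language sentences.
    feat_teacher = torch.randn(B, D)                   # cross-attention (teacher) view
    feat_student = torch.randn(B, D, requires_grad=True)
    visual = torch.randn(B, D)
    sim_teacher = feat_teacher @ visual.t()            # teacher text-visual similarities
    sim_student = feat_student @ visual.t()            # student text-visual similarities
    emb_orig, emb_bt = torch.randn(B, D), torch.randn(B, D)
    loss = (similarity_distillation(sim_teacher, sim_student)
            + feature_distillation(feat_teacher, feat_student)
            + back_translation_consistency(emb_orig, emb_bt))
    loss.backward()
    print(float(loss))

In practice these terms would be weighted and added to the main cross-modal contrastive retrieval objective; detaching the teacher-side tensors keeps the pseudo-targets fixed so that only the student branch is updated by the distillation losses.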