Kai Wang, Yifan Wang, Xing Xu, Zuo Cao, Xunliang Cai
Zero-shot Cross-Modal Retrieval (ZS-CMR) is challenging because data distributions are heterogeneous across modalities and semantics are inconsistent between seen and unseen classes. Previous methods usually perform class-level semantic alignment of data from different modalities by introducing auxiliary word embeddings of class labels. This approach has a serious limitation: learning only class-level information ignores intra-modal variance. To solve this problem, we propose the Instance-Level Semantic Alignment (ILSA) method, which makes full use of instance-level information. We use two disentanglement variational auto-encoders to decompose the data from the two modalities into modality-specific and modality-invariant features. With an instance-level semantic feature extractor and a distribution generator, ILSA generates more appropriate distributions from the learned instance-level semantic features, without any auxiliary knowledge. We conduct experiments on six widely used datasets under two ZS-CMR scenarios; the results show that our method establishes new state-of-the-art performance on all datasets.
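The abstract does not give architectural details, so the following is only a minimal sketch of the general idea of a disentangling variational encoder, not the ILSA implementation. It assumes a single linear layer per encoder, untrained random weights, and a latent vector that is simply split into a modality-specific part and a modality-invariant part. Every name here (`DisentanglingEncoder`, `specific_dim`, `invariant_dim`, `kl_to_standard_normal`) is hypothetical and chosen for illustration.

```python
import math
import random


def init_linear(rng, out_dim, in_dim, scale=0.1):
    """Return (weights, bias) for an affine layer with small random weights."""
    weights = [[rng.gauss(0.0, scale) for _ in range(in_dim)] for _ in range(out_dim)]
    return weights, [0.0] * out_dim


def linear(weights, bias, x):
    """Apply the affine map: one output value per weight row."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, bias)]


class DisentanglingEncoder:
    """Hypothetical encoder: maps one input to a Gaussian latent, then splits it
    into a modality-specific code and a modality-invariant code."""

    def __init__(self, rng, in_dim, specific_dim, invariant_dim):
        self.rng = rng
        self.specific_dim = specific_dim
        latent_dim = specific_dim + invariant_dim
        self.mu_layer = init_linear(rng, latent_dim, in_dim)
        self.logvar_layer = init_linear(rng, latent_dim, in_dim)

    def encode(self, x):
        mu = linear(*self.mu_layer, x)
        logvar = linear(*self.logvar_layer, x)
        # Reparameterization: z = mu + sigma * eps, with eps ~ N(0, 1).
        z = [m + math.exp(0.5 * lv) * self.rng.gauss(0.0, 1.0)
             for m, lv in zip(mu, logvar)]
        k = self.specific_dim
        return {"specific": z[:k], "invariant": z[k:], "mu": mu, "logvar": logvar}


def kl_to_standard_normal(mu, logvar):
    """KL divergence from N(mu, diag(exp(logvar))) to N(0, I)."""
    return -0.5 * sum(1.0 + lv - m * m - math.exp(lv) for m, lv in zip(mu, logvar))


# One encoder per modality; the input sizes differ, the invariant size is shared.
rng = random.Random(0)
image_encoder = DisentanglingEncoder(rng, in_dim=8, specific_dim=2, invariant_dim=3)
text_encoder = DisentanglingEncoder(rng, in_dim=5, specific_dim=4, invariant_dim=3)

image_code = image_encoder.encode([rng.random() for _ in range(8)])
text_code = text_encoder.encode([rng.random() for _ in range(5)])
image_kl = kl_to_standard_normal(image_code["mu"], image_code["logvar"])
```

In this sketch the two encoders take inputs of different sizes but emit invariant codes of the same size, which is what would allow the two modalities to be compared in a shared space. The instance-level semantic feature extractor and the distribution generator mentioned in the abstract are not modeled here.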
Shiping Ge, Zhiwei Jiang, Yafeng Yin, Cong Wang, Zifeng Cheng, Qing Gu
Xing Xu, Kaiyi Lin, Huimin Lu, Lianli Gao, Heng Tao Shen
Chuang Li, Lunke Fei, Peipei Kang, Jiahao Liang, Xiaozhao Fang, Shaohua Teng
Cheng Deng, Xinxun Xu, Hao Wang, Muli Yang, Dacheng Tao