Jian-qiong Xiao, Zhiyong Zhou, Xiaoqing Zhou
This paper proposes a self-supervised adversarial cross-modal retrieval model (SAHCM) based on an attention mechanism. The model fuses the attention mechanism into the adversarial generation process for cross-modal representation, allowing it to capture both the global semantic information and the local details of text, image, and video data, and to learn a common representation space for multimodal data. Experiments show that the proposed method accurately matches images with textual descriptions of complex content, reduces the storage required for cross-modal retrieval, and improves computational efficiency. It achieves state-of-the-art results in cross-modal retrieval experiments on the MSCOCO dataset.
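To make the two ideas the abstract combines concrete, the following is a minimal sketch, not the authors' SAHCM implementation: attention pooling over local features (image regions or caption words) to retain both local and global information, plus an adversarial modality discriminator that pushes image and text embeddings into one common space. All module names, feature dimensions, and the loss weighting are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AttentionPool(nn.Module):
    """Soft attention over local features, yielding one global summary vector."""
    def __init__(self, dim):
        super().__init__()
        self.score = nn.Sequential(nn.Linear(dim, dim), nn.Tanh(), nn.Linear(dim, 1))

    def forward(self, feats):                        # feats: (B, N, dim)
        w = torch.softmax(self.score(feats), dim=1)  # per-location attention weights
        return (w * feats).sum(dim=1)                # (B, dim)

class Encoder(nn.Module):
    """Projects one modality's local features into the common space."""
    def __init__(self, in_dim, common_dim):
        super().__init__()
        self.proj = nn.Linear(in_dim, common_dim)
        self.pool = AttentionPool(common_dim)

    def forward(self, feats):                        # (B, N, in_dim)
        return F.normalize(self.pool(self.proj(feats)), dim=-1)

class Discriminator(nn.Module):
    """Adversary: guesses whether an embedding came from an image or a text."""
    def __init__(self, dim):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, 1))

    def forward(self, z):
        return self.net(z).squeeze(-1)               # logits: image=1, text=0

common = 256                                          # assumed common-space size
img_enc = Encoder(2048, common)                       # e.g. CNN region features
txt_enc = Encoder(300, common)                        # e.g. word embeddings
disc = Discriminator(common)
opt_enc = torch.optim.Adam(
    list(img_enc.parameters()) + list(txt_enc.parameters()), lr=1e-4)
opt_disc = torch.optim.Adam(disc.parameters(), lr=1e-4)

# One toy training step on random "paired" data.
img = torch.randn(8, 36, 2048)                        # 36 regions per image
txt = torch.randn(8, 20, 300)                         # 20 words per caption
zi, zt = img_enc(img), txt_enc(txt)

# 1) Train the discriminator to tell the modalities apart.
d_loss = (F.binary_cross_entropy_with_logits(disc(zi.detach()), torch.ones(8))
          + F.binary_cross_entropy_with_logits(disc(zt.detach()), torch.zeros(8)))
opt_disc.zero_grad(); d_loss.backward(); opt_disc.step()

# 2) Train the encoders to align matched pairs while fooling the discriminator,
#    which drives the two modalities toward a shared, indistinguishable space.
match_loss = (1 - F.cosine_similarity(zi, zt)).mean()
fool_loss = (F.binary_cross_entropy_with_logits(disc(zi), torch.zeros(8))
             + F.binary_cross_entropy_with_logits(disc(zt), torch.ones(8)))
enc_loss = match_loss + 0.1 * fool_loss              # 0.1 is an assumed weight
opt_enc.zero_grad(); enc_loss.backward(); opt_enc.step()
```

Once the encoders are trained this way, retrieval reduces to a nearest-neighbor search between image and text embeddings in the common space, which is what makes compact storage and fast lookup possible.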
Xiaoxiao Wang, Meiyu Liang, Xiaowen Cao, Junping Du
Shouyong Peng, Tao Yao, Ying Li, Gang Wang, Lili Wang, Zhiming Yan
Yi-Fan Li, Xuan Wang, Lei Cui, Jiajia Zhang, Cheng-Kai Huang, Xuan Luo, Shuhan Qi
Xi Zhang, Hanjiang Lai, Jiashi Feng