Despite recent progress in cross-modal text-to-video retrieval, performance remains unsatisfactory. Most existing works learn a joint embedding space to measure the distance between global-level or local-level textual and video representations. The fine-grained interactions between video segments and phrases are usually neglected in cross-modal learning, which results in suboptimal retrieval performance. To tackle this problem, we propose a novel Fine-grained Cross-modal Alignment Network (FCA-Net), which considers the interactions between visual semantic units (i.e., sub-actions/sub-events) in videos and phrases in sentences for cross-modal alignment. Specifically, the interactions between visual semantic units and phrases are formulated as a link prediction problem optimized by a graph auto-encoder, which obtains the explicit relations between them and enhances the aligned feature representation for fine-grained cross-modal alignment. Experimental results on the MSR-VTT, YouCook2, and VATEX datasets demonstrate the superiority of our model compared to state-of-the-art methods.
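To illustrate the link-prediction formulation mentioned above, the following is a minimal PyTorch sketch of a graph auto-encoder that scores alignment links between video sub-action nodes and phrase nodes. All class names, dimensions, the two-layer GCN encoder, and the inner-product decoder are illustrative assumptions, not the exact architecture of FCA-Net.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CrossModalGAE(nn.Module):
    """Toy graph auto-encoder: nodes are video sub-action features and
    phrase features; edges indicate cross-modal alignment. Hypothetical
    sketch only, not the paper's implementation."""

    def __init__(self, in_dim=512, hid_dim=256):
        super().__init__()
        self.w1 = nn.Linear(in_dim, hid_dim, bias=False)   # first GCN layer
        self.w2 = nn.Linear(hid_dim, hid_dim, bias=False)  # second GCN layer

    def gcn_layer(self, adj, x, weight):
        # symmetrically normalized graph convolution: D^-1/2 (A+I) D^-1/2 X W
        a_hat = adj + torch.eye(adj.size(0), device=adj.device)
        d_inv_sqrt = a_hat.sum(dim=1).pow(-0.5)
        norm = d_inv_sqrt.unsqueeze(1) * a_hat * d_inv_sqrt.unsqueeze(0)
        return norm @ weight(x)

    def encode(self, adj, x):
        h = F.relu(self.gcn_layer(adj, x, self.w1))
        return self.gcn_layer(adj, h, self.w2)

    def decode(self, z):
        # inner-product decoder: predicted probability of an alignment link
        return torch.sigmoid(z @ z.t())

def link_prediction_loss(pred_adj, target_adj):
    # binary cross-entropy between predicted and observed alignment links
    return F.binary_cross_entropy(pred_adj, target_adj)

# Toy usage: 4 video sub-action nodes and 3 phrase nodes in one graph.
video_feats = torch.randn(4, 512)
phrase_feats = torch.randn(3, 512)
x = torch.cat([video_feats, phrase_feats], dim=0)  # (7, 512) node features
adj = torch.zeros(7, 7)
adj[0, 4] = adj[4, 0] = 1.0                        # observed alignment links
adj[2, 5] = adj[5, 2] = 1.0

model = CrossModalGAE()
z = model.encode(adj, x)                            # aligned node embeddings
loss = link_prediction_loss(model.decode(z), adj)
loss.backward()
```

In this toy setup, the reconstructed adjacency matrix serves as the predicted alignment between sub-actions and phrases, and the encoder's node embeddings play the role of the alignment-enhanced features used for retrieval.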