Despite recent progress in cross-modal text-to-video retrieval, performance remains unsatisfactory. Most existing works learn a joint embedding space to measure the distance between global-level or local-level textual and video representations. The fine-grained interactions between video segments and phrases are usually neglected in cross-modal learning, which results in suboptimal retrieval performance. To tackle this problem, we propose a novel Fine-grained Cross-modal Alignment Network (FCA-Net), which considers the interactions between visual semantic units (i.e., sub-actions/sub-events) in videos and phrases in sentences for cross-modal alignment. Specifically, the interactions between visual semantic units and phrases are formulated as a link prediction problem optimized by a graph auto-encoder, which obtains the explicit relations between them and enhances the aligned feature representation for fine-grained cross-modal alignment. Experimental results on the MSR-VTT, YouCook2, and VATEX datasets demonstrate the superiority of our model compared to state-of-the-art methods.
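To illustrate the link-prediction formulation mentioned above, the following is a minimal PyTorch sketch of a graph auto-encoder that scores alignment links between video sub-action nodes and phrase nodes. All class names, dimensions, the two-layer GCN encoder, and the inner-product decoder are illustrative assumptions, not the exact architecture of FCA-Net.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CrossModalGAE(nn.Module):
    """Toy graph auto-encoder: nodes are video sub-action features and
    phrase features; edges indicate cross-modal alignment. Hypothetical
    sketch only, not the paper's implementation."""

    def __init__(self, in_dim=512, hid_dim=256):
        super().__init__()
        self.w1 = nn.Linear(in_dim, hid_dim, bias=False)   # first GCN layer
        self.w2 = nn.Linear(hid_dim, hid_dim, bias=False)  # second GCN layer

    def gcn_layer(self, adj, x, weight):
        # symmetrically normalized graph convolution: D^-1/2 (A+I) D^-1/2 X W
        a_hat = adj + torch.eye(adj.size(0), device=adj.device)
        d_inv_sqrt = a_hat.sum(dim=1).pow(-0.5)
        norm = d_inv_sqrt.unsqueeze(1) * a_hat * d_inv_sqrt.unsqueeze(0)
        return norm @ weight(x)

    def encode(self, adj, x):
        h = F.relu(self.gcn_layer(adj, x, self.w1))
        return self.gcn_layer(adj, h, self.w2)

    def decode(self, z):
        # inner-product decoder: predicted probability of an alignment link
        return torch.sigmoid(z @ z.t())

def link_prediction_loss(pred_adj, target_adj):
    # binary cross-entropy between predicted and observed alignment links
    return F.binary_cross_entropy(pred_adj, target_adj)

# Toy usage: 4 video sub-action nodes and 3 phrase nodes in one graph.
video_feats = torch.randn(4, 512)
phrase_feats = torch.randn(3, 512)
x = torch.cat([video_feats, phrase_feats], dim=0)  # (7, 512) node features
adj = torch.zeros(7, 7)
adj[0, 4] = adj[4, 0] = 1.0                        # observed alignment links
adj[2, 5] = adj[5, 2] = 1.0

model = CrossModalGAE()
z = model.encode(adj, x)                            # aligned node embeddings
loss = link_prediction_loss(model.decode(z), adj)
loss.backward()
```

In this toy setup, the reconstructed adjacency matrix serves as the predicted alignment between sub-actions and phrases, and the encoder's node embeddings play the role of the alignment-enhanced features used for retrieval.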