Multi-Feature Graph Attention Network for Cross-Modal Video-Text Retrieval

Xiaoshuai Hao; Yucan Zhou; Dayan Wu; Wanqian Zhang; Bo Li; Weiping Wang

doi:10.1145/3460426.3463608

Year: 2021 Pages: 135-143

Abstract

Cross-modal retrieval between videos and texts has attracted growing attention due to the rapid growth of user-generated videos on the web. To solve this problem, most approaches learn a joint embedding space in which to measure cross-modal similarities, while paying little attention to the representation of each modality. A video contains more than the commonly used visual features, since the audio and the on-screen captions also carry rich information. Recently, aggregating multiple features in videos has improved the benchmark performance of video-text retrieval systems. However, these methods usually handle each feature independently, which ignores the exchange of high-level semantic relations among the features. Moreover, besides the inter-modal ranking constraint, under which semantically similar texts and videos should stay close, the modality-specific requirement, i.e. that two similar videos or texts should have similar representations, is also significant. In this paper, we propose a novel Multi-Feature Graph ATtention Network (MFGATN) for cross-modal video-text retrieval. Specifically, we introduce a multi-feature graph attention module, which enriches the representation of each video feature by exchanging high-level semantic information among the features. Moreover, we design a novel Dual Constraint Ranking Loss (DCRL), which simultaneously considers the inter-modal ranking constraint and the intra-modal structure constraint to preserve both the cross-modal semantic similarity and the modality-specific consistency in the embedding space. Experiments on two datasets, MSR-VTT and MSVD, demonstrate that our method achieves a significant performance gain over state-of-the-art methods.
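
The abstract does not give the formula for the Dual Constraint Ranking Loss, so the following is only a minimal NumPy sketch of the general idea, not the authors' implementation. It assumes a standard bidirectional hinge ranking loss for the inter-modal term and, as a purely illustrative choice, a squared difference between the video-video and text-text similarity matrices for the intra-modal term. The function names, the margin of 0.2, and the weight parameter are all hypothetical.

```python
import numpy as np


def l2_normalize(x, eps=1e-8):
    """Scale each row to (approximately) unit length."""
    return x / (np.linalg.norm(x, axis=1, keepdims=True) + eps)


def inter_modal_ranking_loss(video_emb, text_emb, margin=0.2):
    """Bidirectional hinge ranking loss; row i of each input is a matched pair.

    Assumption: a common triplet-style formulation, not taken from the paper.
    """
    sim = l2_normalize(video_emb) @ l2_normalize(text_emb).T  # (B, B) cosine similarities
    pos = np.diag(sim)                                        # similarity of matched pairs
    # For each video (row), penalize non-matching texts that score too high.
    cost_v2t = np.maximum(0.0, margin + sim - pos[:, None])
    # For each text (column), penalize non-matching videos that score too high.
    cost_t2v = np.maximum(0.0, margin + sim - pos[None, :])
    off_diag = 1.0 - np.eye(sim.shape[0])                     # ignore the matched pairs
    return float(((cost_v2t + cost_t2v) * off_diag).sum() / sim.shape[0])


def intra_modal_structure_loss(video_emb, text_emb):
    """Illustrative intra-modal term (assumed form, not from the paper).

    Encourages the pairwise similarity structure among videos to agree with
    the pairwise similarity structure among their matched texts.
    """
    v = l2_normalize(video_emb)
    t = l2_normalize(text_emb)
    return float(np.mean((v @ v.T - t @ t.T) ** 2))


def dual_constraint_loss(video_emb, text_emb, margin=0.2, weight=1.0):
    """Weighted sum of the two terms; `weight` is a hypothetical trade-off parameter."""
    inter = inter_modal_ranking_loss(video_emb, text_emb, margin)
    intra = intra_modal_structure_loss(video_emb, text_emb)
    return inter + weight * intra


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    videos = rng.normal(size=(4, 16))  # stand-in for aggregated video embeddings
    texts = rng.normal(size=(4, 16))   # stand-in for sentence embeddings
    print(dual_constraint_loss(videos, texts))
```

In this sketch the inter-modal term is zero once every matched pair beats all mismatched pairs by the margin, while the intra-modal term is zero when the two modalities share the same neighborhood structure within the batch.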

Keywords:

Computer science; Feature (linguistics); Modal; Embedding; Constraint (computer-aided design); Graph; Modality (human–computer interaction); Artificial intelligence; Feature learning; Information retrieval; Semantic feature; Ranking (information retrieval); Pattern recognition (psychology); Theoretical computer science; Mathematics

Metrics

Cited By

0.61

FWCI (Field Weighted Citation Impact)

Refs

0.69

Citation Normalized Percentile

Topics

Multimodal Machine Learning Applications

Physical Sciences → Computer Science → Computer Vision and Pattern Recognition

Advanced Image and Video Retrieval Techniques

Physical Sciences → Computer Science → Computer Vision and Pattern Recognition

Video Analysis and Summarization

Physical Sciences → Computer Science → Computer Vision and Pattern Recognition

Related Documents

Adversarial Graph Attention Network for Multi-modal Cross-Modal Retrieval

Text-Enhanced Graph Attention Hashing for Cross-Modal Retrieval

Multi-Level Cross-Modal Semantic Alignment Network for Video–Text Retrieval

Adversarial Multi-Grained Embedding Network for Cross-Modal Text-Video Retrieval

Multi-Modal Relational Graph for Cross-Modal Video Moment Retrieval