JOURNAL ARTICLE

Multi-Feature Graph Attention Network for Cross-Modal Video-Text Retrieval

Abstract

Cross-modal retrieval between videos and texts has attracted growing attention due to the rapid growth of user-generated videos on the web. Most approaches to this problem learn a joint embedding space in which cross-modal similarity can be measured, while paying little attention to how each modality is represented. A video carries more than the commonly used visual features: its audio and on-screen captions also contain rich information. Recent methods that aggregate multiple video features have improved benchmark results for video-text retrieval. However, they usually handle each feature independently, which ignores the exchange of high-level semantic relations among these features. Moreover, beyond the inter-modal ranking constraint, under which semantically similar texts and videos should lie close together, the modality-specific requirement that two similar videos (or two similar texts) have similar representations is also important. In this paper, we propose a novel Multi-Feature Graph ATtention Network (MFGATN) for cross-modal video-text retrieval. Specifically, we introduce a multi-feature graph attention module, which enriches the representation of each video feature by exchanging high-level semantic information among the features. We also design a novel Dual Constraint Ranking Loss (DCRL), which simultaneously considers the inter-modal ranking constraint and the intra-modal structure constraint, so that both cross-modal semantic similarity and modality-specific consistency are preserved in the embedding space. Experiments on two datasets, MSR-VTT and MSVD, demonstrate that our method achieves a significant performance gain over state-of-the-art methods.
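
The abstract does not give the formulation of the attention module or of DCRL, so the following is only a rough NumPy sketch of the two ideas it names, not the authors' method. Everything in it is an assumption made for illustration: the function names, the parameter-free dot-product attention (a simplification standing in for a learned graph attention layer), the hinge-based ranking term, the similarity-matching form of the intra-modal term, and the `margin` and `weight` values.

```python
import numpy as np


def l2_normalize(x, eps=1e-8):
    """Scale each row to (approximately) unit length."""
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)


def feature_attention(nodes):
    """One attention pass over a fully connected graph of feature nodes.

    nodes: (num_features, dim), one row per video feature
    (for example visual, audio, on-screen text).
    Assumption: plain dot-product attention with no learned weights.
    """
    dim = nodes.shape[1]
    scores = nodes @ nodes.T / np.sqrt(dim)
    scores = scores - scores.max(axis=1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=1, keepdims=True)  # row-wise softmax
    return nodes + weights @ nodes  # residual: each node absorbs the others


def inter_modal_ranking_loss(video_emb, text_emb, margin=0.2):
    """Bidirectional hinge ranking loss; row i of each input is a matched pair."""
    sim = l2_normalize(video_emb) @ l2_normalize(text_emb).T
    n = sim.shape[0]
    pos = np.diag(sim)
    cost_text = np.maximum(0.0, margin + sim - pos[:, None])   # video as anchor
    cost_video = np.maximum(0.0, margin + sim - pos[None, :])  # text as anchor
    off_diagonal = 1.0 - np.eye(n)  # ignore the matched pairs themselves
    return float(((cost_text + cost_video) * off_diagonal).sum() / n)


def intra_modal_structure_loss(raw, emb):
    """Assumed stand-in for the intra-modal constraint: pairwise cosine
    similarities in the embedding space should match those of the inputs."""
    sim_raw = l2_normalize(raw) @ l2_normalize(raw).T
    sim_emb = l2_normalize(emb) @ l2_normalize(emb).T
    return float(np.mean((sim_raw - sim_emb) ** 2))


def dual_constraint_loss(video_emb, text_emb, video_raw, text_raw,
                         margin=0.2, weight=0.5):
    """Inter-modal ranking term plus weighted intra-modal terms (hypothetical weighting)."""
    inter = inter_modal_ranking_loss(video_emb, text_emb, margin)
    intra = (intra_modal_structure_loss(video_raw, video_emb)
             + intra_modal_structure_loss(text_raw, text_emb))
    return inter + weight * intra
```

In this sketch the ranking term vanishes once every matched video-text pair beats all mismatched pairs by at least `margin`, while the structure term stays positive for as long as the embedding distorts the pairwise similarities within a modality.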

Keywords:
Computer science; Feature (linguistics); Modal; Embedding; Constraint (computer-aided design); Graph; Modality (human–computer interaction); Artificial intelligence; Feature learning; Information retrieval; Semantic feature; Ranking (information retrieval); Pattern recognition (psychology); Theoretical computer science; Mathematics

Metrics

Cited By: 12
FWCI (Field Weighted Citation Impact): 0.61
Refs: 19
Citation Normalized Percentile: 0.69

Topics

Multimodal Machine Learning Applications
Physical Sciences →  Computer Science →  Computer Vision and Pattern Recognition
Advanced Image and Video Retrieval Techniques
Physical Sciences →  Computer Science →  Computer Vision and Pattern Recognition
Video Analysis and Summarization
Physical Sciences →  Computer Science →  Computer Vision and Pattern Recognition