JOURNAL ARTICLE

Multi-Modal Relational Graph for Cross-Modal Video Moment Retrieval

Abstract

Given an untrimmed video and a query sentence, cross-modal video moment retrieval aims to rank pre-segmented video moment candidates and return the one that best matches the query sentence. Pioneering work typically learns the representations of the textual and visual content separately and then obtains the interactions or alignments between the modalities. However, the task is not yet thoroughly addressed, because it also requires identifying the fine-grained differences among video moment candidates that are highly repetitive and similar. Moreover, the relations among objects in both video and sentence are intuitive and efficient cues for understanding semantics but are rarely considered. Toward this end, we contribute a multi-modal relational graph that captures the interactions among objects from the visual and textual content in order to identify the differences among similar video moment candidates. Specifically, we first introduce a visual relational graph and a textual relational graph to form relation-aware representations via message propagation. Thereafter, a multi-task pre-training is designed to capture domain-specific knowledge about objects and relations, enhancing the structured visual representation once relations are explicitly defined. Finally, graph matching and boundary regression are employed to perform the cross-modal retrieval. We conduct extensive experiments on two datasets about daily activities and cooking activities, demonstrating significant improvements over state-of-the-art solutions.
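
The abstract mentions forming relation-aware representations via message propagation over a relational graph. As a rough illustration of that general idea only, the following minimal sketch runs one round of mean-aggregation message passing on a tiny directed graph. It is not the authors' model: the node names, the feature vectors, the fixed 0.5/0.5 mixing rule, and the function name `propagate` are all assumptions made for illustration, and there are no learned weights.

```python
def propagate(features, edges, rounds=1):
    """Run simple mean-aggregation message passing (illustrative only).

    features: dict mapping node name -> list of floats (all the same length)
    edges: list of (source, target) pairs; messages flow source -> target
    """
    current = {node: list(vec) for node, vec in features.items()}
    for _ in range(rounds):
        # Collect the incoming messages for every node.
        inbox = {node: [] for node in current}
        for src, dst in edges:
            inbox[dst].append(current[src])

        updated = {}
        for node, vec in current.items():
            messages = inbox[node]
            if not messages:
                # A node with no incoming edges keeps its feature unchanged.
                updated[node] = list(vec)
                continue
            dim = len(vec)
            mean_msg = [sum(m[i] for m in messages) / len(messages)
                        for i in range(dim)]
            # Assumed update rule: equal mix of own feature and mean message.
            updated[node] = [0.5 * vec[i] + 0.5 * mean_msg[i]
                             for i in range(dim)]
        current = updated
    return current


# Hypothetical object nodes and relations, e.g. "person holds cup", "cup on table".
features = {"person": [1.0, 0.0], "cup": [0.0, 1.0], "table": [0.0, 0.0]}
edges = [("person", "cup"), ("cup", "person"), ("cup", "table")]
result = propagate(features, edges)
```

After one round, each node's vector is blended with the features of the objects it is related to, which is the sense in which a representation becomes "relation-aware" in this toy setting. A real system would typically use learned, relation-specific transformations in place of the fixed averaging assumed here.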

Keywords:
Computer science, Modal, Sentence, Information retrieval, Graph, Artificial intelligence, Moment (physics), Natural language processing, Theoretical computer science

Metrics

Cited By: 78
FWCI (Field Weighted Citation Impact): 6.34
Refs: 66
Citation Normalized Percentile: 0.97
Is in top 1%
Is in top 10%

Topics

Multimodal Machine Learning Applications
Physical Sciences →  Computer Science →  Computer Vision and Pattern Recognition
Advanced Image and Video Retrieval Techniques
Physical Sciences →  Computer Science →  Computer Vision and Pattern Recognition
Video Analysis and Summarization
Physical Sciences →  Computer Science →  Computer Vision and Pattern Recognition

Related Documents

JOURNAL ARTICLE

Multi-Modal Cross-Domain Alignment Network for Video Moment Retrieval

Xiang Fang, Daizong Liu, Pan Zhou, Yuchong Hu

Journal:   IEEE Transactions on Multimedia Year: 2022 Vol: 25 Pages: 7517-7532
JOURNAL ARTICLE

Cross-Modal Interaction Network for Video Moment Retrieval

Ping Shen, Xiao Jiang, Zean Tian, Ronghui Cao, Weiming Chi, Shenghong Yang

Journal:   International Journal of Pattern Recognition and Artificial Intelligence Year: 2023 Vol: 37 (08)
DISSERTATION

Multi-modal video retrieval

Luca Rossetto

University:   edoc (University of Basel) Year: 2018