Multi-Modal Relational Graph for Cross-Modal Video Moment Retrieval

Yawen Zeng; Da Cao; Xiaochi Wei; Meng Liu; Zhou Zhao; Zheng Qin

doi:10.1109/cvpr46437.2021.00225

Year: 2021 Pages: 2215-2224

Abstract

Given an untrimmed video and a query sentence, cross-modal video moment retrieval aims to rank the video moment that best matches the query sentence from a set of pre-segmented video moment candidates. Pioneering work typically learns the representations of the textual and visual content separately and then obtains the interactions or alignments between the modalities. However, the task is not yet thoroughly addressed, as it further requires identifying the fine-grained differences among video moment candidates that are highly repetitive and similar. Moreover, the relations among objects in both the video and the sentence are intuitive and efficient for understanding semantics but are rarely considered. Toward this end, we contribute a multi-modal relational graph that captures the interactions among objects in the visual and textual content in order to identify the differences among similar video moment candidates. Specifically, we first introduce a visual relational graph and a textual relational graph to form relation-aware representations via message propagation. Thereafter, a multi-task pre-training is designed to capture domain-specific knowledge about objects and relations, enhancing the structured visual representation once the relations are explicitly defined. Finally, graph matching and boundary regression are employed to perform the cross-modal retrieval. We conduct extensive experiments on two datasets, one about daily activities and one about cooking activities, demonstrating significant improvements over state-of-the-art solutions.
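
The abstract mentions forming relation-aware representations via message propagation but does not give the update rule. As a rough illustration only, the sketch below runs one round of generic mean-aggregation message passing over a small object graph. The function name `propagate`, the `self_weight` mixing parameter, and the toy object features are all assumptions made for this example and are not taken from the paper.

```python
from typing import Dict, List, Tuple

Vector = List[float]


def propagate(
    features: Dict[str, Vector],
    edges: List[Tuple[str, str]],
    self_weight: float = 0.5,
) -> Dict[str, Vector]:
    """One round of mean-aggregation message passing on an undirected graph.

    Hypothetical helper: a generic update, not the paper's formulation.
    """
    # Build an adjacency list from the undirected edge list.
    neighbors: Dict[str, List[str]] = {node: [] for node in features}
    for a, b in edges:
        neighbors[a].append(b)
        neighbors[b].append(a)

    updated: Dict[str, Vector] = {}
    for node, own in features.items():
        if not neighbors[node]:
            # Isolated nodes receive no messages and keep their features.
            updated[node] = list(own)
            continue
        # Average the neighbors' features dimension by dimension.
        mean_msg = [
            sum(features[n][i] for n in neighbors[node]) / len(neighbors[node])
            for i in range(len(own))
        ]
        # Mix the node's own features with the aggregated message.
        updated[node] = [
            self_weight * own[i] + (1.0 - self_weight) * mean_msg[i]
            for i in range(len(own))
        ]
    return updated


# Toy object graph (assumed): "person" relates to "cup", "cup" to "table".
features = {"person": [1.0, 0.0], "cup": [0.0, 1.0], "table": [0.0, 0.0]}
edges = [("person", "cup"), ("cup", "table")]
relation_aware = propagate(features, edges)
```

After one round, each node's vector blends in information from the objects it is related to, which is the general sense in which a representation becomes relation-aware. A learned model would typically replace the fixed averaging with trainable weights.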

Keywords:

Computer science; Modal; Sentence; Information retrieval; Graph; Artificial intelligence; Moment (physics); Natural language processing; Theoretical computer science

Metrics

FWCI (Field Weighted Citation Impact): 6.34

Citation Normalized Percentile: 0.97

Topics

Multimodal Machine Learning Applications

Advanced Image and Video Retrieval Techniques

Video Analysis and Summarization

All three topics fall under: Physical Sciences → Computer Science → Computer Vision and Pattern Recognition

Related Documents

Multi-Modal Cross-Domain Alignment Network for Video Moment Retrieval

Cross-Modal Interaction Network for Video Moment Retrieval

Adversarial Graph Attention Network for Multi-modal Cross-Modal Retrieval

Multi-Feature Graph Attention Network for Cross-Modal Video-Text Retrieval

Multi-modal video retrieval