Adversarial Multi-Grained Embedding Network for Cross-Modal Text-Video Retrieval

Ning Han; Jingjing Chen; Hao Zhang; Huan-Wen Wang; Hao Chen

doi:10.1145/3483381

JOURNAL ARTICLE

Year: 2022; Journal: ACM Transactions on Multimedia Computing, Communications, and Applications; Vol: 18 (2); Pages: 1-23; Publisher: Association for Computing Machinery

Abstract

Cross-modal retrieval between texts and videos has received consistent research interest in the multimedia community. Existing studies typically learn a joint embedding space in which the distance between text and video representations can be measured. In common practice, the video representation is constructed by feeding clips into 3D convolutional neural networks to extract a coarse-grained global visual feature. In addition, several studies have attempted to align local objects in the video with the text. However, these representations share a drawback: they neglect rich fine-grained relation features that capture spatial-temporal object interactions, which benefit the mapping of textual entities in real-world retrieval systems. To tackle this problem, we propose an adversarial multi-grained embedding network (AME-Net), a novel cross-modal retrieval framework that adopts both fine-grained local relation features and coarse-grained global features to bridge the text and video modalities. Additionally, with the newly proposed visual representation, we integrate an adversarial learning strategy into AME-Net to further narrow the domain gap between text and video representations. In summary, we contribute AME-Net with an adversarial learning strategy for learning a better joint embedding space, and experimental results on the MSR-VTT and YouCook2 datasets demonstrate that our proposed framework consistently outperforms state-of-the-art methods.
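As a rough illustration of retrieval in a joint embedding space, the sketch below ranks videos for a text query by cosine similarity, with each video represented by a fusion of a coarse-grained global feature and a fine-grained relation feature. It is a toy example under stated assumptions, not the AME-Net implementation: the function names, the concatenation-based fusion, and the hand-written vectors are all hypothetical, the learned encoders and projection layers are omitted, and the adversarial learning strategy is not modeled.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (returned unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def fuse_video_features(global_feat, relation_feat):
    """Hypothetical fusion: concatenate the normalized coarse-grained global
    feature and the normalized fine-grained relation feature, then renormalize.
    The paper's actual fusion is not described in the abstract."""
    return l2_normalize(l2_normalize(global_feat) + l2_normalize(relation_feat))


def cosine_similarity(a, b):
    """Cosine similarity between two vectors of equal length."""
    a, b = l2_normalize(a), l2_normalize(b)
    return sum(x * y for x, y in zip(a, b))


def rank_videos(text_embedding, video_embeddings):
    """Return video indices sorted from most to least similar to the text."""
    scores = [cosine_similarity(text_embedding, v) for v in video_embeddings]
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)


# Toy data: assumed to already live in the same 4-dimensional joint space.
videos = [
    fuse_video_features([1.0, 0.0], [0.0, 1.0]),
    fuse_video_features([0.0, 1.0], [1.0, 0.0]),
]
query = [1.0, 0.0, 0.0, 1.0]  # hypothetical text embedding
ranking = rank_videos(query, videos)
```

In a trained system the text and video embeddings would come from learned encoders mapped into a shared space; here the vectors are chosen by hand so that the first video aligns with the query.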

Keywords:

Computer science; Embedding; Feature learning; Artificial intelligence; Bridging (networking); Convolutional neural network; Deep learning; Adversarial system; Representation (politics); Relation (database); Modal; Information retrieval; Machine learning; Data mining

Metrics

FWCI (Field Weighted Citation Impact): 1.86

Citation Normalized Percentile: 0.84

Topics

Multimodal Machine Learning Applications

Physical Sciences → Computer Science → Computer Vision and Pattern Recognition

Human Pose and Action Recognition

Physical Sciences → Computer Science → Computer Vision and Pattern Recognition

Advanced Image and Video Retrieval Techniques

Physical Sciences → Computer Science → Computer Vision and Pattern Recognition

Related Documents

Fine-grained Cross-modal Alignment Network for Text-Video Retrieval

Region-Aware Cross-Modal Embedding for Fine-Grained Text-To-Video Retrieval

Multi-grained encoding and joint embedding space fusion for video and text cross-modal retrieval

Multi-label adversarial fine-grained cross-modal retrieval

Multi-Level Cross-Modal Semantic Alignment Network for Video–Text Retrieval