JOURNAL ARTICLE

Adversarial Multi-Grained Embedding Network for Cross-Modal Text-Video Retrieval

Ning Han, Jingjing Chen, Hao Zhang, Huan-Wen Wang, Hao Chen

Year: 2022
Journal: ACM Transactions on Multimedia Computing, Communications, and Applications
Vol: 18 (2)
Pages: 1-23
Publisher: Association for Computing Machinery

Abstract

Cross-modal retrieval between texts and videos has received consistent research interest in the multimedia community. Existing studies follow a trend of learning a joint embedding space in which the distance between text and video representations can be measured. In common practice, the video representation is constructed by feeding clips into 3D convolutional neural networks to extract coarse-grained global visual features. In addition, several studies have attempted to align the local objects of a video with the text. However, these representations share a drawback: they neglect rich fine-grained relation features that capture spatial-temporal object interactions, which benefit the mapping of textual entities in real-world retrieval systems. To tackle this problem, we propose an adversarial multi-grained embedding network (AME-Net), a novel cross-modal retrieval framework that adopts both fine-grained local relation features and coarse-grained global features in bridging the text and video modalities. Additionally, with the newly proposed visual representation, we integrate an adversarial learning strategy into AME-Net to further narrow the domain gap between text and video representations. In summary, we contribute AME-Net with an adversarial learning strategy for learning a better joint embedding space, and experimental results on the MSR-VTT and YouCook2 datasets demonstrate that our proposed framework consistently outperforms state-of-the-art methods.
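
The retrieval step that the abstract describes, ranking videos by their distance to a text query in a joint embedding space, can be illustrated with a minimal sketch. This is not the AME-Net implementation: the function names, the concatenation-based fusion of global and relation features, the use of cosine similarity, and the toy vectors are all assumptions made for illustration, and the adversarial training component is not shown.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (returned unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def cosine_similarity(a, b):
    """Cosine similarity between two vectors of equal length."""
    a_n, b_n = l2_normalize(a), l2_normalize(b)
    return sum(x * y for x, y in zip(a_n, b_n))


def fuse_multi_grained(global_feat, relation_feat):
    """Hypothetical fusion: concatenate a normalized coarse-grained global
    feature with a normalized fine-grained relation feature.
    The paper's actual fusion is not specified in the abstract."""
    return l2_normalize(global_feat) + l2_normalize(relation_feat)


def rank_videos(text_emb, video_embs):
    """Return video ids sorted from most to least similar to the text query."""
    scores = {vid: cosine_similarity(text_emb, emb)
              for vid, emb in video_embs.items()}
    return sorted(scores, key=scores.get, reverse=True)


# Toy data (made up): two videos, each with a 2-d global and a 2-d relation feature.
videos = {
    "cooking": fuse_multi_grained([1.0, 0.0], [0.9, 0.1]),
    "soccer": fuse_multi_grained([0.0, 1.0], [0.1, 0.9]),
}

# A text query assumed to be already projected into the same 4-d joint space.
query = [1.0, 0.0, 1.0, 0.0]

ranking = rank_videos(query, videos)
print(ranking)
```

In this toy setup the query points in the same direction as the "cooking" embedding, so that video ranks first. In a trained system the text and video encoders would be learned so that matching pairs end up close together in the joint space.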

Keywords:
Computer science; Embedding; Feature learning; Artificial intelligence; Bridging (networking); Convolutional neural network; Deep learning; Adversarial system; Representation (politics); Relation (database); Modal; Information retrieval; Machine learning; Data mining

Metrics

Cited By: 15
FWCI (Field Weighted Citation Impact): 1.86
Refs: 66
Citation Normalized Percentile: 0.84

Topics

Multimodal Machine Learning Applications
Physical Sciences →  Computer Science →  Computer Vision and Pattern Recognition
Human Pose and Action Recognition
Physical Sciences →  Computer Science →  Computer Vision and Pattern Recognition
Advanced Image and Video Retrieval Techniques
Physical Sciences →  Computer Science →  Computer Vision and Pattern Recognition
