Learning Joint Embedding with Multimodal Cues for Cross-Modal Video-Text Retrieval

Niluthpol Chowdhury Mithun; Juncheng Li; Florian Metze; Amit K. Roy–Chowdhury

doi:10.1145/3206025.3206064

JOURNAL ARTICLE

Year: 2018 Pages: 19-27

Abstract

Constructing a joint representation that is invariant across modalities (e.g., video, language) is important in many multimedia applications. Although recent work has developed effective image-text retrieval methods by learning joint representations, the video-text retrieval task has not been explored as fully. In this paper, we study how to effectively utilize the multimodal cues available in videos for cross-modal video-text retrieval. Based on our analysis, we propose a novel framework that simultaneously utilizes multimodal features (different visual characteristics, audio inputs, and text) through a fusion strategy for efficient retrieval. Furthermore, we explore several loss functions for training the embedding and propose a modified pairwise ranking loss for the task. Experiments on the MSVD and MSR-VTT datasets demonstrate that our method achieves a significant performance gain over state-of-the-art approaches.
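The abstract does not spell out the modified pairwise ranking loss, so the following is only a minimal sketch of the standard bidirectional max-margin ranking loss that joint-embedding methods commonly start from, not the authors' modification. The function names, the cosine similarity choice, and the `margin` default are assumptions made for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def pairwise_ranking_loss(video_embs, text_embs, margin=0.2):
    """Generic bidirectional max-margin ranking loss over a batch.

    Assumption: video_embs[i] and text_embs[i] form a matching pair,
    and every other pairing in the batch is treated as a negative.
    This is a baseline form, not the paper's modified loss.
    """
    n = len(video_embs)
    # sim[i][j] = similarity between video i and text j
    sim = [[cosine(v, t) for t in text_embs] for v in video_embs]
    loss = 0.0
    for i in range(n):
        pos = sim[i][i]
        for j in range(n):
            if j == i:
                continue
            # Video i as the query, text j as a negative.
            loss += max(0.0, margin - pos + sim[i][j])
            # Text i as the query, video j as a negative.
            loss += max(0.0, margin - pos + sim[j][i])
    return loss
```

With this form, a batch whose matching pairs already score higher than every mismatched pair by at least `margin` contributes zero loss, and each violated pair adds a penalty proportional to the size of the violation.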

Keywords:

Computer science; Video retrieval; Embedding; Joint (building); Task (project management); Artificial intelligence; Pairwise comparison; Ranking (information retrieval); Modal; Feature learning; Invariant (physics); Modalities; Natural language processing; Information retrieval; Speech recognition

Metrics

Cited By: 250

FWCI (Field Weighted Citation Impact): 17.18

Refs

Citation Normalized Percentile: 0.99

Is in top 1%

Is in top 10%

Topics

Multimodal Machine Learning Applications

Physical Sciences → Computer Science → Computer Vision and Pattern Recognition

Advanced Image and Video Retrieval Techniques

Physical Sciences → Computer Science → Computer Vision and Pattern Recognition

Domain Adaptation and Few-Shot Learning

Physical Sciences → Computer Science → Artificial Intelligence

Related Documents

Joint embeddings with multimodal cues for video-text retrieval

Learning Joint Embedding for Cross-Modal Retrieval

Learning Text-image Joint Embedding for Efficient Cross-modal Retrieval with Deep Feature Engineering

Webly Supervised Joint Embedding for Cross-Modal Image-Text Retrieval

Multi-grained encoding and joint embedding space fusion for video and text cross-modal retrieval