Polysemous Visual-Semantic Embedding for Cross-Modal Retrieval

Yale Song; Mohammad Soleymani

doi:10.1109/cvpr.2019.00208

JOURNAL ARTICLE

Year: 2019 Pages: 1979-1988

Abstract

Visual-semantic embedding aims to find a shared latent space where related visual and textual instances are close to each other. Most current methods learn injective embedding functions that map an instance to a single point in the shared space. Unfortunately, injective embedding cannot effectively handle polysemous instances with multiple possible meanings; at best, it would find an average representation of different meanings. This hinders its use in real-world scenarios where individual instances and their cross-modal associations are often ambiguous. In this work, we introduce Polysemous Instance Embedding Networks (PIE-Nets) that compute multiple and diverse representations of an instance by combining global context with locally-guided features via multi-head self-attention and residual learning. To learn visual-semantic embedding, we tie up two PIE-Nets and optimize them jointly in the multiple instance learning framework. Most existing work on cross-modal retrieval focuses on image-text pairs. Here, we also tackle the more challenging case of video-text retrieval. To facilitate further research in video-text retrieval, we release a new dataset of 50K video-sentence pairs collected from social media, dubbed MRW (my reaction when). We demonstrate our approach on both image-text and video-text retrieval scenarios using MS-COCO, TGIF, and our new MRW dataset.
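
The idea in the abstract (several embeddings per instance, each formed as global context plus an attention-pooled local term, then matched across modalities by the best-scoring pair) can be illustrated with a small NumPy sketch. This is not the authors' implementation: the abstract gives no layer details, so the attention parameterization (`W1`, `W2`), the output projection `W_out`, the L2 normalization, and the max-over-pairs score in `mil_similarity` are all assumptions made for illustration, and every function name here is hypothetical.

```python
import numpy as np

def softmax(x, axis):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention_weights(local_feats, W1, W2):
    """K attention distributions over T local positions (assumed form).

    local_feats: (T, D), W1: (D, H), W2: (H, K) -> returns (T, K).
    """
    return softmax(np.tanh(local_feats @ W1) @ W2, axis=0)

def pie_embed(local_feats, global_feat, W1, W2, W_out):
    """Return K unit-norm embeddings of one instance (illustrative only).

    local_feats: (T, D) local features, e.g. image regions, words, frames
    global_feat: (D_out,) global context vector
    W_out:       (D, D_out) projection of the attended local features
    """
    attn = attention_weights(local_feats, W1, W2)   # (T, K)
    attended = attn.T @ local_feats                 # (K, D), one row per head
    # Residual combination: shared global context plus a per-head local term.
    emb = global_feat[None, :] + attended @ W_out   # (K, D_out)
    return emb / np.linalg.norm(emb, axis=1, keepdims=True)

def mil_similarity(emb_a, emb_b):
    """Score two instances by their best-matching pair of embeddings.

    Assumed MIL-style rule: a pair of instances matches if at least one of
    the K x K embedding pairs is close, so take the maximum cosine similarity.
    """
    return float((emb_a @ emb_b.T).max())

# Toy example with random parameters (hypothetical sizes).
rng = np.random.default_rng(0)
T, D, H, K, D_out = 5, 8, 6, 3, 4
W1 = rng.normal(size=(D, H))
W2 = rng.normal(size=(H, K))
W_out = rng.normal(size=(D, D_out))

image_emb = pie_embed(rng.normal(size=(T, D)), rng.normal(size=D_out), W1, W2, W_out)
text_emb = pie_embed(rng.normal(size=(T, D)), rng.normal(size=D_out), W1, W2, W_out)
score = mil_similarity(image_emb, text_emb)
```

In a real system the two modalities would have separate encoders and learned parameters; the shared random weights above only keep the sketch short. Taking the maximum over all pairs means an image and a sentence only need to agree on one of their K meanings to score highly, which is the property a single-point embedding lacks.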

Keywords:

Embedding; Computer science; Artificial intelligence; Injective function; Focus (optics); Information retrieval; Context (archaeology); Representation (politics); Semantics (computer science); Natural language processing; Mathematics

Metrics

Cited By: 247

FWCI (Field Weighted Citation Impact): 13.25

Citation Normalized Percentile: 0.99

Topics

Multimodal Machine Learning Applications

Physical Sciences → Computer Science → Computer Vision and Pattern Recognition

Advanced Image and Video Retrieval Techniques

Physical Sciences → Computer Science → Computer Vision and Pattern Recognition

Domain Adaptation and Few-Shot Learning

Physical Sciences → Computer Science → Artificial Intelligence

Related Documents

Super Visual Semantic Embedding for Cross-Modal Image-Text Retrieval

Multi-view visual semantic embedding for cross-modal image–text retrieval

Learning Controlled Semantic Embedding for Cross-Modal Retrieval

Cross-Modal Semantic Embedding Hashing for Unsupervised Retrieval

Semantic-embedding Guided Graph Network for cross-modal retrieval