Webly Supervised Joint Embedding for Cross-Modal Image-Text Retrieval

Niluthpol Chowdhury Mithun; Rameswar Panda; Evangelos E. Papalexakis; Amit K. Roy-Chowdhury

doi:10.1145/3240508.3240712

JOURNAL ARTICLE

Year: 2018 Pages: 1856-1864

Abstract

Cross-modal retrieval between visual data and natural language descriptions remains a long-standing challenge in multimedia. While recent image-text retrieval methods offer great promise by learning deep representations aligned across modalities, most of these methods are limited by training on small-scale datasets that cover a limited number of images with ground-truth sentences. Moreover, creating a larger dataset by annotating millions of images with sentences is extremely expensive and may lead to a biased model. Inspired by the recent success of webly supervised learning in deep neural networks, we capitalize on readily available web images with noisy annotations to learn a robust image-text joint representation. Specifically, our main idea is to leverage web images and their corresponding tags, along with fully annotated datasets, when training the visual-semantic joint embedding. We propose a two-stage approach that augments a typical supervised pairwise ranking loss formulation with weakly annotated web images to learn a more robust visual-semantic embedding. Experiments on two standard benchmark datasets demonstrate that our method achieves a significant performance gain in image-text retrieval compared to state-of-the-art approaches.
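The abstract refers to a "typical supervised pair-wise ranking loss" without stating it. As a rough illustration only, the sketch below shows one common form of such a loss: a bidirectional hinge loss over cosine similarities, summed over all mismatched image-sentence pairs in a batch. It is not taken from the paper. The function names, the margin value of 0.2, and the choice to sum over all negatives are assumptions made here for illustration, and the two-stage use of web images and tags is not modeled.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (returned unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def cosine_similarity_matrix(image_embs, text_embs):
    """Return scores[i][j] = cosine similarity of image i and sentence j."""
    imgs = [l2_normalize(v) for v in image_embs]
    txts = [l2_normalize(v) for v in text_embs]
    return [[sum(a * b for a, b in zip(img, txt)) for txt in txts] for img in imgs]


def pairwise_ranking_loss(image_embs, text_embs, margin=0.2):
    """Bidirectional hinge ranking loss (hypothetical helper, not the paper's code).

    Assumes image_embs[i] and text_embs[i] form the matching pair and that
    every other item in the batch serves as a negative.
    """
    scores = cosine_similarity_matrix(image_embs, text_embs)
    n = len(scores)
    loss = 0.0
    for i in range(n):
        positive = scores[i][i]
        for j in range(n):
            if j == i:
                continue
            # Image i should score higher with its own sentence than with sentence j.
            loss += max(0.0, margin - positive + scores[i][j])
            # Sentence i should score higher with its own image than with image j.
            loss += max(0.0, margin - positive + scores[j][i])
    return loss


# Matching pairs that are already well separated incur no loss;
# swapping the sentences makes every pair violate the margin.
aligned = pairwise_ranking_loss([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])
swapped = pairwise_ranking_loss([[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]])
```

In this toy batch the loss is zero when each image points in the same direction as its own sentence, and it grows once the pairs are mismatched, which is the behavior a ranking objective of this kind is meant to penalize.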

Keywords:

Computer science; Artificial intelligence; Embedding; Image retrieval; Ground truth; Deep learning; Leverage (statistics); Feature learning; Information retrieval; Benchmark (surveying); Machine learning; Pattern recognition (psychology); Image (mathematics); Natural language processing

Metrics

FWCI (Field Weighted Citation Impact): 7.36

Citation Normalized Percentile: 0.97

Topics

Multimodal Machine Learning Applications

Physical Sciences → Computer Science → Computer Vision and Pattern Recognition

Advanced Image and Video Retrieval Techniques

Physical Sciences → Computer Science → Computer Vision and Pattern Recognition

Domain Adaptation and Few-Shot Learning

Physical Sciences → Computer Science → Artificial Intelligence

Related Documents

Webly Supervised Image-Text Embedding with Noisy Tag Refinement

Image–Text Cross-Modal Retrieval with Instance Contrastive Embedding

Learning Text-image Joint Embedding for Efficient Cross-modal Retrieval with Deep Feature Engineering

Image-Text Embedding with Hierarchical Knowledge for Cross-Modal Retrieval

Super Visual Semantic Embedding for Cross-Modal Image-Text Retrieval