Learning Text-image Joint Embedding for Efficient Cross-modal Retrieval with Deep Feature Engineering

Zhongwei Xie; Ling Liu; Yanzhao Wu; Luo Zhong; Lin Li

doi:10.1145/3490519


JOURNAL ARTICLE


Year: 2021; Journal: ACM Transactions on Information Systems; Vol: 40 (4); Pages: 1-27



Abstract

This article introduces a two-phase deep feature engineering framework for efficient learning of semantics-enhanced joint embedding (SEJE), which clearly separates the deep feature engineering in data preprocessing from the training of the text-image joint embedding model. We use the Recipe1M dataset for the technical description and empirical validation. In preprocessing, we combine deep feature engineering with semantic context features derived from the raw text-image input data. We leverage LSTM to identify key terms, and deep NLP models from the BERT family, TextRank, or TF-IDF to produce ranking scores for those key terms, before generating a vector representation for each key term with Word2vec. We leverage Wide ResNet50 and Word2vec to extract and encode the image category semantics of food images, which helps the semantic alignment of the learned recipe and image embeddings in the joint latent space. In joint embedding learning, we perform deep feature engineering by optimizing the batch-hard triplet loss function with soft-margin and double negative sampling, also taking into account the category-based alignment loss and the discriminator-based alignment loss. Extensive experiments demonstrate that our SEJE approach with deep feature engineering significantly outperforms the state-of-the-art approaches.
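
As a rough illustration of the batch-hard triplet loss with soft margin named in the abstract, the sketch below computes a bidirectional (image-to-recipe and recipe-to-image) version over one mini-batch. It is a minimal sketch under stated assumptions, not the authors' implementation: the function names, the use of Euclidean distance, and the simple averaging are all hypothetical choices, and the double negative sampling, category-based alignment loss, and discriminator-based alignment loss are omitted.

```python
import math


def euclidean(u, v):
    """Euclidean distance between two equal-length vectors (assumed metric)."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def softplus(x):
    """Numerically stable log(1 + exp(x)); this is the 'soft margin'."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def batch_hard_soft_margin_loss(image_emb, recipe_emb):
    """Hypothetical helper: image_emb[i] and recipe_emb[i] form a matching pair.

    For each anchor, the hardest negative is the closest non-matching
    embedding from the other modality within the batch.
    """
    n = len(image_emb)
    if n < 2 or n != len(recipe_emb):
        raise ValueError("need at least two aligned image/recipe pairs")
    total = 0.0
    for i in range(n):
        d_pos = euclidean(image_emb[i], recipe_emb[i])
        # Image anchor: closest recipe that is not its match.
        d_neg_i2r = min(
            euclidean(image_emb[i], recipe_emb[j]) for j in range(n) if j != i
        )
        # Recipe anchor: closest image that is not its match.
        d_neg_r2i = min(
            euclidean(recipe_emb[i], image_emb[j]) for j in range(n) if j != i
        )
        total += softplus(d_pos - d_neg_i2r) + softplus(d_pos - d_neg_r2i)
    return total / (2 * n)


# Toy 2-D embeddings: matching pairs are close in the first batch, swapped in the second.
images = [[0.0, 0.0], [10.0, 0.0]]
aligned_recipes = [[0.0, 1.0], [10.0, 1.0]]
swapped_recipes = [[10.0, 1.0], [0.0, 1.0]]

aligned_loss = batch_hard_soft_margin_loss(images, aligned_recipes)
swapped_loss = batch_hard_soft_margin_loss(images, swapped_recipes)
print(aligned_loss < swapped_loss)
```

The softplus term replaces the fixed margin of the standard hinge-style triplet loss, so the loss keeps decreasing smoothly as matching pairs move closer than their hardest negatives instead of cutting off at zero.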

Keywords:

Joint (building); Embedding; Modal; Feature (linguistics); Computer science; Artificial intelligence; Image (mathematics); Pattern recognition (psychology); Deep learning; Feature learning; Information retrieval; Image retrieval; Feature engineering; Engineering; Structural engineering; Materials science

Metrics

FWCI (Field Weighted Citation Impact): 1.84

Citation Normalized Percentile: 0.87

Topics

Advanced Image and Video Retrieval Techniques

Physical Sciences → Computer Science → Computer Vision and Pattern Recognition

Multimodal Machine Learning Applications

Physical Sciences → Computer Science → Computer Vision and Pattern Recognition

Image Retrieval and Classification Techniques

Physical Sciences → Computer Science → Computer Vision and Pattern Recognition


Related Documents

Webly Supervised Joint Embedding for Cross-Modal Image-Text Retrieval

Deep Learning-based Cross-Modal Image-Text Retrieval

Learning Joint Embedding with Multimodal Cues for Cross-Modal Video-Text Retrieval

A Cross-Modal Image-Text Retrieval System with Deep Learning

Image–Text Cross-Modal Retrieval with Instance Contrastive Embedding