JOURNAL ARTICLE

Learning Disentangled Representation for Cross-Modal Retrieval with Deep Mutual Information Estimation

Abstract

Cross-modal retrieval has attracted growing research interest in recent years for its theoretical and practical significance. This paper proposes a new technique for learning a deep visual-semantic embedding that is more effective and interpretable for cross-modal retrieval. The proposed method employs a two-stage strategy. In the first stage, deep mutual information estimation is incorporated into the objective to maximize the mutual information between the input data and its embedding. In the second stage, an expelling branch is added to the network to disentangle modality-exclusive information from the learned representations. This reduces the impact of modality-exclusive information on the common-subspace representation and improves the interpretability of the learned features. Extensive experiments on two large-scale benchmark datasets demonstrate that our method learns a better visual-semantic embedding and achieves state-of-the-art cross-modal retrieval results.
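The first stage described in the abstract, maximizing mutual information between an input and its embedding, is commonly realized with a contrastive lower bound such as InfoNCE. The sketch below is illustrative only, not the authors' implementation: the function name, cosine-similarity scoring, and temperature value are assumptions, and the paper's actual deep MI estimator may use a different bound.

```python
import numpy as np

def infonce_mi_lower_bound(z_x, z_y, temperature=0.1):
    """InfoNCE lower bound on I(X; Y) from paired embeddings.

    z_x, z_y: (N, D) arrays where row i of each forms a positive pair
    (e.g. an image embedding and its matching caption embedding).
    Returns log N minus the mean contrastive cross-entropy; larger
    values indicate higher estimated mutual information.
    """
    # L2-normalize so the score matrix holds cosine similarities.
    z_x = z_x / np.linalg.norm(z_x, axis=1, keepdims=True)
    z_y = z_y / np.linalg.norm(z_y, axis=1, keepdims=True)
    scores = z_x @ z_y.T / temperature  # (N, N); diagonal = positives

    # Log-softmax over the N candidates for each x_i.
    log_probs = scores - np.log(np.exp(scores).sum(axis=1, keepdims=True))
    n = z_x.shape[0]
    return np.log(n) + np.mean(np.diag(log_probs))
```

Maximizing this quantity with respect to the encoder parameters pushes matched cross-modal pairs together and mismatched pairs apart, which is one standard way to realize the MI-maximization objective the abstract describes. Note the bound is capped at log N, so larger batches allow tighter estimates.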

Keywords:
Cross-modal retrieval, Mutual information, Disentangled representation, Visual-semantic embedding, Interpretability, Feature learning, Subspace learning, Deep learning, Machine learning, Pattern recognition, Information retrieval

Metrics

Cited by: 33
FWCI (Field-Weighted Citation Impact): 1.60
References: 46
Citation Normalized Percentile: 0.87

Topics

Multimodal Machine Learning Applications (Physical Sciences → Computer Science → Computer Vision and Pattern Recognition)
Advanced Image and Video Retrieval Techniques (Physical Sciences → Computer Science → Computer Vision and Pattern Recognition)
Domain Adaptation and Few-Shot Learning (Physical Sciences → Computer Science → Artificial Intelligence)

Related Documents

BOOK-CHAPTER

Variational Deep Representation Learning for Cross-Modal Retrieval

Chen Yang, Zongyong Deng, Tianyu Li, Hao Liu, Libo Liu

Lecture Notes in Computer Science, 2021, pp. 498-510
CONFERENCE PAPER

Disentangled Speaker Representation Learning via Mutual Information Minimization

Sung Hwan Mun, Min Hyun Han, Minchan Kim, Dongjune Lee, Nam Soo Kim

Proceedings of the 2022 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC), 2022, pp. 89-96