JOURNAL ARTICLE

Multimodal Retrieval with Contrastive Pretraining

Hüseyin Fuat Alsan, Ekrem Yildiz, Ege Burak Safdil, Furkan Arslan, Taner Arsan

Year: 2021 Journal: 2021 International Conference on INnovations in Intelligent SysTems and Applications (INISTA) Pages: 1-5

Abstract

In this paper, we present multimodal data retrieval aided by contrastive pretraining. Our approach is to pretrain a contrastive network to assist in multimodal retrieval tasks. We work with multimodal data consisting of image and caption (text) pairs. We present a dual-encoder deep neural network, with an image encoder and a text encoder, that encodes multimodal data (images and text) into representation vectors. These representation vectors are used for similarity-based retrieval. The image encoder is a 2D convolutional network, and the text encoder is a recurrent neural network (Long Short-Term Memory). The MS-COCO 2014 dataset, which contains both images and captions, is used for multimodal training with triplet loss. Before the dual-encoder training, we used a convolutional Siamese network to compute the similarities between images (contrastive pretraining). The motivation is that Siamese networks can aid retrieval, and we seek to show whether they can be used in practice. Finally, we investigated the performance of Siamese-assisted retrieval using the BLEU score metric. We conclude that Siamese networks can help with image-to-text retrieval tasks.
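
As a rough illustration of the triplet loss and similarity-based retrieval mentioned in the abstract, the sketch below works on toy vectors that stand in for encoder outputs. The abstract does not state the similarity measure or the margin, so the cosine similarity, the margin of 0.2, and all function names here are assumptions made for the example, not the authors' implementation.

```python
import math

def cosine_similarity(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def triplet_loss(anchor, positive, negative, margin=0.2):
    """Hinge-style triplet loss on cosine similarity.

    The loss is zero once the anchor is more similar to the positive
    than to the negative by at least `margin` (an assumed value).
    """
    sim_pos = cosine_similarity(anchor, positive)
    sim_neg = cosine_similarity(anchor, negative)
    return max(0.0, margin - sim_pos + sim_neg)

def retrieve(query, candidates, top_k=1):
    """Return indices of the top_k candidates most similar to the query."""
    ranked = sorted(
        range(len(candidates)),
        key=lambda i: cosine_similarity(query, candidates[i]),
        reverse=True,
    )
    return ranked[:top_k]

# Toy embeddings standing in for encoder outputs (hypothetical values).
image_vec = [1.0, 0.0]
caption_vecs = [[0.0, 1.0], [1.0, 0.0], [-1.0, 0.0]]

# Image-to-text retrieval: rank captions by similarity to the image vector.
best = retrieve(image_vec, caption_vecs, top_k=2)
```

In a trained dual encoder, the image and caption vectors would come from the two networks, and the same ranking step would select the captions returned for a query image.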

Keywords:
Computer science; Artificial intelligence; Encoder; Convolutional neural network; Image retrieval; ENCODE; Pattern recognition (psychology); Encoding (memory); Autoencoder; Dual (grammatical number); Natural language processing; Deep learning; Image (mathematics)

Metrics

Cited By: 3
FWCI (Field Weighted Citation Impact): 0.19
Refs: 12
Citation Normalized Percentile: 0.57

Topics

Multimodal Machine Learning Applications
Physical Sciences →  Computer Science →  Computer Vision and Pattern Recognition
Advanced Image and Video Retrieval Techniques
Physical Sciences →  Computer Science →  Computer Vision and Pattern Recognition
Domain Adaptation and Few-Shot Learning
Physical Sciences →  Computer Science →  Artificial Intelligence

Related Documents

JOURNAL ARTICLE

RoCBert: Robust Chinese Bert with Multimodal Contrastive Pretraining

Hui Su, Weiwei Shi, Xiaoyu Shen, Zhou Xiao, Tuo Ji, Jiarui Fang, Jie Zhou

Journal: Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) Year: 2022
JOURNAL ARTICLE

COMPASS: Contrastive Multimodal Pretraining for Autonomous Systems

Shuang Ma, Sai Vemprala, Wenshan Wang, Jayesh K. Gupta, Yale Song, Daniel McDuff, Ashish Kapoor

Journal: 2022 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS) Year: 2022 Pages: 1000-1007
JOURNAL ARTICLE

Multimodal Pain Recognition Based on Contrastive Adversarial Autoencoder Pretraining

Nikolai A. K. Steur, Friedhelm Schwenker

Journal: Machine Learning and Knowledge Extraction Year: 2025 Vol: 7 (4) Pages: 165-165