Abstract

We tackle the tasks of image and text retrieval using a dual-encoder model in which images and text are encoded independently. This model has attracted attention as an approach that enables efficient offline inference by connecting vision and language in the same semantic space; however, it is unclear whether the image encoder of a dual-encoder model can interpret scene-text (i.e., the textual information in images). We propose pre-training methods that encourage a joint understanding of the scene-text and the surrounding visual information. The experimental results demonstrate that our methods improve the retrieval performance of dual-encoder models.
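
To illustrate why independent encoding permits offline inference, the following is a minimal sketch and not the method proposed in this work. It assumes that embeddings have already been produced by some image encoder and some text encoder that share a semantic space; the toy vectors, file names, and the helper names `l2_normalize`, `build_index`, and `retrieve` are all hypothetical. Gallery embeddings are normalized once offline, and a query is then ranked by cosine similarity without re-running the other encoder.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length so a dot product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vec]


def build_index(embeddings):
    """Offline step: normalize every gallery embedding once and store it."""
    return {key: l2_normalize(vec) for key, vec in embeddings.items()}


def retrieve(query_embedding, index, top_k=1):
    """Online step: score one query against the precomputed index."""
    q = l2_normalize(query_embedding)
    scored = [
        (sum(a * b for a, b in zip(q, vec)), key)
        for key, vec in index.items()
    ]
    # Highest cosine similarity first.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [key for _, key in scored[:top_k]]


# Hypothetical toy embeddings standing in for image-encoder outputs.
image_embeddings = {
    "street_sign.jpg": [0.9, 0.1, 0.0],
    "beach.jpg": [0.0, 0.2, 0.9],
}
image_index = build_index(image_embeddings)

# Hypothetical text-encoder output for a query caption.
text_query_embedding = [1.0, 0.0, 0.1]
top = retrieve(text_query_embedding, image_index, top_k=1)
ranking = retrieve(text_query_embedding, image_index, top_k=2)
```

Because the two encoders never attend to each other, the index can be built once and reused for every query; the same functions serve text-to-image and image-to-text retrieval by swapping which modality is indexed.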

Keywords:
Encoder; Computer science; Dual (grammatical number); Artificial intelligence; Computer vision; Image (mathematics); Image retrieval; Encoding (memory); Natural language processing; Information retrieval; Linguistics

Metrics

Cited By: 8
FWCI (Field Weighted Citation Impact): 0.99
Refs: 36
Citation Normalized Percentile: 0.73

Topics

Multimodal Machine Learning Applications
Physical Sciences →  Computer Science →  Computer Vision and Pattern Recognition
Image Retrieval and Classification Techniques
Physical Sciences →  Computer Science →  Computer Vision and Pattern Recognition
Advanced Image and Video Retrieval Techniques
Physical Sciences →  Computer Science →  Computer Vision and Pattern Recognition
