Scene-Text Aware Image and Text Retrieval with Dual-Encoder

Shumpei Miyawaki; Taku Hasegawa; Kyosuke Nishida; Takuma Kato; Jun Suzuki

doi:10.18653/v1/2022.acl-srw.34

JOURNAL ARTICLE

Year: 2022

Abstract

We tackle the tasks of image and text retrieval using a dual-encoder model in which images and text are encoded independently. Because it maps vision and language into the same semantic space, this model has attracted attention as an approach that enables efficient offline inference; however, it is unclear whether the image encoder of a dual-encoder model can interpret scene-text (i.e., the textual information in images). We propose pre-training methods that encourage a joint understanding of the scene-text and the surrounding visual information. The experimental results demonstrate that our methods improve the retrieval performance of dual-encoder models.
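To make the "efficient offline inference" point concrete, the following is a minimal sketch of dual-encoder retrieval, not the authors' implementation. It assumes the image and text encoders already exist and emit fixed-size vectors; the three-dimensional embeddings, the file names, and the helpers `l2_normalize`, `build_index`, and `retrieve` are all hypothetical and chosen only for illustration. The scene-text pre-training methods proposed in the paper are not modeled here.

```python
import math
from typing import Dict, List, Sequence, Tuple

Vector = List[float]


def l2_normalize(vec: Sequence[float]) -> Vector:
    """Scale a vector to unit length so a dot product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vec]


def build_index(image_embeddings: Dict[str, Sequence[float]]) -> Dict[str, Vector]:
    """Offline step: normalize every image embedding once and store it."""
    return {image_id: l2_normalize(vec) for image_id, vec in image_embeddings.items()}


def retrieve(
    query_embedding: Sequence[float],
    index: Dict[str, Vector],
    top_k: int = 3,
) -> List[Tuple[str, float]]:
    """Online step: score one text query against the stored image vectors."""
    query = l2_normalize(query_embedding)
    scores = [
        (image_id, sum(q * v for q, v in zip(query, vec)))
        for image_id, vec in index.items()
    ]
    # Highest cosine similarity first.
    scores.sort(key=lambda pair: pair[1], reverse=True)
    return scores[:top_k]


# Hypothetical embeddings standing in for the outputs of an image encoder.
image_embeddings = {
    "street_sign.jpg": [0.9, 0.1, 0.0],
    "cafe_menu.jpg": [0.1, 0.9, 0.2],
    "beach.jpg": [0.0, 0.2, 0.9],
}
index = build_index(image_embeddings)

# Hypothetical embedding standing in for the output of a text encoder.
text_query_embedding = [0.2, 0.8, 0.1]
results = retrieve(text_query_embedding, index, top_k=2)
print(results[0][0])  # cafe_menu.jpg
```

Because the two encoders never see each other's input, the image side can be embedded and indexed ahead of time, and only the query has to be encoded when a search arrives. Text-to-image retrieval is shown; swapping the roles of the two embedding sets gives image-to-text retrieval.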

Keywords:

Encoder; Computer science; Dual (grammatical number); Artificial intelligence; Computer vision; Image (mathematics); Image retrieval; Encoding (memory); Natural language processing; Information retrieval; Linguistics

Metrics

FWCI (Field Weighted Citation Impact): 0.99

Citation Normalized Percentile: 0.73

Topics

Multimodal Machine Learning Applications

Physical Sciences → Computer Science → Computer Vision and Pattern Recognition

Image Retrieval and Classification Techniques

Physical Sciences → Computer Science → Computer Vision and Pattern Recognition

Advanced Image and Video Retrieval Techniques

Physical Sciences → Computer Science → Computer Vision and Pattern Recognition

Related Documents

Scene Text Aware Image Retargeting

Split-net: Dual transformer encoder with splitting scene text image for script identification

Cross-modal Scene Graph Matching for Relationship-aware Image-Text Retrieval

Text Gestalt: Stroke-Aware Scene Text Image Super-resolution

Asymmetric bi-encoder for image–text retrieval