Masked Text Modeling: A Self-Supervised Pre-training Method for Scene Text Detection

Keran Wang; Hongtao Xie; Yuxin Wang; Dongming Zhang; Yadong Qu; Zuan Gao; Yongdong Zhang

doi:10.1145/3581783.3612370

Year: 2023 Pages: 2006-2015

Abstract

Scene text detection has made great progress recently with the wide use of pre-training. Nonetheless, existing scene text detection methods still suffer from two problems: 1) Limited annotated real data reduces feature robustness. 2) Detectors perform poorly on text lacking visual information. In this paper, we explore the potential of the CLIP model and propose a novel self-supervised Masked Text Modeling (MTM) pre-training method for scene text detection, which can be trained with unlabeled data and improves linguistic reasoning ability under text occlusion. Unlike previous random pixel-level masking methods, MTM performs a targeted, text-aware masking process in an unsupervised manner. Specifically, MTM consists of two steps: text perception and masked text modeling. In the text perception step, benefiting from the text-friendliness of CLIP, a Text Perception Module is proposed to attend to text areas by computing the similarity between the text and image tokens from the CLIP model. In the masked text modeling step, a Text-aware Masking Strategy is designed to mask the text areas, and the Masked Text Modeling Module is used to reconstruct the masked text. Through this reconstruction, MTM learns to reason about the linguistic information of masked text. The robust feature extraction learned by MTM yields a more discriminative representation for text lacking visual information. Moreover, a new text dataset named OcclusionText is proposed to evaluate the robustness of detection methods to text occlusion. Extensive experiments on public benchmarks demonstrate that MTM can boost the performance of existing text detectors.
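
The contrast between random masking and similarity-driven, text-aware masking can be illustrated with a minimal sketch. This is not the authors' implementation: the function names `cosine` and `text_aware_mask`, the `mask_ratio` parameter, and the toy two-dimensional vectors are all hypothetical, and real CLIP tokens would be high-dimensional embeddings produced by the model. The sketch only shows the general idea of scoring each image patch by its similarity to a text embedding and masking the highest-scoring patches.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def text_aware_mask(patch_tokens, text_token, mask_ratio=0.5):
    """Mask the patches most similar to the text embedding (hypothetical helper).

    Returns (similarities, mask), where mask[i] is True if patch i is masked.
    """
    sims = [cosine(p, text_token) for p in patch_tokens]
    n_mask = int(mask_ratio * len(patch_tokens))
    # Rank patch indices from most to least text-like.
    ranked = sorted(range(len(sims)), key=lambda i: sims[i], reverse=True)
    masked = set(ranked[:n_mask])
    return sims, [i in masked for i in range(len(patch_tokens))]

# Toy stand-ins for CLIP embeddings (assumed values, for illustration only).
text_token = [1.0, 0.0]
patch_tokens = [[0.9, 0.1], [0.0, 1.0], [0.8, 0.3], [-0.2, 1.0]]

sims, mask = text_aware_mask(patch_tokens, text_token, mask_ratio=0.5)
print(mask)  # [True, False, True, False]
```

Here the two patches aligned with the text embedding are masked and the others are left visible, whereas random masking would pick patches without regard to where text is likely to appear.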

Keywords:

Computer science; Robustness (evolution); Text recognition; Artificial intelligence; Text detection; Noisy text analytics; Natural language processing; Discriminative model; Masking (illustration); Pattern recognition (psychology); Perception; Speech recognition; Text mining; Image (mathematics); Text graph

Metrics

FWCI (Field Weighted Citation Impact): 2.00

Citation Normalized Percentile: 0.84

Topics

Handwritten Text Recognition Techniques

Physical Sciences → Computer Science → Computer Vision and Pattern Recognition

Vehicle License Plate Recognition

Physical Sciences → Engineering → Media Technology

Advanced Image and Video Retrieval Techniques

Physical Sciences → Computer Science → Computer Vision and Pattern Recognition

Related Documents

Masked Text Pre-Training for Scene Text Detection

Linguistics-aware Masked Image Modeling for Self-supervised Scene Text Recognition

Self-supervised Pre-training of Text Recognizers

Masked Self-supervised Pre-training for Text Recognition Transformers on Large-Scale Datasets

Self-supervised Mutual Learning for Scene Text Detection