JOURNAL ARTICLE

Read, Spell and Repeat: Scene Text Recognition with Vision-Language Circular Refinement

Abstract

Scene Text Recognition (STR) has long been considered an important yet challenging task in the field of computer vision. Recent works have demonstrated that utilizing language information is effective for the visually difficult images, like ones with occultation or blurring. However, the use of language information sometimes leads to the over-correction problem. For out-of-vocabulary samples (e.g. "hou" and "0x4a"), some methods have tended to be biased to language side and over-corrected (e.g. over-correct "hou" to "hot"). This imbalance of vision and language has limited the usage of models in practical scenarios, yet it is rarely occurs for human. To address this issue, we rethink the human's recognition process and propose a model behaving in the order of "Read, Spell and Repeat". It refines the recognition process circularly with vision and language information. With this mechanism, our model integrates vision and language information in a more effective manner, achieving higher accuracy with less parameters compared to baseline and competitive performance with SOTA methods in the standard benchmarks.

Keywords:
Computer science Spell Vocabulary Language model Artificial intelligence Natural language processing Process (computing) Task (project management) Speech recognition Baseline (sea) Computer vision Linguistics

Metrics

0
Cited By
0.00
FWCI (Field Weighted Citation Impact)
24
Refs
0.03
Citation Normalized Percentile
Is in top 1%
Is in top 10%

Topics

Handwritten Text Recognition Techniques
Physical Sciences →  Computer Science →  Computer Vision and Pattern Recognition
Image Processing and 3D Reconstruction
Physical Sciences →  Computer Science →  Computer Vision and Pattern Recognition
Image Retrieval and Classification Techniques
Physical Sciences →  Computer Science →  Computer Vision and Pattern Recognition
© 2026 ScienceGate Book Chapters — All rights reserved.