JOURNAL ARTICLE

Visual-Semantic Refinement Network: Towards Exploring the Capabilities of Decoder in Scene Text Recognition

Abstract

Scene text recognition (STR) has traditionally been treated as a unimodal visual recognition task, and it has progressed under the encoder-decoder framework. Introducing a language model (LM) that exploits semantic contextual relationships has further advanced the task from the language modality. In existing work, however, the LM relies heavily on the output of the decoder in the vision model (VM), while the vision decoder itself lacks semantic and global context awareness. In this paper, we explore the capability of the vision decoder, which previous work has generally overlooked. We propose a Visual-Semantic Refinement Network (VSRN) that provides contextual and semantic guidance to the decoder to strengthen its recognition capability. Through the semantic refinement module, the recognition results of the LM are fed back into the VM, which supplies semantic information and further facilitates the fusion of the two modalities. In the visual refinement module, we propose an adaptive mask strategy and explore the global contextual relationships among visual features to further assist the VM. These two complementary cues jointly strengthen the VM and iteratively improve recognition performance. Experimental results on several scene text recognition benchmarks show that the proposed method is effective and achieves state-of-the-art performance.

Keywords:
Computer science, Artificial intelligence, Natural language processing, Text recognition, Information retrieval, Image (mathematics)

Metrics

Cited By: 0
FWCI (Field Weighted Citation Impact): 0.00
Refs: 43
Citation Normalized Percentile: 0.19

Topics

Handwritten Text Recognition Techniques
Physical Sciences →  Computer Science →  Computer Vision and Pattern Recognition
Image Retrieval and Classification Techniques
Physical Sciences →  Computer Science →  Computer Vision and Pattern Recognition
Image Processing and 3D Reconstruction
Physical Sciences →  Computer Science →  Computer Vision and Pattern Recognition

Related Documents

JOURNAL ARTICLE

Hierarchical visual-semantic interaction for scene text recognition

Liang Diao, Xin Tang, Jun Wang, Guotong Xie, Junlin Hu

Journal: Information Fusion, Year: 2023, Vol: 102, Pages: 102080-102080
JOURNAL ARTICLE

Joint Visual Semantic Reasoning: Multi-Stage Decoder for Text Recognition

Ayan Kumar Bhunia, Aneeshan Sain, Amandeep Kumar, Shuvozit Ghose, Pinaki Nath Chowdhury, Yi-Zhe Song

Journal: 2021 IEEE/CVF International Conference on Computer Vision (ICCV), Year: 2021, Pages: 14920-14929
JOURNAL ARTICLE

Multimodal Visual-Semantic Representations Learning for Scene Text Recognition

Xinjian Gao, Ye Pang, Yuyu Liu, Maokun Han, Jun Yu, Wei Wang, Yuanxu Chen

Journal: ACM Transactions on Multimedia Computing Communications and Applications, Year: 2024, Vol: 20 (7), Pages: 1-18