JOURNAL ARTICLE

VOLTER: Visual Collaboration and Dual-Stream Fusion for Scene Text Recognition

Jianan Li, Xiaoqian Liu, Xin Luo, Xin-Shun Xu

Year: 2024, Journal: IEEE Transactions on Multimedia, Vol: 26, Pages: 6437-6448, Publisher: Institute of Electrical and Electronics Engineers

Abstract

Recently, linguistic-modeling approaches to scene text recognition have gradually become mainstream. They mainly consist of a vision model (VM), a language model (LM), and an optional fusion module, and they typically use the LM and fusion module to iteratively refine the VM's predictions. However, the VM usually consists of a Transformer on top of a ResNet, so the attention mechanism is applied only to the high-level layers of the VM and ignores the internal image dependencies in the dense features at multiple scales. As a result, the VM's predictions become the performance bottleneck. Meanwhile, the visual and language features of these methods reside in separate spaces; fusing them without prior alignment fails to achieve maximum information interaction. To address these issues, we propose Visual cOllaboration and duaL-stream fusion for scene TExt Recognition, VOLTER for short. First, a multi-stage Local-Global Collaboration Vision Model (LGC-VM) is constructed to focus on both local and global features at multiple scales, breaking the vision bottleneck to provide better vision predictions. Second, to explicitly align the feature spaces of the VM and LM, we introduce a Vision-Language Contrastive (VLC) module that encourages positive vision-language pairs to have similar representations. Moreover, a Dual-Stream Feature Enhancement (DSFE) module is proposed for the bidirectional interaction of visual-language features, alleviating the alignment problem between modalities and further performing fusion. Extensive experiments on benchmark datasets demonstrate that the proposed framework achieves state-of-the-art performance.
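
The abstract describes the VLC module only as encouraging positive vision-language pairs to have similar representations; it does not give the loss. As a rough illustration of that idea, the sketch below uses a symmetric InfoNCE-style contrastive loss, which is an assumption and not necessarily the formulation used in VOLTER. The function names, the `temperature` value, and the toy feature vectors are all hypothetical.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def log_softmax_at(logits, index):
    """Numerically stable log-softmax, evaluated at one position."""
    m = max(logits)
    log_sum = m + math.log(sum(math.exp(z - m) for z in logits))
    return logits[index] - log_sum


def contrastive_loss(vision_feats, language_feats, temperature=0.07):
    """Symmetric InfoNCE-style loss (assumed form, not taken from the paper).

    Row i of vision_feats and row i of language_feats form a positive pair;
    every other combination in the batch is treated as a negative.
    """
    v = [l2_normalize(x) for x in vision_feats]
    t = [l2_normalize(x) for x in language_feats]
    n = len(v)
    # Pairwise cosine similarities, sharpened by the temperature.
    sim = [[dot(v[i], t[j]) / temperature for j in range(n)] for i in range(n)]
    # Vision-to-language direction: each row should peak on its diagonal entry.
    v2t = -sum(log_softmax_at(sim[i], i) for i in range(n)) / n
    # Language-to-vision direction: the same, taken over columns.
    t2v = -sum(
        log_softmax_at([sim[i][j] for i in range(n)], j) for j in range(n)
    ) / n
    return 0.5 * (v2t + t2v)


# Toy features: the first batch pairs similar vectors, the second swaps them.
aligned = contrastive_loss([[1.0, 0.0], [0.0, 1.0]], [[0.9, 0.1], [0.1, 0.9]])
misaligned = contrastive_loss([[1.0, 0.0], [0.0, 1.0]], [[0.1, 0.9], [0.9, 0.1]])
```

In this toy setup the loss is lower when matching vision and language features already point in similar directions, which is the behaviour an alignment objective of this kind is meant to reward before fusion.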

Keywords:
Computer science, Bottleneck, Artificial intelligence, Benchmark (surveying), Fusion mechanism, Feature (linguistics), Dual (grammatical number), Pattern recognition (psychology), Computer vision, Fusion

Metrics

Cited By: 4
FWCI (Field Weighted Citation Impact): 2.12
Refs: 63
Citation Normalized Percentile: 0.78

Topics

Multimodal Machine Learning Applications
Physical Sciences →  Computer Science →  Computer Vision and Pattern Recognition
Advanced Image and Video Retrieval Techniques
Physical Sciences →  Computer Science →  Computer Vision and Pattern Recognition
Image Retrieval and Classification Techniques
Physical Sciences →  Computer Science →  Computer Vision and Pattern Recognition
