VOLTER: Visual Collaboration and Dual-Stream Fusion for Scene Text Recognition

Jianan Li; Xiaoqian Liu; Xin Luo; Xin-Shun Xu

doi:10.1109/tmm.2024.3350916

JOURNAL ARTICLE

Year: 2024
Journal: IEEE Transactions on Multimedia
Vol: 26
Pages: 6437-6448
Publisher: Institute of Electrical and Electronics Engineers

Abstract

Approaches that incorporate linguistic modeling have gradually become mainstream in scene text recognition. They typically consist of a vision model (VM), a language model (LM), and an optional fusion module, and they use the LM and the fusion module to iteratively refine the VM's predictions. However, the VM usually consists of a Transformer on top of a ResNet, so the attention mechanism is applied only to the high-level layers of the VM and ignores the internal image dependencies in the dense features at multiple scales. The VM's results therefore become the performance bottleneck. Meanwhile, the visual and language features of these methods reside in separate spaces; fusing them without prior alignment fails to achieve maximum information interaction. To address these issues, we propose Visual cOllaboration and duaL-stream fusion for scene TExt Recognition, VOLTER for short. First, a multi-stage Local-Global Collaboration Vision Model (LGC-VM) is constructed to focus on both local and global features at multiple scales, breaking the vision bottleneck to provide a better vision prediction. Second, to explicitly align the feature spaces of the VM and the LM, we introduce a Vision-Language Contrastive (VLC) module that encourages positive vision-language pairs to have similar representations. Moreover, a Dual-Stream Feature Enhancement (DSFE) module is proposed for the unidirectional interaction of visual-language features to alleviate the alignment problem between modalities and to carry out further fusion. Extensive experiments on benchmark datasets demonstrate that the proposed framework achieves state-of-the-art performance.
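The idea behind the VLC module, encouraging positive vision-language pairs to have similar representations, can be illustrated with a generic contrastive objective. The sketch below is not the paper's formulation, which the abstract does not specify; it assumes a symmetric InfoNCE-style loss over cosine similarities, and the function names and the `temperature` value are hypothetical choices made for illustration only.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (left unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def log_softmax_at(logits, index):
    """Log-probability of position `index` under a softmax over `logits`."""
    m = max(logits)  # subtract the max for numerical stability
    log_sum = m + math.log(sum(math.exp(z - m) for z in logits))
    return logits[index] - log_sum


def contrastive_alignment_loss(vision_feats, language_feats, temperature=0.07):
    """Symmetric InfoNCE-style loss (an assumed stand-in for VLC).

    vision_feats[i] and language_feats[i] form a positive pair; every
    other combination in the batch is treated as a negative.
    """
    n = len(vision_feats)
    assert n > 0 and n == len(language_feats)
    v = [l2_normalize(x) for x in vision_feats]
    t = [l2_normalize(x) for x in language_feats]
    # Temperature-scaled cosine similarity between every vision/language pair.
    sim = [[dot(v[i], t[j]) / temperature for j in range(n)] for i in range(n)]
    # Vision-to-language direction: each row should peak on its diagonal entry.
    v2t = -sum(log_softmax_at(sim[i], i) for i in range(n)) / n
    # Language-to-vision direction: each column should peak on its diagonal entry.
    t2v = -sum(
        log_softmax_at([sim[i][j] for i in range(n)], j) for j in range(n)
    ) / n
    return 0.5 * (v2t + t2v)
```

Under these assumptions the loss is small when each vision feature is most similar to its own language feature and large when the pairing is scrambled, which is the alignment pressure the abstract describes.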

Keywords:

Computer science; Bottleneck; Artificial intelligence; Benchmark (surveying); Fusion mechanism; Feature (linguistics); Dual (grammatical number); Pattern recognition (psychology); Computer vision; Fusion

Metrics

FWCI (Field Weighted Citation Impact): 2.12

Citation Normalized Percentile: 0.78

Topics

Multimodal Machine Learning Applications

Physical Sciences → Computer Science → Computer Vision and Pattern Recognition

Advanced Image and Video Retrieval Techniques

Physical Sciences → Computer Science → Computer Vision and Pattern Recognition

Image Retrieval and Classification Techniques

Physical Sciences → Computer Science → Computer Vision and Pattern Recognition

Related Documents

Visual-Semantic Dual-Decoder Collaboration for Scene Text Recognition

Scene Text Recognition With Dual Encoders

Dual-Stream Based Scene Text Manipulation Detection Method

DSNet: A End‐to‐End Scene Text Spotting Network With Dual‐Stream Feature Fusion

Scene recognition with audio-visual sensor fusion