GDText-VM: an arbitrary-shaped scene text detector based on globally deformable VMamba

Yingnan Zhao; Hu Zheng; F. Ding; Jielin Jiang; Xiaolong Xu

doi:10.1007/s40747-025-01987-6

JOURNAL ARTICLE

Year: 2025 Journal: Complex & Intelligent Systems Vol: 11 (8) Publisher: Springer Science+Business Media

Abstract

Detecting arbitrary-shaped text in natural scenes remains a significant challenge in deep learning research. Contemporary text detectors based on Convolutional Neural Networks (CNNs) struggle to model long-range dependencies effectively. Vision Transformers can in principle model global context through self-attention, but their practical use is constrained by quadratic computational complexity. To address these challenges, this study proposes a novel scene text detector called GDText-VM (Globally Deformable Text-VMamba), built on the deformable VMamba framework. The detector incorporates a global channel-spatial attention mechanism along with Fourier contour modeling. This design enhances the ability to capture long-range dependencies, achieving a global receptive field and rapid convergence while maintaining linear computational complexity. Unlike the original VMamba, GDText-VM integrates deformable convolutions to enhance focus on local regions and reduce reliance on cross-shaped activation patterns. Additionally, to improve GDText-VM's ability to fit text contours in the Fourier domain, this study introduces a Global Attention Shuffle Module (GASM). This module fuses global channel and spatial features, mitigating the impact of feature imbalance on fitting performance and significantly improving text detection accuracy. Comprehensive experiments on Total-Text, CTW1500, and ICDAR2015 compare GDText-VM with classical scene text detection approaches. The results indicate that GDText-VM outperforms state-of-the-art methods in precision, recall, and F-measure while remaining computationally efficient, with 25.88M parameters and 40.83G FLOPs. Notably, GDText-VM achieves F-measure values of 88.5% on Total-Text, 88.9% on CTW1500, and 88.6% on ICDAR2015.
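
The abstract mentions Fourier contour modeling without giving the formulation. As a rough illustration of the general idea only, the sketch below represents a closed contour by a few low-frequency Fourier coefficients and reconstructs it from them. It is not the GDText-VM implementation: the function names, the NumPy FFT formulation, and the choice of k are assumptions made for this example.

```python
import numpy as np

def fourier_signature(contour_xy, k):
    """Keep the 2k+1 lowest-frequency Fourier coefficients of a closed contour.

    contour_xy: (N, 2) array of points sampled along the contour, N >= 2k+1.
    Hypothetical helper for illustration; not taken from the paper.
    """
    n = len(contour_xy)
    z = contour_xy[:, 0] + 1j * contour_xy[:, 1]  # treat each point as x + iy
    coeffs = np.fft.fft(z) / n                    # normalized coefficients c_m
    freqs = np.arange(-k, k + 1)                  # frequencies -k..k
    return freqs, coeffs[freqs % n]               # negative m wraps to index n+m

def reconstruct(freqs, coeffs, n_points):
    """Evaluate z(t) = sum_m c_m * exp(2*pi*i*m*t) at n_points values of t in [0, 1)."""
    t = np.arange(n_points) / n_points
    basis = np.exp(2j * np.pi * t[:, None] * freqs[None, :])
    z = (basis * coeffs[None, :]).sum(axis=1)
    return np.stack([z.real, z.imag], axis=1)

# Usage: an ellipse is described exactly by the frequencies -1, 0, and 1.
theta = 2 * np.pi * np.arange(64) / 64
ellipse = np.stack([3 * np.cos(theta), 2 * np.sin(theta)], axis=1)
freqs, coeffs = fourier_signature(ellipse, k=1)
recon = reconstruct(freqs, coeffs, n_points=64)
max_error = np.abs(recon - ellipse).max()
```

A compact signature of this kind gives a fixed-length regression target for a contour of any shape. More irregular text outlines would need a larger k than the ellipse used here.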

Keywords:

Computational intelligence; Detector; Computer vision; Artificial intelligence; Computer science; Physics; Computer graphics (images); Optics

Metrics

FWCI (Field Weighted Citation Impact): 0.00

Citation Normalized Percentile: 0.18

Topics

Handwritten Text Recognition Techniques

Physical Sciences → Computer Science → Computer Vision and Pattern Recognition

Image Retrieval and Classification Techniques

Physical Sciences → Computer Science → Computer Vision and Pattern Recognition

Advanced Image and Video Retrieval Techniques

Physical Sciences → Computer Science → Computer Vision and Pattern Recognition

Related Documents

UTextNet: A UNet Based Arbitrary Shaped Scene Text Detector

Arbitrary-Shaped Scene Text Recognition with Deformable Ensemble Attention

Boundary-Aware Arbitrary-Shaped Scene Text Detector With Learnable Embedding Network

CDText: Scene text detector based on context-aware deformable transformer

DA-STD: Deformable Attention-Based Scene Text Detection in Arbitrary Shape