JOURNAL ARTICLE

GDText-VM: an arbitrary-shaped scene text detector based on globally deformable VMamba

Yingnan Zhao, Hu Zheng, F. Ding, Jielin Jiang, Xiaolong Xu

Year: 2025; Journal: Complex & Intelligent Systems; Vol: 11 (8); Publisher: Springer Science+Business Media

Abstract

Detecting arbitrary-shaped text in natural scenes remains a significant challenge in deep learning research. Contemporary text detectors based on Convolutional Neural Networks struggle to model long-range dependencies effectively. Vision Transformers can in principle model global context through self-attention, but the quadratic computational complexity of self-attention limits their use in real-world scenarios. To address these challenges, this study proposes GDText-VM (Globally Deformable Text-VMamba), a novel scene text detector built on the deformable VMamba framework that incorporates a global channel-spatial attention mechanism and Fourier contour modeling. The design improves the capture of long-range dependencies, achieving a global receptive field and rapid convergence while maintaining linear computational complexity. Unlike the original VMamba, GDText-VM integrates deformable convolutions to sharpen its focus on local regions and reduce its reliance on cross-shaped activation patterns. To improve how well GDText-VM fits text contours in the Fourier domain, this study also introduces a Global Attention Shuffle Module (GASM), which fuses global channel and spatial features, mitigating the impact of feature imbalance on fitting performance and significantly enhancing text detection accuracy. Comprehensive experiments on Total-Text, CTW1500, and ICDAR2015 compare GDText-VM with classical scene text detection approaches. The results indicate that GDText-VM outperforms state-of-the-art methods in precision, recall, and F-measure while remaining computationally efficient, with 25.88M parameters and 40.83G FLOPs. Notably, GDText-VM achieves F-measure values of 88.5% on Total-Text, 88.9% on CTW1500, and 88.6% on ICDAR2015.
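
The abstract does not spell out how text contours are represented in the Fourier domain. As a rough illustration only, the sketch below assumes a formulation common in Fourier-contour text detection: a closed contour is treated as a complex-valued periodic signal, and only its lowest-frequency Fourier coefficients are kept. The function names `fourier_signature` and `reconstruct_contour`, the `degree` parameter, and the sample count are hypothetical and are not taken from GDText-VM.

```python
import numpy as np

def fourier_signature(contour_xy, degree=5):
    """Keep the 2*degree+1 lowest-frequency Fourier coefficients of a closed contour.

    contour_xy: (n, 2) array of points sampled in order along the contour,
    with n >= 2*degree + 1. This is an assumed encoding, not the paper's.
    """
    pts = np.asarray(contour_xy, dtype=float)
    z = pts[:, 0] + 1j * pts[:, 1]        # treat (x, y) as the complex number x + iy
    n = len(z)
    spectrum = np.fft.fft(z) / n          # c_k for k = 0 .. n-1
    ks = np.arange(-degree, degree + 1)   # frequencies -degree .. degree
    return ks, spectrum[ks % n]           # negative k wraps to the end of the FFT output

def reconstruct_contour(ks, coeffs, num_points=64):
    """Evaluate z(t) = sum_k c_k * exp(2*pi*i*k*t) at evenly spaced t in [0, 1)."""
    t = np.arange(num_points) / num_points
    basis = np.exp(2j * np.pi * t[:, None] * ks[None, :])
    z = (basis * coeffs[None, :]).sum(axis=1)
    return np.stack([z.real, z.imag], axis=1)
```

Under this assumption, a detector would regress a fixed-length coefficient vector per text instance and decode it into a polygon with something like `reconstruct_contour`. A low `degree` yields smooth contours, while a higher one can follow sharper curvature at the cost of more regression targets.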

Keywords:
Computational intelligence; Detector; Computer vision; Artificial intelligence; Computer science; Physics; Computer graphics (images); Optics

Metrics

Cited By: 0
FWCI (Field Weighted Citation Impact): 0.00
Refs: 48
Citation Normalized Percentile: 0.18

Topics

Handwritten Text Recognition Techniques
Physical Sciences →  Computer Science →  Computer Vision and Pattern Recognition
Image Retrieval and Classification Techniques
Physical Sciences →  Computer Science →  Computer Vision and Pattern Recognition
Advanced Image and Video Retrieval Techniques
Physical Sciences →  Computer Science →  Computer Vision and Pattern Recognition