JOURNAL ARTICLE

Scene Text Recognition with Transformer using Multi-patches

Yao Wang, Jong-Eun Ha

Year: 2022 Journal: Journal of Institute of Control Robotics and Systems Vol: 28 (10) Pages: 862-867

Abstract

In this paper, we explore the application of the Vision Transformer (ViT) to the scene text recognition (STR) task. A popular research direction in computer vision, scene text recognition enables computers to read text in natural scenes, such as object labels, text descriptions, and road signs. At present, traditional convolutional neural network (CNN)-based models have better performance. However, for irregular scene text images with complex backgrounds, it is difficult to further improve the performance of CNN-based models on curved text, diverse fonts, distortions, and similar cases. With the application of transformers to computer vision, transformer-based model structures have also developed significantly. Although current transformer-based models can achieve performance similar to that of CNN-based models, they are still at an early stage of application, and there is much room for research and improvement. We propose a multi-scale vertical rectangular patch model (MSVSTR) that makes the transformer-based feature extractor more suitable for text images. Because the patches are arranged in a single direction only, cropping the image into patches better matches the way text is distributed in a text image. In addition, to accommodate the different numbers of characters in different texts and to extract more robust features, vertical rectangular patches of different scales are used to crop the image. Various ablation experiments show that our structure performs better than similar transformer-based STR models. Experiments also show that our structure performs well on seven benchmarks.
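The multi-scale vertical patching idea in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names `vertical_patches` and `multi_scale_patches`, the toy image size, and the patch widths are all assumptions chosen for illustration, and plain Python lists stand in for image tensors to keep the example dependency-free.

```python
from typing import Dict, List, Sequence

Image = List[List[float]]  # grayscale image as rows of pixel values (H x W)


def vertical_patches(image: Image, patch_width: int) -> List[List[float]]:
    """Cut the image into full-height strips and flatten each strip into one token."""
    height, width = len(image), len(image[0])
    if patch_width <= 0 or width % patch_width != 0:
        raise ValueError("patch_width must be positive and divide the image width")
    patches = []
    for left in range(0, width, patch_width):
        # Each strip spans every row, so patches are arranged only along the width.
        strip = [image[row][col]
                 for row in range(height)
                 for col in range(left, left + patch_width)]
        patches.append(strip)
    return patches


def multi_scale_patches(image: Image,
                        patch_widths: Sequence[int]) -> Dict[int, List[List[float]]]:
    """Apply vertical patching once per scale; keys are the patch widths."""
    return {w: vertical_patches(image, w) for w in patch_widths}


# Toy 4 x 8 image whose pixel value encodes its position (row * 8 + col).
image = [[float(r * 8 + c) for c in range(8)] for r in range(4)]

# Hypothetical scales: strips 2 and 4 pixels wide.
scales = multi_scale_patches(image, patch_widths=(2, 4))
for w, patches in scales.items():
    # Prints patch width, number of tokens, and values per token.
    print(w, len(patches), len(patches[0]))
```

Narrower strips yield longer token sequences with less content per token, while wider strips yield shorter sequences that each cover more of the text line. How the abstract's model embeds and combines the tokens from the different scales is not specified there, so that step is left out of this sketch.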

Keywords:
Computer science, Transformer, Convolutional neural network, Artificial intelligence, Extractor, Feature extraction, Pattern recognition (psychology), Computer vision, Engineering, Voltage

Metrics

Cited By: 0
FWCI (Field Weighted Citation Impact): 0.00
Refs: 0
Citation Normalized Percentile: 0.10

Topics

Handwritten Text Recognition Techniques
Physical Sciences → Computer Science → Computer Vision and Pattern Recognition
Vehicle License Plate Recognition
Physical Sciences → Engineering → Media Technology

Related Documents

JOURNAL ARTICLE

Apvit: ViT with adaptive patches for scene text recognition

Ning Zhang, Ce Li, Zongshun Wang, Jialin Ma, Zhiqiang Feng

Journal: Discover Applied Sciences Year: 2025 Vol: 7 (4)

JOURNAL ARTICLE

Scene Text Recognition with Multi-Encoders

Yao Wang, Jong-Eun Ha

Journal: 2022 22nd International Conference on Control, Automation and Systems (ICCAS) Year: 2022 Pages: 1615-1620

JOURNAL ARTICLE

Scene Text Recognition with Multi-decoders

Yao Wang, Jong-Eun Ha

Journal: 2021 21st International Conference on Control, Automation and Systems (ICCAS) Year: 2021 Pages: 1523-1528

JOURNAL ARTICLE

An adaptive n-gram transformer for multi-scale scene text recognition

Xueming Yan, Zhihang Fang, Yaochu Jin

Journal: Knowledge-Based Systems Year: 2023 Vol: 280 Pages: 110964-110964