DISSERTATION

Improving transformer for scene text and handwritten text recognition

Abstract

Scene text recognition (STR) involves reading text from images of natural scenes. The text in such images appears in a wide array of fonts, shapes, and orientations, so many works rely on a rectification network to rectify text images before passing them to the recognition network. However, rectifying an image that does not require it may introduce unwanted distortion, which can turn predictions that would otherwise have been correct into wrong ones. To alleviate the adverse impact of rectification, a portmanteauing of features is presented. The method is introduced into a transformer-based model through a proposed block-matrix initialisation and achieved competitive results. Although the transformer has achieved notable success in various fields, this study identified areas for improvement in its application to STR. Firstly, the vision transformer requires an input image to be resized to a fixed height and width before being split into patches, yet certain patch resolutions were found to yield better accuracy for images with particular original aspect ratios. Secondly, the first decoded character generally has lower accuracy. In view of these issues, the pure transformer with integrated experts (PTIE) is proposed. PTIE can process multiple patch resolutions and decode in both the original and reversed character orders, thereby capitalising on the aforementioned areas and achieving state-of-the-art results. Handwritten text recognition (HTR) deals with handwritten text images from scanned or photographed documents. Works employing transformer-based models often train them with additional synthetic data; however, these data are not publicly available. Furthermore, experiments in this study suggest that a transformer trained only on real HTR data generalises poorly to unseen data. Therefore, within the scope of real data, this thesis presents a simple transformer model that outperformed related works. This is achieved by adopting attention masking, which addresses the generalisation issue, and by introducing various pre-processing methods.
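The patch-splitting step mentioned in the abstract can be sketched as follows. This is a minimal illustration assuming NumPy; the 32x128 canvas and the two patch sizes are illustrative values, not the exact configuration used by PTIE. It shows how two patch resolutions over the same fixed-size image yield token sequences of different lengths and dimensions, which is why one resolution may suit a given original aspect ratio better than another.

```python
import numpy as np

def split_into_patches(img, patch_h, patch_w):
    """Split an (H, W, C) image into non-overlapping (patch_h, patch_w)
    patches, flattening each patch into one token vector (as in ViT)."""
    H, W, C = img.shape
    assert H % patch_h == 0 and W % patch_w == 0
    # Group pixels into a (rows, patch_h, cols, patch_w, C) grid ...
    patches = img.reshape(H // patch_h, patch_h, W // patch_w, patch_w, C)
    # ... reorder to (rows, cols, patch_h, patch_w, C) ...
    patches = patches.transpose(0, 2, 1, 3, 4)
    # ... and flatten to (num_tokens, token_dim).
    return patches.reshape(-1, patch_h * patch_w * C)

# A text image resized to a fixed 32x128 canvas (grayscale, illustrative).
img = np.zeros((32, 128, 1), dtype=np.float32)

# Two candidate patch resolutions produce different token sequences:
tokens_a = split_into_patches(img, 4, 8)    # (8 * 16) = 128 short tokens
tokens_b = split_into_patches(img, 8, 16)   # (4 * 8)  = 32 longer tokens
```

A model restricted to a single resolution must commit to one such trade-off for every input; processing multiple resolutions, as PTIE does, lets the choice adapt to the image.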

Keywords:
Transformer, Computer science, Text recognition, Speech recognition, Natural language processing, Artificial intelligence, Pattern recognition (psychology), Engineering, Electrical engineering, Image (mathematics)



Related Documents

JOURNAL ARTICLE

ResNet50+Transformer: Kazakh offline handwritten text recognition

Y. Amirgaliyev, Mateus Mendes, K. Mukhtar, R. Jantayev, Ch. Kenchimov

Journal: Bulletin of the National Engineering Academy of the Republic of Kazakhstan, Year: 2022, Vol: 84 (2), Pages: 11-24

JOURNAL ARTICLE

TransText: Improving scene text detection via transformer

Jiajun Zhu, Guodong Wang

Journal: Digital Signal Processing, Year: 2022, Vol: 130, Pages: 103698

JOURNAL ARTICLE

Display-Semantic Transformer for Scene Text Recognition

Xinqi Yang, Wushour Silamu, Miaomiao Xu, Yanbing Li

Journal: Sensors, Year: 2023, Vol: 23 (19), Pages: 8159

JOURNAL ARTICLE

Lightweight Scene Text Recognition Based on Transformer

Xin Luan, Jinwei Zhang, Miaomiao Xu, Wushouer Silamu, Yanbing Li

Journal: Sensors, Year: 2023, Vol: 23 (9), Pages: 4490