DISSERTATION

Improving transformer for scene text and handwritten text recognition

Abstract

Scene text recognition (STR) involves reading text from images of natural scenes. The text in such images appears in a wide array of fonts, shapes, and orientations, so many works rely on a rectification network to rectify text images before passing them to the recognition network. However, rectifying an image that does not require it may introduce unwanted distortion, which can turn predictions that would otherwise have been correct into wrong ones. To alleviate the adverse impact of rectification, a portmanteauing of features is presented. The method is introduced into a transformer-based model through a proposed block-matrix initialisation and achieved competitive results. Although the transformer has achieved notable success in various fields, this study identified areas for improvement in its application to STR. Firstly, the vision transformer requires an input image to be resized to a fixed height and width before being split into patches, yet certain patch resolutions were found to yield better accuracy for images with particular original aspect ratios. Secondly, the first decoded character generally has lower accuracy. In view of these issues, the pure transformer with integrated experts (PTIE) is proposed. PTIE can process multiple patch resolutions and decode in both the original and reversed character orders, thereby capitalising on the aforementioned areas and achieving state-of-the-art results. Handwritten text recognition (HTR) deals with handwritten text images from scanned or photographed documents. Works employing transformer-based models often train them with additional synthetic data; however, these data are not publicly available. Furthermore, experiments in this study suggest that a transformer trained only on real HTR data generalises poorly to unseen data. Therefore, within the scope of real data, this thesis presents a simple transformer model that outperformed related works. This is achieved by adopting attention masking, which addresses the generalisation issue, and by introducing various pre-processing methods.
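The patch-splitting step mentioned in the abstract can be sketched as follows. This is a minimal illustration assuming NumPy; the 32x128 canvas and the two patch sizes are illustrative values, not the exact configuration used by PTIE. It shows how two patch resolutions over the same fixed-size image yield token sequences of different lengths and dimensions, which is why one resolution may suit a given original aspect ratio better than another.

```python
import numpy as np

def split_into_patches(img, patch_h, patch_w):
    """Split an (H, W, C) image into non-overlapping (patch_h, patch_w)
    patches, flattening each patch into one token vector (as in ViT)."""
    H, W, C = img.shape
    assert H % patch_h == 0 and W % patch_w == 0
    # Group pixels into a (rows, patch_h, cols, patch_w, C) grid ...
    patches = img.reshape(H // patch_h, patch_h, W // patch_w, patch_w, C)
    # ... reorder to (rows, cols, patch_h, patch_w, C) ...
    patches = patches.transpose(0, 2, 1, 3, 4)
    # ... and flatten to (num_tokens, token_dim).
    return patches.reshape(-1, patch_h * patch_w * C)

# A text image resized to a fixed 32x128 canvas (grayscale, illustrative).
img = np.zeros((32, 128, 1), dtype=np.float32)

# Two candidate patch resolutions produce different token sequences:
tokens_a = split_into_patches(img, 4, 8)    # (8 * 16) = 128 short tokens
tokens_b = split_into_patches(img, 8, 16)   # (4 * 8)  = 32 longer tokens
```

A model restricted to a single resolution must commit to one such trade-off for every input; processing multiple resolutions, as PTIE does, lets the choice adapt to the image.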

Keywords:
Transformer, Computer science, Text recognition, Speech recognition, Natural language processing, Artificial intelligence, Pattern recognition (psychology), Engineering, Electrical engineering, Image (mathematics)



Related Documents

JOURNAL ARTICLE

ResNet50+Transformer: Kazakh offline handwritten text recognition

Y. Amirgaliyev, Mateus Mendes, K. Mukhtar, R. Jantayev, Ch. Kenchimov

Journal: Bulletin of the National Engineering Academy of the Republic of Kazakhstan, Year: 2022, Vol: 84 (2), Pages: 11-24

JOURNAL ARTICLE

TransText: Improving scene text detection via transformer

Jiajun Zhu, Guodong Wang

Journal: Digital Signal Processing, Year: 2022, Vol: 130, Pages: 103698

JOURNAL ARTICLE

Display-Semantic Transformer for Scene Text Recognition

Xinqi Yang, Wushour Silamu, Miaomiao Xu, Yanbing Li

Journal: Sensors, Year: 2023, Vol: 23 (19), Pages: 8159

JOURNAL ARTICLE

Lightweight Scene Text Recognition Based on Transformer

Xin Luan, Jinwei Zhang, Miaomiao Xu, Wushouer Silamu, Yanbing Li

Journal: Sensors, Year: 2023, Vol: 23 (9), Pages: 4490