JOURNAL ARTICLE

Text Post-processing on Optical Character Recognition output using Natural Language Processing Methods

Abstract

Optical character recognition (OCR) is the technique of converting images of printed or handwritten text, such as scanned documents, photographed documents, or ordinary photos, into machine-encoded text. OCR has proven very useful for digitizing documents and making them easier to analyze. Despite advances in the technology since its introduction, there are still areas where OCR falls short. If the written text is illegible or the OCR software is not powerful enough, the output contains inaccurate transcriptions. This work aims to address this shortcoming by post-processing OCR output, primarily using Transformer models such as BERT in a two-step pipeline, to correct these mistakes and improve the quality of the digitized document.
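
The abstract does not say what the two steps of the pipeline are. As a rough illustration only, the sketch below assumes a common split: error detection followed by error correction. Everything in it is hypothetical and not taken from the paper: the toy vocabulary `VOCAB`, the function names, and the 0.6 similarity cutoff. To keep the example self-contained, the masked language model is replaced by a stand-in (`propose`) that returns every vocabulary word, and candidates are ranked by string similarity with `difflib`; a BERT-style model would instead propose candidates from the context around the `[MASK]` position.

```python
import difflib
import re

# Hypothetical toy vocabulary; a real system would use a large lexicon.
VOCAB = {"the", "quick", "brown", "fox", "jumps", "over", "lazy", "dog"}


def detect_errors(tokens, vocab):
    """Step 1 (assumed): flag word-like tokens that are not in the vocabulary."""
    return [
        i
        for i, tok in enumerate(tokens)
        if re.fullmatch(r"\w+", tok)
        and not tok.isdigit()          # leave plain numbers alone
        and tok.lower() not in vocab
    ]


def propose(masked_tokens, index, vocab):
    """Stand-in for a masked language model.

    A BERT-style model would rank candidates for the [MASK] slot using the
    surrounding context. This placeholder ignores the context and simply
    returns the whole vocabulary.
    """
    return sorted(vocab)


def correct(text, vocab=VOCAB, propose_fn=propose):
    """Step 2 (assumed): replace each flagged token with the closest candidate."""
    tokens = re.findall(r"\w+|[^\w\s]", text)
    for i in detect_errors(tokens, vocab):
        masked = tokens[:i] + ["[MASK]"] + tokens[i + 1:]
        candidates = propose_fn(masked, i, vocab)
        # Keep only candidates that look like the OCR token (ratio >= 0.6).
        best = difflib.get_close_matches(tokens[i].lower(), candidates, n=1, cutoff=0.6)
        if best:
            tokens[i] = best[0]  # note: original casing is not restored
    return " ".join(tokens)


if __name__ == "__main__":
    print(correct("the qu1ck brown f0x"))
```

Passing a different `propose_fn` is the point where a real masked language model could be plugged in; the similarity filter then keeps only suggestions that resemble the OCR token, so fluent but unrelated words are not substituted.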

Keywords:
Computer science; Character (mathematics); Natural language processing; Character recognition; Optical character recognition; Artificial intelligence; Speech recognition; Natural (archaeology); Signal processing; Digital signal processing; Computer hardware; Mathematics

Metrics

Cited By: 3
FWCI (Field Weighted Citation Impact): 0.55
Refs: 20
Citation Normalized Percentile: 0.63

Topics

Handwritten Text Recognition Techniques
Physical Sciences →  Computer Science →  Computer Vision and Pattern Recognition
Advanced Text Analysis Techniques
Physical Sciences →  Computer Science →  Artificial Intelligence
Vehicle License Plate Recognition
Physical Sciences →  Engineering →  Media Technology