Optical character recognition errors and their effects on natural language processing

Daniel Lopresti

doi:10.1145/1390749.1390753

JOURNAL ARTICLE

Year: 2008 Pages: 9-16

Abstract

Errors are unavoidable in advanced computer vision applications such as optical character recognition, and the noise induced by these errors presents a serious challenge to downstream processes that attempt to make use of such data. In this paper, we apply a new paradigm we have proposed for measuring the impact of recognition errors on the stages of a standard text analysis pipeline: sentence boundary detection, tokenization, and part-of-speech tagging. Our methodology formulates error classification as an optimization problem solvable using a hierarchical dynamic programming approach. Errors and their cascading effects are isolated and analyzed as they travel through the pipeline. We present experimental results based on a large collection of scanned pages to study the varying impact depending on the nature of the error and the character(s) involved. The problem of identifying tabular structures that should not be parsed as sentential text is also discussed.
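As a rough illustration of treating error classification as an optimization problem, the sketch below aligns a ground-truth string with OCR output using ordinary character-level edit distance and labels each mismatch as a substitution, deletion, or insertion. This is a simplified, single-level stand-in and not the hierarchical dynamic programming method of the paper; the function name `align_errors` and the unit edit costs are assumptions made for the example.

```python
from typing import List, Tuple

Edit = Tuple[str, str, str]  # (kind, truth_char, ocr_char)


def align_errors(truth: str, ocr: str) -> Tuple[int, List[Edit]]:
    """Align truth with OCR output; return (edit distance, list of edits).

    Hypothetical helper: unit cost for every substitution, deletion,
    and insertion (an assumption, not taken from the paper).
    """
    n, m = len(truth), len(ocr)
    # cost[i][j] = minimum number of edits turning truth[:i] into ocr[:j]
    cost = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        cost[i][0] = i
    for j in range(1, m + 1):
        cost[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            diag = cost[i - 1][j - 1] + (truth[i - 1] != ocr[j - 1])
            cost[i][j] = min(diag, cost[i - 1][j] + 1, cost[i][j - 1] + 1)

    # Trace back from the bottom-right cell to recover one optimal alignment.
    edits: List[Edit] = []
    i, j = n, m
    while i > 0 or j > 0:
        if (i > 0 and j > 0
                and cost[i][j] == cost[i - 1][j - 1] + (truth[i - 1] != ocr[j - 1])):
            if truth[i - 1] != ocr[j - 1]:
                edits.append(("substitute", truth[i - 1], ocr[j - 1]))
            i, j = i - 1, j - 1
        elif i > 0 and cost[i][j] == cost[i - 1][j] + 1:
            edits.append(("delete", truth[i - 1], ""))   # character lost by OCR
            i -= 1
        else:
            edits.append(("insert", "", ocr[j - 1]))     # spurious OCR character
            j -= 1
    edits.reverse()
    return cost[n][m], edits


if __name__ == "__main__":
    print(align_errors("the", "tbe"))
    print(align_errors("modern", "rnodern"))
```

Running the same kind of alignment again over sentence, token, and tag sequences would be one way to follow an error through the pipeline stages named in the abstract, though how the paper composes its levels is not specified here.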

Keywords:

Computer science; Pipeline (software); Lexical analysis; Parsing; Character (mathematics); Optical character recognition; Sentence; Artificial intelligence; Natural language processing; Speech recognition; Noise (video); Error detection and correction; Natural language; Pattern recognition (psychology); Algorithm; Programming language; Image (mathematics)

Metrics

FWCI (Field Weighted Citation Impact): 3.99

Citation Normalized Percentile: 0.96

Topics

Natural Language Processing Techniques

Physical Sciences → Computer Science → Artificial Intelligence

Handwritten Text Recognition Techniques

Physical Sciences → Computer Science → Computer Vision and Pattern Recognition

Topic Modeling

Physical Sciences → Computer Science → Artificial Intelligence

Related Documents

Refining optical character recognition results with natural language processing techniques

Text Post-processing on Optical Character Recognition output using Natural Language Processing Methods

A Novel Pipeline for Improving Optical Character Recognition through Post-processing Using Natural Language Processing

Attention-Based Deep Learning Algorithm in Natural Language Processing for Optical Character Recognition