Digitizing printed documents has long been a challenge for the computing community. Digitized text not only lets users easily modify and reprint printed documents, but has also become a necessity now that word search is widely available. Converting a printed document into a stream of characters using optical character recognition (OCR) techniques is a widely researched area, and a number of well-established algorithms for it are available in the literature. However, preserving the formatting information of the original document has received far less study. The contribution of this paper is twofold: (1) applying known OCR techniques to Sinhala, one of Sri Lanka's native languages, and addressing the challenges in doing so, and (2) maintaining a number of selected formatting features of the printed document in the converted editable text. Accordingly, this paper outlines the design and implementation of a software system that converts a scanned paper document written in the Sinhala language into formatted editable text, and describes how this system is integrated into an open-source word processing tool.
Katarzyna Węgrzyn-Wolska, Piotr S. Szczepaniak
Selena He, Meng Han, Nidhibahen Patel, Zhigang Li