JOURNAL ARTICLE

Document processing for automatic knowledge acquisition

Yuan Tang, Chang Yan, Ching Y. Suen

Year: 1994   Journal: IEEE Transactions on Knowledge and Data Engineering   Vol: 6 (1)   Pages: 3-21   Publisher: IEEE Computer Society

Abstract

The knowledge acquisition bottleneck has become the major impediment to the development and application of effective information systems. To remove this bottleneck, new document processing techniques must be introduced to automatically acquire knowledge from various types of documents. By presenting a survey of the techniques and problems involved, this paper aims to serve as a catalyst for research in automatic knowledge acquisition through document processing. In this study, a document is considered to have two structures: a geometric structure and a logical structure. These play a key role in knowledge acquisition, which can be viewed as a process of acquiring these two structures. Extracting the geometric structure from a document is referred to as document analysis; mapping the geometric structure into the logical structure is regarded as document understanding. Both areas are described in this paper, and the basic concept of document structure and its measurement based on entropy analysis is introduced. Logical structure and geometric models are proposed. Both top-down and bottom-up approaches and their entropy analyses are presented. Different techniques are discussed with practical examples. Mapping methods, such as tree transformation, document formatting knowledge, and document format description language, are described.

Keywords:
Computer science; Disk formatting; Bottleneck; Knowledge acquisition; Information retrieval; Document processing; Process (computing); Tree structure; Document Structure Description; Data structure; Document layout analysis; Well-formed document; Knowledge extraction; Data mining; Natural language processing; Artificial intelligence; Document type definition; XML; World Wide Web; Programming language

Metrics

Cited By: 90
FWCI (Field Weighted Citation Impact): 7.27
Refs: 92
Citation Normalized Percentile: 0.97

Topics

Handwritten Text Recognition Techniques
Physical Sciences →  Computer Science →  Computer Vision and Pattern Recognition
Image Processing and 3D Reconstruction
Physical Sciences →  Computer Science →  Computer Vision and Pattern Recognition
Semantic Web and Ontologies
Physical Sciences →  Computer Science →  Artificial Intelligence