Metadata Extraction from Office Documents

William K. Stumbo; John C. Handley

doi:10.2352/issn.2168-3204.2005.2.1.art00040

Year: 2005; Journal: Archiving Conference; Vol: 2 (1); Pages: 184-187

Abstract

This paper focuses on layout-based techniques for automatically extracting metadata when office documents are scanned into an archive. Many office documents, such as letters, inter-office memos, and invoices, contain key information that is spatially arranged. Information arranged in this way is easy for a reader to identify and understand. However, unlike forms, where the layout is static, the location of information within office documents varies greatly from one document to the next. This poses a challenge for layout-based metadata extraction techniques. Our system uses regular expression matching and stochastic grammars on lines of text to efficiently and accurately label text according to its function, enabling archived documents to be precisely retrieved.
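As a rough illustration of the idea in the abstract, the sketch below labels lines of text by function using regular expressions and then picks the most likely label sequence with a first-order Markov model, which is one simple kind of stochastic (regular) grammar. It is not the authors' system: the label set, the patterns, the weights, and the function name `label_lines` are all assumptions made for this example, and the sketch ignores the spatial layout cues the paper relies on.

```python
import math
import re

# Hypothetical label set and patterns; the paper does not list its own.
PATTERNS = {
    "date": re.compile(
        r"\b(?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)[a-z]*\.? \d{1,2}, \d{4}\b"
        r"|\b\d{1,2}/\d{1,2}/\d{2,4}\b"
    ),
    "salutation": re.compile(r"^\s*Dear\b", re.IGNORECASE),
    "subject": re.compile(r"^\s*(?:Re|Subject)\s*:", re.IGNORECASE),
    "closing": re.compile(r"^\s*(?:Sincerely|Regards|Yours truly)\b", re.IGNORECASE),
}
LABELS = list(PATTERNS) + ["body"]

# Illustrative, unnormalized transition weights (assumed, not estimated
# from data). Pairs not listed here fall back to a low default weight.
TRANSITIONS = {
    ("date", "salutation"): 0.4,
    ("salutation", "subject"): 0.4,
    ("salutation", "body"): 0.4,
    ("subject", "body"): 0.4,
    ("body", "body"): 0.4,
    ("body", "closing"): 0.4,
}


def transition(prev, cur):
    """Weight of label `cur` following label `prev`."""
    return TRANSITIONS.get((prev, cur), 0.05)


def emission(label, line):
    """Weight of `line` given `label`, driven by which regexes match."""
    matched = [name for name, rx in PATTERNS.items() if rx.search(line)]
    if label == "body":
        return 0.05 if matched else 0.9
    return 0.9 if label in matched else 0.01


def label_lines(lines):
    """Return the highest-scoring label sequence (Viterbi, log space)."""
    if not lines:
        return []
    scores = [{lab: math.log(emission(lab, lines[0])) for lab in LABELS}]
    back = []
    for line in lines[1:]:
        prev, cur, ptr = scores[-1], {}, {}
        for lab in LABELS:
            best = max(LABELS, key=lambda p: prev[p] + math.log(transition(p, lab)))
            cur[lab] = (prev[best] + math.log(transition(best, lab))
                        + math.log(emission(lab, line)))
            ptr[lab] = best
        scores.append(cur)
        back.append(ptr)
    # Trace back from the best final label to recover the full path.
    path = [max(LABELS, key=lambda lab: scores[-1][lab])]
    for ptr in reversed(back):
        path.append(ptr[path[-1]])
    return list(reversed(path))


letter = [
    "March 3, 2005",
    "Dear Ms. Smith,",
    "Re: Invoice 1042",
    "Please find the attached invoice.",
    "Sincerely,",
]
print(label_lines(letter))
# → ['date', 'salutation', 'subject', 'body', 'closing']
```

The regular expressions supply local evidence for each line, while the transition weights encode which orderings of fields are plausible. A system of the kind described would presumably estimate such weights from labeled documents rather than set them by hand.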

Keywords: Metadata; Computer science; Information retrieval; Function (biology); Key (lock); Matching (statistics); Information extraction; World Wide Web

Topics

Handwritten Text Recognition Techniques

Physical Sciences → Computer Science → Computer Vision and Pattern Recognition

Natural Language Processing Techniques

Physical Sciences → Computer Science → Artificial Intelligence

Image Processing and 3D Reconstruction

Physical Sciences → Computer Science → Computer Vision and Pattern Recognition
