Vision and natural language for metadata extraction from scientific PDF documents

Zeyd Boukhers; Azeddine Bouabdallah

doi:10.1145/3529372.3533295

JOURNAL ARTICLE

Year: 2022 Pages: 1-5

Abstract

The difficulty of automatically extracting metadata from scientific PDF documents depends on the diversity of layouts within the PDF collection. In some disciplines, such as German social sciences, authors are not required to follow a specific template and often create their own, which yields high diversity of appearance across publications. Overcoming this diversity using only Natural Language Processing (NLP) approaches is not always effective, which is reflected in the unavailability of metadata for a large portion of German social science publications. Therefore, we propose in this paper a multimodal neural network model that employs NLP together with Computer Vision (CV) for metadata extraction from scientific PDF documents. The aim is to benefit from both modalities to increase the overall accuracy of metadata extraction. Extensive experiments with the proposed model on around 8800 documents demonstrated its effectiveness over unimodal models, with an overall F1 score of 92.3%.
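
To make the idea of combining two modalities concrete, the following is a minimal sketch of late fusion, one common way to merge a text-based and a vision-based classifier. It is not the architecture described in the paper: the abstract does not specify how the modalities are combined, and the function names (`fuse`, `predict`, `f1_score`), the label names, and the `text_weight` parameter are all hypothetical. The sketch also includes the standard F1 formula, the harmonic mean of precision and recall, which is the metric the abstract reports.

```python
from typing import Dict

# Hypothetical type: per-label probabilities produced by one modality.
Probs = Dict[str, float]


def fuse(text_probs: Probs, vision_probs: Probs, text_weight: float = 0.5) -> Probs:
    """Late fusion: weighted average of two per-label probability distributions.

    This is an illustrative assumption, not the paper's fusion method.
    """
    if not 0.0 <= text_weight <= 1.0:
        raise ValueError("text_weight must be in [0, 1]")
    labels = set(text_probs) | set(vision_probs)
    fused = {
        label: text_weight * text_probs.get(label, 0.0)
        + (1.0 - text_weight) * vision_probs.get(label, 0.0)
        for label in labels
    }
    total = sum(fused.values())
    # Renormalise so the result sums to 1 even if the inputs did not.
    if total > 0:
        fused = {label: p / total for label, p in fused.items()}
    return fused


def predict(text_probs: Probs, vision_probs: Probs, text_weight: float = 0.5) -> str:
    """Return the label with the highest fused probability."""
    fused = fuse(text_probs, vision_probs, text_weight)
    return max(fused, key=fused.get)


def f1_score(precision: float, recall: float) -> float:
    """F1 is the harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    # Made-up scores for one text block: the text model leans towards "author",
    # while the layout cue (for example, large font at the top) suggests "title".
    text_probs = {"title": 0.4, "author": 0.6}
    vision_probs = {"title": 0.9, "author": 0.1}
    print(predict(text_probs, vision_probs))
    print(f1_score(0.9, 0.8))
```

In this toy input the text scores alone would pick "author", while the fused scores pick "title", which illustrates how a visual cue can correct a text-only decision. The numbers are invented for illustration and are unrelated to the reported results.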

Keywords: Metadata; Computer science; German; Information retrieval; Modalities; Unavailability; Natural language processing; Diversity (politics); Information extraction; Artificial neural network; World Wide Web; Artificial intelligence; Linguistics; Statistics

Metrics

FWCI (Field Weighted Citation Impact): 2.28

Citation Normalized Percentile: 0.88

Topics

Web Data Mining and Analysis

Physical Sciences → Computer Science → Information Systems

Natural Language Processing Techniques

Physical Sciences → Computer Science → Artificial Intelligence

Mathematics, Computing, and Information Processing

Physical Sciences → Computer Science → Computational Theory and Mathematics

Related Documents

AUTOMATIC METADATA EXTRACTION FROM SCIENTIFIC PDF DOCUMENTS

Potential of natural language processing for metadata extraction from environmental scientific publications

Metadata Extraction from Office Documents

Keyphrases Extraction from Scientific Documents: Improving Machine Learning Approaches with Natural Language Processing

Figure Metadata Extraction from Digital Documents