JOURNAL ARTICLE

Vision and natural language for metadata extraction from scientific PDF documents

Abstract

The difficulty of automatically extracting metadata from scientific PDF documents depends on the diversity of layouts within the collection. In some disciplines, such as German social sciences, authors are not required to follow a specific template and often create their own, which yields highly diverse appearances across publications. Natural Language Processing (NLP) approaches alone are not always effective at overcoming this diversity, as reflected in the lack of metadata for a large portion of German social science publications. We therefore propose a multimodal neural network model that combines NLP with Computer Vision (CV) for metadata extraction from scientific PDF documents. The aim is to benefit from both modalities to increase the overall accuracy of metadata extraction. Extensive experiments with the proposed model on around 8800 documents showed that it outperforms unimodal models, achieving an overall F1 score of 92.3%.
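
The abstract reports an overall F1 score but does not say how it is aggregated across metadata fields. As a hedged illustration only, the sketch below computes per-field F1 and a micro-averaged F1 from true-positive, false-positive, and false-negative counts. The field names and counts are hypothetical and are not taken from the paper; micro-averaging is an assumption, not the authors' stated protocol.

```python
def f1_score(tp: int, fp: int, fn: int) -> float:
    """Harmonic mean of precision and recall, computed from raw counts."""
    if tp == 0:
        # No correct predictions: precision and recall are zero or undefined.
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Hypothetical (tp, fp, fn) counts per metadata field; not from the paper.
counts = {
    "title": (90, 10, 10),
    "author": (80, 20, 20),
}

# F1 for each field separately.
per_field = {field: f1_score(*c) for field, c in counts.items()}

# Micro-average: pool the counts over all fields, then compute F1 once.
tp, fp, fn = (sum(c[i] for c in counts.values()) for i in range(3))
micro_f1 = f1_score(tp, fp, fn)

print(per_field, micro_f1)
```

A macro-average (the plain mean of the per-field scores) is an equally common choice and can differ from the micro-average when fields have unequal support.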

Keywords:
Metadata; Computer science; German; Information retrieval; Modalities; Unavailability; Natural language processing; Diversity (politics); Information extraction; Artificial neural network; World Wide Web; Artificial intelligence; Linguistics; Statistics

Metrics

Cited By: 6
FWCI (Field Weighted Citation Impact): 2.28
Refs: 18
Citation Normalized Percentile: 0.88

Topics

Web Data Mining and Analysis
Physical Sciences →  Computer Science →  Information Systems
Natural Language Processing Techniques
Physical Sciences →  Computer Science →  Artificial Intelligence
Mathematics, Computing, and Information Processing
Physical Sciences →  Computer Science →  Computational Theory and Mathematics