JOURNAL ARTICLE

Retrieval augmented generation for building datasets from scientific literature

Piyush Ranjan Maharana, Ashwini Verma, Kavita Joshi

Year: 2025   Journal: Journal of Physics Materials   Vol: 8 (3)   Pages: 035006-035006   Publisher: IOP Publishing

Abstract

In this work, we show that employing retrieval augmented generation (RAG) with a large language model (LLM) enables us to extract accurate data from scientific literature and construct datasets. The rapid growth in publications necessitates automating the extraction of structured data, which is crucial for training machine learning (ML) models. The pipeline developed is simple and can be adjusted using natural language as input. Quantization enables us to run LLMs on consumer hardware and removes the reliance on closed-source models. Both Llama3-8B and Gemma2-9B with RAG give structured output more consistently and with higher accuracy than direct prompting. Using the newly developed protocol, we created a dataset of metal hydrides for solid-state hydrogen storage from paper abstracts. The accuracy of the generated dataset exceeded 88% in the cases tested. Further, we demonstrate that the generated dataset is ready to use for ML models by testing it with HYST to predict the H2 wt% at a given temperature. Thus, we demonstrate a pipeline to create datasets from scientific literature at minimal computational cost and with high accuracy.
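The general shape of a retrieve-then-extract step, as described in the abstract, can be sketched as follows. This is a minimal illustration and not the authors' implementation: the bag-of-words cosine similarity stands in for whatever retriever the paper uses, the LLM call itself is omitted, and all function names (`retrieve`, `build_prompt`, `parse_record`) and field names are hypothetical.

```python
import json
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase a string and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def cosine(a, b):
    """Cosine similarity between two token-count vectors (Counters)."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query, chunks, k=2):
    """Return the k text chunks most similar to the query (toy retriever)."""
    q = Counter(tokenize(query))
    ranked = sorted(
        chunks, key=lambda c: cosine(q, Counter(tokenize(c))), reverse=True
    )
    return ranked[:k]


def build_prompt(query, context_chunks, fields):
    """Assemble a prompt that asks for a JSON object with the given keys."""
    context = "\n".join(f"- {c}" for c in context_chunks)
    return (
        "Use only the context below to answer.\n"
        f"Context:\n{context}\n\n"
        f"Question: {query}\n"
        f"Reply with a JSON object with keys: {', '.join(fields)}."
    )


def parse_record(reply, fields):
    """Parse a model reply; return None unless it is a JSON object with all fields."""
    try:
        record = json.loads(reply)
    except json.JSONDecodeError:
        return None
    if not isinstance(record, dict) or any(f not in record for f in fields):
        return None
    return record
```

In use, the prompt returned by `build_prompt` would be sent to a locally run model, and `parse_record` would discard replies that are not well-formed, so that only complete records enter the dataset.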

Keywords:
Information retrieval, Computer science

Metrics

Cited By: 0
FWCI (Field Weighted Citation Impact): 0.00
Refs: 44
Citation Normalized Percentile: 0.07

Topics

Semantic Web and Ontologies
Physical Sciences →  Computer Science →  Artificial Intelligence
Data Quality and Management
Social Sciences →  Decision Sciences →  Management Science and Operations Research