JOURNAL ARTICLE

Extraction of Social Determinants of Health From Electronic Health Records Using Natural Language Processing

Z Chen, Patricia Lasserre, Angela Lin, Rasika Rajapakshe

Year: 2025   Journal: JCO Clinical Cancer Informatics   Vol: 9 (9)   Pages: e2400317-e2400317   Publisher: Lippincott Williams & Wilkins

Abstract

PURPOSE Social determinants of health (SDoH) have a significant effect on health outcomes and inequalities. SDoH can be extracted from electronic health records (EHR) to aid policy development and research aimed at improving population health. Automated extraction using artificial intelligence (AI) can improve efficiency and cost-effectiveness. The focus of this study was to autonomously extract comprehensive SDoH details from EHR using a natural language processing (NLP)–based AI pipeline.

MATERIALS AND METHODS A curated set of 1,000 BC Cancer clinical documents with concentrated SDoH information served as the reference standard for training and evaluating NLP models. Two pipelines were used: an open-source pipeline trained on the annotated medical documents and an industrial pretrained solution used as a benchmark. Three experiments optimized the first pipeline's performance, assessing the effect of including subtype word positions during training. The superior open-source pipeline was then used to extract SDoH information from 13,258 oncology documents.

RESULTS The open-source pipeline achieved an average F1 score of 0.88 on the validation data set for extracting 13 SDoH factors, surpassing the benchmark by 5%. It excelled in detailed subtype extraction, whereas the benchmark performed better in identifying SDoH information that was rarely annotated in the BC Cancer data set. Overall, 60,717 SDoH factors and associated details were extracted from BC Cancer EHR oncology documents. The most frequently extracted SDoH factors were tobacco use, employment status, marital status, alcohol consumption, and living status, each occurring between 8,000 and 12,000 times.

CONCLUSION This study demonstrates the potential of an NLP pipeline to extract SDoH factors from clinical notes, with strong performance on limited data, although data set–specific adjustments are needed for broader application across institutions.

Keywords:
Pipeline (software), Benchmark (surveying), Artificial intelligence, Computer science, Social determinants of health, Natural language processing, Machine learning, Social media, Medicine, Public health, World Wide Web, Nursing, Geography

Metrics

Cited By: 2
FWCI (Field Weighted Citation Impact): 17.48
Refs: 11
Citation Normalized Percentile: 0.97

Topics

Food Security and Health in Diverse Populations
Health Sciences →  Health Professions →  General Health Professions
Health Promotion and Cardiovascular Prevention
Health Sciences →  Medicine →  Public Health, Environmental and Occupational Health
Chronic Disease Management Strategies
Health Sciences →  Medicine →  Epidemiology
