JOURNAL ARTICLE

Inferring high-fat dietary patterns from electronic health record data using machine learning

Abstract

Objectives: Electronic health records (EHRs) rarely capture dietary detail, limiting diet–disease research. We aimed to develop machine learning (ML) computable phenotypes to identify high-fat diet (HFD) using variables typically available in EHRs.

Materials and Methods: We used National Health and Nutrition Examination Survey (NHANES) 1999-2020 data, in which 24-h dietary recall served as ground truth. Dietary fat intake was summarized into a score (0-30) based on percent energy from fat, carbohydrate, and protein; lower scores indicated HFD. We defined HFD at cutoffs of 10, 15, and 20, and trained ML models (Extreme Gradient Boosting, logistic regression, random forest) using EHR-compatible variables (demographics, comorbidities, labs, anthropometrics). Model interpretability was assessed using Shapley Additive Explanations. To evaluate clinical relevance, we compared cancer associations estimated with ML-predicted versus true diet labels.

Results: Machine learning models classified HFD with good performance, strongest at broader definitions. Random forest achieved an F1-score of 0.79 (recall 0.74, precision 0.84) at cutoff 20. Key predictors included race/ethnicity, triglycerides, obesity metrics (body mass index and derived indices), and metabolic panel results.

Discussion: These findings indicate that dietary patterns, though seldom recorded in EHRs, can be inferred from routinely available variables. The ability of ML-derived phenotypes to reproduce known diet–disease relationships underscores their epidemiologic validity. Top predictors also align with established biological pathways linking obesity, lipid metabolism, and cancer risk, supporting biological plausibility.

Conclusion: A high-fat dietary pattern can be inferred from EHR-compatible variables using ML-based phenotyping. This approach offers a scalable tool to integrate diet into EHR-based research and precision medicine.
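As a rough illustration of the labeling step and the reported metric, the sketch below binarizes a 0-30 diet score at the three cutoffs and recomputes F1 from the stated precision and recall. It is not the authors' code. The helper names, the example scores, and the choice of a non-strict comparison (score at or below the cutoff counts as HFD) are assumptions, since the abstract does not say how ties at the cutoff were handled.

```python
CUTOFFS = (10, 15, 20)  # HFD cutoffs named in the abstract


def label_hfd(score: float, cutoff: float) -> bool:
    """Return True when a 0-30 diet score is labeled high-fat diet (HFD).

    Assumption: lower scores indicate HFD (per the abstract), and a score
    equal to the cutoff is counted as HFD. The strictness of the
    comparison is not stated in the source.
    """
    if not 0 <= score <= 30:
        raise ValueError("score must lie in the 0-30 range")
    return score <= cutoff


def f1_from_precision_recall(precision: float, recall: float) -> float:
    """F1 is the harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Hypothetical scores, for illustration only (not NHANES data).
example_scores = [4, 12, 18, 25]
labels_by_cutoff = {
    cutoff: [label_hfd(score, cutoff) for score in example_scores]
    for cutoff in CUTOFFS
}

# Precision and recall reported for random forest at cutoff 20.
reported_f1 = f1_from_precision_recall(precision=0.84, recall=0.74)
print(round(reported_f1, 2))  # 0.79
```

Under these assumptions a higher cutoff labels more participants as HFD, which matches the abstract's description of cutoff 20 as a broader definition, and the recomputed F1 agrees with the reported 0.79.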

Keywords:
Interpretability; Random forest; Logistic regression; National Health and Nutrition Examination Survey; Recall; Deep learning; Obesity; Scalability; Health records
