Jason Holmes, Zhengliang Liu, Lian Zhang, Yuzhen Ding, Terence T. Sio, L.A. McGee, Jonathan B. Ashman, Xiang Li, Tianming Liu, Jiajian Shen, Wei Liu
Purpose: We present the first study to investigate Large Language Models (LLMs) in answering radiation oncology physics questions. Because popular exams like AP Physics, the LSAT, and the GRE have large test-taker populations and ample test-preparation resources in circulation, they may not accurately assess the true potential of LLMs. This paper proposes evaluating LLMs on a highly specialized topic, radiation oncology physics, which may be more pertinent to scientific and medical communities while also serving as a valuable benchmark for LLMs.

Methods: We developed an exam consisting of 100 radiation oncology physics questions based on our expertise. Four LLMs, ChatGPT (GPT-3.5), ChatGPT (GPT-4), Bard (LaMDA), and BLOOMZ, were evaluated against medical physicists and non-experts. ChatGPT (GPT-4) was further evaluated by prompting it to explain first and then answer. The deductive reasoning capability of ChatGPT (GPT-4) was evaluated using a novel approach: substituting the correct answer with "None of the above choices is the correct answer." A majority vote analysis was used to approximate how well each group could score when working together.

Results: ChatGPT (GPT-4) outperformed all other LLMs and the medical physicists, on average, with improved accuracy when prompted to explain before answering. ChatGPT (GPT-3.5 and GPT-4) showed a high level of consistency in its answer choices across repeated trials, whether correct or incorrect, a characteristic not observed in the human test groups or Bard (LaMDA). In the deductive reasoning evaluation, ChatGPT (GPT-4) demonstrated surprising accuracy, suggesting the potential presence of an emergent ability. Finally, although ChatGPT (GPT-4) performed well overall, its consistency across trials left no room for further improvement when scoring by majority vote; in contrast, a team of medical physicists was able to greatly outperform ChatGPT (GPT-4) using a majority vote.

Conclusion: This study suggests great potential for LLMs to work alongside radiation oncology experts as highly knowledgeable assistants.
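The majority vote analysis mentioned in the Methods can be sketched as follows. This is a minimal illustration only, not the authors' code: the function name majority_vote_score, the letter-choice encoding, and the arbitrary tie-breaking rule are assumptions made for the example.

```python
from collections import Counter

def majority_vote_score(trials, answer_key):
    """Fraction of questions answered correctly by the trial-wise consensus.

    trials: list of per-trial answer lists, e.g. [["A", "C", ...], ...]
    answer_key: list of correct answers, one per question.
    """
    correct = 0
    for q, key in enumerate(answer_key):
        # Tally each trial's answer to question q and take the consensus.
        votes = Counter(trial[q] for trial in trials)
        consensus, _ = votes.most_common(1)[0]  # ties broken arbitrarily
        correct += (consensus == key)
    return correct / len(answer_key)

# Example: three trials on a four-question exam.
trials = [["A", "B", "C", "D"],
          ["A", "B", "D", "D"],
          ["A", "C", "C", "D"]]
print(majority_vote_score(trials, ["A", "B", "C", "D"]))  # 1.0
```

The sketch also illustrates why the technique helps humans more than ChatGPT (GPT-4): a model that answers identically across trials produces a consensus that simply mirrors any single trial, while a group of humans with less correlated errors can vote away individual mistakes.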