Write a PREreview

Evaluating the accuracy and reliability of large language models in assisting with pediatric differential diagnoses: A multicenter diagnostic study

by Masab A. Mansoor, Andrew F. Ibrahim, David J. Grindem, and Asad Baig

Posted: August 10, 2024
Server: medRxiv
DOI: 10.1101/2024.08.09.24311777

Importance

Large language models, such as GPT-3, have shown potential in assisting with clinical decision-making, but their accuracy and reliability in pediatric differential diagnosis in rural healthcare settings remain underexplored.

Objective

Evaluate the performance of a fine-tuned GPT-3 model in assisting with pediatric differential diagnosis in rural healthcare settings and compare its accuracy to human physicians.

Methods

Retrospective cohort study using data from a multicenter rural pediatric healthcare organization in Central Louisiana serving approximately 15,000 patients. Data from 500 pediatric patient encounters (age range: 0-18 years) between March 2023 and January 2024 were collected and split into training (70%, n=350) and testing (30%, n=150) sets.

Interventions

GPT-3 model (DaVinci version) fine-tuned using OpenAI API on training data for ten epochs.

Main Outcomes and Measures

Accuracy of fine-tuned GPT-3 model in generating differential diagnoses, evaluated using sensitivity, specificity, precision, F1 score, and overall accuracy. The model’s performance was compared to human physicians on the testing set.

Results

The fine-tuned GPT-3 model achieved an accuracy of 87% (131/150) on the testing set, with a sensitivity of 85%, specificity of 90%, precision of 88%, and F1 score of 0.87. The model’s performance was comparable to human physicians (accuracy 91%; P = .47).

Conclusions and Relevance

The fine-tuned GPT-3 model demonstrated high accuracy and reliability in assisting with pediatric differential diagnosis, with performance comparable to human physicians. Large language models could be valuable tools for supporting clinical decision-making in resource-constrained environments. Further research should explore implementation in various clinical workflows.

You can write a PREreview of Evaluating the accuracy and reliability of large language models in assisting with pediatric differential diagnoses: A multicenter diagnostic study. A PREreview is a review of a preprint and can vary from a few sentences to a lengthy report, similar to a journal-organized peer-review report.

Before you start

We will ask you to log in with your ORCID iD. If you don’t have an iD, you can create one.