Live Review

PREreview of All You Need Is Context: Clinician Evaluations of various iterations of a Large Language Model-Based First Aid Decision Support Tool in Ghana

Published
DOI
10.5281/zenodo.13274625
License
CC BY 4.0

This review is the result of a virtual, collaborative live review discussion organized and hosted by PREreview and JMIR Publications on June 20, 2024. The discussion was joined by 15 people: 2 facilitators, 2 members of the JMIR Publications team, 2 authors, and 9 live review participants, including 3 who agreed to be named but did not contribute to the final composition of this review: Aswathi Surendran, Khushboo Thaker, Arya Rahgozar, and Emmanuel Adamolekun. The authors of this review dedicated additional asynchronous time over the course of two weeks to compose this final report using the notes from the Live Review. We thank all participants who contributed to the discussion and made it possible for us to provide feedback on this preprint.

Summary

This study investigates the performance and application of Large Language Models (LLMs) as clinical decision support tools during medical emergencies in the resource-constrained settings of Low- and Middle-Income Countries (LMICs) such as Ghana. The research aims to provide a premise for future research and development of LLM-based clinical decision support tools by assessing the suitability and effectiveness of five configurations of general-purpose LLMs using context-specific prompts. Thirteen medical experts, with an average of three years of experience working in limited-resource environments, evaluated the outputs of these models quantitatively using mean ranking scores and qualitatively using thematic analysis.

The authors used off-the-shelf pre-trained LLMs (GPT-4 Turbo, Gemini 1.5 Pro, and Claude Sonnet) with Prompt Engineering and Retrieval Augmented Generation (RAG) techniques to develop five iterations of a decision support tool. Fifty responses were generated and evaluated. Machine evaluations using conventional machine learning metrics such as BLEU and ROUGE were also performed and compared with the clinicians' evaluations.

Their findings showed that Gemini 1.5 Pro + Prompt Engineering outperformed the other LLMs used in their research, while adjusting the other LLMs with suitable parameters improved their overall performance. This may imply that LLM-based first aid assistants could provide useful instructions for the management and treatment of medical conditions, especially in resource-constrained settings. The practitioners were generally satisfied with the diagnoses and instructions from these LLMs, demonstrating their potential and importance in managing medical emergencies. Future research should involve larger datasets, additional metrics, and more detailed evaluations to refine and enhance the use of LLMs in real-world medical emergencies.

The discussion from participants of this live review is summarized below.

List of major concerns and feedback

Statistical Significance of Differences in Mean Ranking Scores

  • Concern: The paper does not assess whether the differences in mean ranking scores across RAG approaches (reported in Table 2) are statistically significant.

  • Feedback: Perform statistical tests, such as t-tests or the Kruskal–Wallis test by ranks, to determine whether the differences in mean ranking scores are statistically significant. This will add robustness to the findings.
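
As an illustration of the kind of test suggested above, the following is a minimal sketch of a Kruskal–Wallis comparison with a permutation-based p-value, written using only the Python standard library. It is not the authors' analysis: the function names, the number of permutations, and the ranking scores in the example are hypothetical and chosen only to show the mechanics. The H statistic is computed without a tie correction, which is harmless here because the p-value comes from permutations rather than the chi-square approximation.

```python
import random


def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1  # mean of sorted positions i..j, 1-based
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def kruskal_wallis_h(groups):
    """Kruskal-Wallis H statistic (no tie correction) for a list of samples."""
    pooled = [x for g in groups for x in g]
    n = len(pooled)
    ranks = average_ranks(pooled)
    total, start = 0.0, 0
    for g in groups:
        rank_sum = sum(ranks[start:start + len(g)])
        total += rank_sum ** 2 / len(g)
        start += len(g)
    return 12.0 / (n * (n + 1)) * total - 3 * (n + 1)


def permutation_p_value(groups, n_permutations=5000, seed=0):
    """Estimate a p-value by reshuffling scores across groups."""
    rng = random.Random(seed)
    observed = kruskal_wallis_h(groups)
    pooled = [x for g in groups for x in g]
    sizes = [len(g) for g in groups]
    hits = 0
    for _ in range(n_permutations):
        rng.shuffle(pooled)
        shuffled, start = [], 0
        for size in sizes:
            shuffled.append(pooled[start:start + size])
            start += size
        # Count reshuffles at least as extreme as the observed statistic.
        if kruskal_wallis_h(shuffled) >= observed - 1e-9:
            hits += 1
    return observed, (hits + 1) / (n_permutations + 1)


if __name__ == "__main__":
    # Hypothetical ranking scores for three tool iterations (not study data).
    scores = [
        [2.1, 2.4, 1.9, 2.8, 2.2],
        [3.0, 3.3, 2.9, 3.6, 3.1],
        [2.5, 2.7, 3.2, 2.6, 2.3],
    ]
    h, p = permutation_p_value(scores)
    print(f"H = {h:.3f}, permutation p = {p:.4f}")
```

Because the same clinicians appear to have ranked several iterations, the observations may not be independent, so a paired alternative such as the Friedman test could be more appropriate than the sketch above; that choice depends on details of the study design that the authors are best placed to confirm.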

Incomplete Figures

  • Concern: The Figure 2 image is incomplete, with the right side cut off, and the Figure 1 legend is incomplete. In Figure 3, the data are not presented clearly enough to assess the correlation.

  • Feedback: Revise the figures to ensure they are complete and clearly labeled. This will improve the clarity and comprehensibility of the visual data.

Availability of Google Form Reference

  • Concern: The Google form (reference 15) is not available.

  • Feedback: Ensure the Google form is accessible in the supplementary files. This is crucial for transparency and reproducibility.

List of minor concerns and feedback

  • It would be helpful for the reader to see the aim of the work, the main results and the conclusion mentioned in the abstract.

  • Participants were a bit confused about Reference 1 in the authors section and wondered if that was the most appropriate place to cite the project involved with this study.

  • It is unclear if Claude 3.5 Sonnet or Claude 3 Opus was used. Please clarify.

  • It is unclear what is meant by “Low-and Low-Middle-Income countries (LMICs)”. Does it refer to “Low Income Countries (LICs)” or “Lower Middle Income Countries (LMICs)”, the forms more commonly used as defined by the World Bank?

  • In Section E of the Methodology it would be helpful to mention the total number of clinicians involved in the study. In Section G the text says “The first group of 30 responses were evaluated by all 13 physicians. The second group of 20 responses was evaluated by 8 of the physicians.” It would be helpful to know why and how these 8 were selected out of the total 13.

  • In Section F of the Methodology, the text presents a quote by one of the clinicians involved. It would be helpful to understand why this quote is presented in the text.

  • It would be helpful to have more information about the statistical tests used for the quantitative analysis and why.

  • In the Results section there seems to be inconsistency in the labeling style of tables: Roman numerals in the text versus Arabic numerals in the figure label. It would be helpful to choose one style and be consistent throughout the manuscript so that the reader can better follow the results.

  • In the Results section, under the qualitative analysis section, the sentence “Table 3 shows the 8 codes and their descriptions.” should refer to Table 4 rather than Table 3.

  • Figure 1 is a bit hard to read and understand. A bigger font and an explanation of what is plotted in the figure legend would significantly enhance comprehension.

  • In the second paragraph on page 6 the abbreviation EMS is first mentioned, and it should be spelled out there as Emergency Medical Services (EMS).

  • It was expected that the RAG-based approach would perform better than the approach based solely on the LLM. It would be helpful if the authors discussed the results in the context of these expectations, highlighting potential limitations of the study.

Concluding remarks

We thank the authors of the preprint for posting their work openly for feedback. We also thank all participants of the Live Review call for their time and for engaging in the lively discussion that generated this review.

Competing interests

The authors declare that they have no competing interests.